Closed - jeromelecoq closed this issue 1 year ago
0.0: check_unique_identifiers - 'NWBFile' object at location '/' Message: The identifier '758519303' is used across the .nwb files: ['sub-408021_ses-20180926T172917_behavior+image+ophys.nwb', 'sub-408021_ses-20180926T172917_behavior+ophys.nwb']. The identifier of any NWBFile should be a completely unique value - we recommend using uuid4 to achieve this. 6:03 https://pynwb.readthedocs.io/en/stable/tutorials/general/object_id.html This seems new to me? So we have the same identifier for 2 sessions, because we are doing this versioning thing (one version with and one without the stimulus). Do you have recommendations for a smart way to comply with this?
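The fix the inspector suggests is simple in principle: generate a fresh uuid4 per file. A minimal, generic sketch (not the dataset's actual conversion script):

```python
import uuid

# One freshly generated uuid4 string per NWB file, as the inspector
# recommends for NWBFile.identifier.
identifiers = [str(uuid.uuid4()) for _ in range(2)]

# uuid4 values are random 128-bit IDs, so collisions across files are
# effectively impossible - which is what check_unique_identifiers wants.
assert len(set(identifiers)) == len(identifiers)
```

In pynwb, each such string would be passed as the `identifier` argument when constructing the `NWBFile`.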
Colleen 4:09 AM Side note: I have a Master's student working with me in my new lab, and I proposed that we run analyses on a Dandi dataset :slightly_smiling_face: We have encountered this same difficulty, where each file in our dataset of interest contains the full imaging stack, accounting for 28 out of 29 GB each time. We are hoping to use the data locally, not just through streaming. To enable this without having to download 2 TB of data, I wrote some code that sequentially streams each file, pops out the acquisition stack, and saves just the remainder locally (generally < 1 GB per file). It's not the fastest, but it works. So, just based on my experience, I would say that support for partial downloading or multiple versions may prove useful for a lot of users! This is why the solution I originally settled on (multiple versions of the same session) is still important to me. However, I am obviously no expert, so your thoughts or advice on this as a general principle are very welcome!
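Colleen's strip-the-stack approach might look something like the following h5py sketch. This is a hypothetical reconstruction (the function name and file-layout assumptions are mine, not her actual script), and it omits the streaming-from-DANDI step she mentions:

```python
import h5py


def strip_acquisition(src_path, dst_path):
    """Copy an NWB (HDF5) file, skipping the bulky /acquisition contents.

    A minimal sketch of the approach described above: keep everything
    except the raw imaging stack so the local copy stays small.
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Preserve root attributes (NWB namespace/version info lives here).
        for key, value in src.attrs.items():
            dst.attrs[key] = value
        for name in src:
            if name == "acquisition":
                # Recreate the group empty so the file stays structurally valid.
                dst.create_group("acquisition")
            else:
                # h5py's Group.copy recursively copies groups and datasets.
                src.copy(name, dst, name=name)
```

Whether an empty `/acquisition` group is acceptable to downstream readers is itself an assumption worth checking against the NWB schema.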
Colleen 8:13 AM So, the student is just following my lead. For my part, I still haven't gotten to the point in my workflows where I expect or count on continuous, uninterrupted internet connection. I think you're right, it's a good idea to migrate to remote, container-based work, but - I'm just a bit of a slow adopter! But I'm trying to get in the habit of trying new things earlier! 8:14 And I will definitely keep you apprised! I mention Dandi widely to people. I think it is very cool that the recent Science paper by Jeong and colleagues that has been getting a lot of buzz has its dataset on Dandi! Just saw that today :slightly_smiling_face:
satra 8:19 AM please let us know anything you run into. we want to make this for folks like you.
Colleen 8:22 AM So for the identifier, responding to @Ben Dichter . I see - so when we originally created the dataset, we didn't think about identifier vs session_id. I see the logic. I would have to add session_id to both sets of files, and @Jerome Lecoq would have to repush all of them. At this point, I think it's getting a bit out of hand in terms of repeated work and workload. I would prefer to take a bit of a shortcut, omit the session_id tag, and just tag these files as {session_id}_no_stim . I think it might be the easiest way, as both Jerome and I are, I think, very keen to be all done with this dataset. (edited) 8:24 @satra Jerome is actually doing all the data pushing, as I don't have the capabilities to handle that much data. I'm not sure if it's a partial upload problem. I suspect not, but we can check, @Jerome Lecoq?
Jerome Lecoq 10:08 AM mmm so I checked the file, and there is a .identifier and a .object_id 10:08 the .identifier is the one that is using our internal "session id" coming from the institute database and is uniquely associated with the experimental day 10:08 just checking we are using the right terminology 10:09 .object_id looks to be randomly generated in our files 10:09 The error seems to suggest that .identifier needs to be different for every file? (edited) 10:10 OR, are you saying we should move the content of our current .identifier to session_id and give a random string to .identifier? But then, what is object_id used for? (edited)
Ben Dichter 10:35 AM object_id is generated by PyNWB for every neurodata object, including NWBFile. You don't have the option to set it. You do have the option to set identifier and session_id. identifier should be unique for each file, but unlike object_id it can be set manually to match some unique ID in your lab's internal database. session_id should be the unique identifier of the session, and should be the same across NWB files from the same session. I'll grant you that there is a bit of redundancy between identifier and object_id. If object_id were manually settable, you could use it for both purposes, but we don't allow that because we internally rely on object_id being unique. We use its uniqueness in e.g. the external resources extension, where we use it to address unique neurodata objects in the NWB file. See the best practices on session_id and identifier here. object_id is not discussed, since it is not settable by the user anyway, though I could see that mentioning it there might be useful to avoid confusion. (edited)
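Ben's scheme in miniature: session_id shared across the two variants of a session, identifier unique per file, and object_id left to PyNWB. A plain-Python illustration (not pynwb code; the metadata dicts are hypothetical):

```python
import uuid

# The lab's internal session ID from the thread above; under Ben's scheme
# it belongs in session_id, not in identifier.
session_id = "758519303"

# Two NWB files for the same session (one with stim, one without):
# session_id ties them together, identifier distinguishes them.
files = [
    {"identifier": str(uuid.uuid4()), "session_id": session_id},  # with stim
    {"identifier": str(uuid.uuid4()), "session_id": session_id},  # no stim
]

assert files[0]["identifier"] != files[1]["identifier"]
assert files[0]["session_id"] == files[1]["session_id"]
# (object_id is generated per object by PyNWB and is not user-settable.)
```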
Jerome Lecoq 10:40 AM You meant "but we do not because we internally rely on object_id being unique"?
Jerome Lecoq 10:41 AM If we are to follow this to the letter, it means updating old files, if I understand correctly
Colleen 10:41 AM So Jerome - what do you think - if I write an amendment script, are you willing to reupload all the files? Another question is - can I amend these things in place or will I break the files?
Jerome Lecoq 10:42 AM I mean, we could download the entire dataset, fix the files and re-upload? 10:42 will that work?
Colleen 10:43 AM well, we would want to use the ones you have locally, to be sure - cause we have a mix of old and new on Dandi -
Jerome Lecoq 10:43 AM ah yes 10:43 lol
Colleen 10:43 AM it's a bit of a mess, but I'm trying to hold onto the details in my head...!
Jerome Lecoq 10:43 AM double brain bow
Colleen 10:44 AM ok - I think we do that? @Ben Dichter any chance I can amend the object ID in the files? (edited)
Jerome Lecoq 10:44 AM perhaps we should just fix the new files? 10:45 does your analysis code use this?
Ben Dichter 10:47 AM I would do it with h5py, not pynwb. It should only be a few lines.
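A sketch of those few lines, assuming the NWB 2.x HDF5 layout (an /identifier dataset at the root, session_id under /general). The function name is hypothetical, and the layout assumption is worth verifying against the actual files before running this on anything precious:

```python
import uuid

import h5py


def amend_identifier(path):
    """Move the old identifier into /general/session_id and replace
    /identifier with a fresh uuid4 string.

    A sketch of the amendment suggested above, done with raw h5py
    rather than pynwb.
    """
    with h5py.File(path, "r+") as f:
        old = f["identifier"][()]
        if isinstance(old, bytes):
            old = old.decode()
        # Variable-length string datasets can't be rewritten in place
        # reliably, so delete and recreate.
        del f["identifier"]
        f.create_dataset("identifier", data=str(uuid.uuid4()))
        general = f.require_group("general")
        if "session_id" not in general:
            general.create_dataset("session_id", data=old)
```

Note this edits files in place, so it should be run on local copies, which also sidesteps the amend-in-place-without-breaking-files worry: only two small string datasets are touched.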
Jerome Lecoq 10:51 AM on it
Colleen 10:52 AM My analysis code barely uses it - very easy to handle 10:52 Shall I write a little script, Jerome? 10:54 I would set session_id for all the files, and the identifier as... {session_id} for the small version and {session_id}_with_stim for the version with stim. 10:55 I prefer it that way, as the small versions are sort of the default/most useful.
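Colleen's proposed naming can be captured in a one-line helper (the function name is hypothetical; note this departs from the uuid4 best practice in favor of human-readable identifiers):

```python
def make_identifier(session_id, with_stim=False):
    """Build a file identifier under the scheme proposed above: the plain
    session_id for the small (default) version, and session_id plus a
    "_with_stim" suffix for the version containing the stimulus data."""
    return f"{session_id}_with_stim" if with_stim else session_id


# e.g. make_identifier("758519303") -> "758519303"
#      make_identifier("758519303", with_stim=True) -> "758519303_with_stim"
```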
Colleen 11:13 AM I've almost got a script - testing it.
This works with dandi organize, though it does not add a _raw suffix to the file names of sessions with raw data. Yarik from the dandi team suggested using a processing label within the file to indicate that, so I am looking into it.
There may be a way to use this to handle #103
@Ahad-Allen will create an issue for this
Tracked with an issue at https://github.com/dandi/dandi-cli/issues/1235
Yarik from the dandi team responded to the issue and seemed to be totally open to adding the acq label from BIDS into the NWB naming convention. This label is intended for custom identifying information between sessions of the same modality, and should be sufficient to add experiment IDs and _raw to file names once it is done.
Reached out to Yarik about a timeline for adding the acq label.
Yarik responded and it seems like there is a way to use the acq label to add the info we need! I will work on this now.
@Ahad-Allen to add a PR to this
@Ahad-Allen will check that files are not corrupted. @rcpeene had some importing issues.
@rcpeene mentioned that the importing issue seems to be gone.
In discussing uploading variants of NWB files, we encountered an issue with identifier duplication. This is described in more depth here: https://nwbinspector.readthedocs.io/en/dev/best_practices/nwbfile_metadata.html#file-identifiers