Closed - jeromelecoq closed this issue 1 year ago
0.0: check_unique_identifiers - 'NWBFile' object at location '/' Message: The identifier '758519303' is used across the .nwb files: ['sub-408021_ses-20180926T172917_behavior+image+ophys.nwb', 'sub-408021_ses-20180926T172917_behavior+ophys.nwb']. The identifier of any NWBFile should be a completely unique value - we recommend using uuid4 to achieve this. 6:03 https://pynwb.readthedocs.io/en/stable/tutorials/general/object_id.html This seems new to me? So we have the same identifier for 2 sessions, because we are doing this versioning thing (one version with and one without the stimulus). Do you have recommendations for a smart way to comply with this?
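The fix the inspector suggests is simple in principle: generate a fresh uuid4 per file. A minimal, generic sketch (not the dataset's actual conversion script):

```python
import uuid

# One freshly generated uuid4 string per NWB file, as the inspector
# recommends for NWBFile.identifier.
identifiers = [str(uuid.uuid4()) for _ in range(2)]

# uuid4 values are random 128-bit IDs, so collisions across files are
# effectively impossible - which is what check_unique_identifiers wants.
assert len(set(identifiers)) == len(identifiers)
```

In pynwb, each such string would be passed as the `identifier` argument when constructing the `NWBFile`.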
Colleen 4:09 AM Side note: I have a Master's student working with me in my new lab, and I proposed that we run analyses on a Dandi dataset :slightly_smiling_face: We have encountered this same difficulty, where each file in our dataset of interest contains the full imaging stack, accounting for 28 out of 29 GB each time. We are hoping to use the data locally, not just through streaming. To enable this without having to download 2 TB of data, I wrote some code that sequentially streams each file, pops out the acquisition stack, and saves just the remainder locally (generally < 1 GB per file). It's not the fastest, but it works. So, just based on my experience, I would say that support for partial downloading or multiple versions may prove useful for a lot of users! This is why the solution I originally settled on (multiple versions of the same session) is still important to me. However, I am obviously no expert, so your thoughts or advice on this as a general principle are very welcome!
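Colleen's strip-the-stack approach might look something like the following h5py sketch. This is a hypothetical reconstruction (the function name and file-layout assumptions are mine, not her actual script), and it omits the streaming-from-DANDI step she mentions:

```python
import h5py


def strip_acquisition(src_path, dst_path):
    """Copy an NWB (HDF5) file, skipping the bulky /acquisition contents.

    A minimal sketch of the approach described above: keep everything
    except the raw imaging stack so the local copy stays small.
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Preserve root attributes (NWB namespace/version info lives here).
        for key, value in src.attrs.items():
            dst.attrs[key] = value
        for name in src:
            if name == "acquisition":
                # Recreate the group empty so the file stays structurally valid.
                dst.create_group("acquisition")
            else:
                # h5py's Group.copy recursively copies groups and datasets.
                src.copy(name, dst, name=name)
```

Whether an empty `/acquisition` group is acceptable to downstream readers is itself an assumption worth checking against the NWB schema.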
Colleen 8:13 AM So, the student is just following my lead. For my part, I still haven't gotten to the point in my workflows where I expect or count on continuous, uninterrupted internet connection. I think you're right, it's a good idea to migrate to remote, container-based work, but - I'm just a bit of a slow adopter! But I'm trying to get in the habit of trying new things earlier! 8:14 And I will definitely keep you apprised! I mention Dandi widely to people. I think it is very cool that the recent Science paper by Jeong and colleagues that has been getting a lot of buzz has its dataset on Dandi! Just saw that today :slightly_smiling_face:
satra 8:19 AM please let us know anything you run into. we want to make this for folks like you.
Colleen 8:22 AM So for the identifier, responding to @Ben Dichter . I see - so when we originally created the dataset, we didn't think about identifier vs session_id. I see the logic. I would have to add session_id to both sets of files, and @Jerome Lecoq would have to repush all of them. At this point, I think it's getting a bit out of hand in terms of repeated work and workload. I would prefer to take a bit of a shortcut, omit the session_id tag, and just tag these files as {session_id}_no_stim . I think it might be the easiest way, as both Jerome and I are, I think, very keen to be all done with this dataset. (edited) 8:24 @satra Jerome is actually doing all the data pushing, as I don't have the capabilities to handle that much data. I'm not sure if it's a partial upload problem. I suspect not, but we can check, @Jerome Lecoq?
Jerome Lecoq 10:08 AM mmm so I checked the file, and there is a .identifier and a .object_id 10:08 the .identifier is the one that is using our internal "session id" coming from the institute database and is uniquely associated with the experimental day 10:08 just checking we are using the right terminology 10:09 .object_id looks to be randomly generated in our files 10:09 The error seems to suggest that .identifier needs to be different for every file? (edited) 10:10 OR, are you saying we should move the content of our current .identifier to session_id and give a random string to .identifier? But then, what is object_id used for? (edited)
Ben Dichter 10:35 AM object_id is generated by PyNWB for every neurodata object, including NWBFile. You don't have the option to set it. You do have the option to set identifier and session_id. identifier should be unique for each file, but unlike object_id it can be set manually to match some unique ID in your lab's internal database. session_id should be the unique identifier of the session, and should be the same across NWB files from the same session. I'll grant you that there is a bit of redundancy between identifier and object_id. If object_id were manually settable, you could use it for both purposes, but we don't allow that because we internally rely on object_id being unique. We use its uniqueness in e.g. the external resources extension, where we use it to address unique neurodata objects in the NWB file. See the best practices on session_id and identifier here. object_id is not discussed, since it is not settable by the user anyway, though I could see that mentioning it there might be useful to avoid confusion. (edited)
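Ben's scheme in miniature: session_id shared across the two variants of a session, identifier unique per file, and object_id left to PyNWB. A plain-Python illustration (not pynwb code; the metadata dicts are hypothetical):

```python
import uuid

# The lab's internal session ID from the thread above; under Ben's scheme
# it belongs in session_id, not in identifier.
session_id = "758519303"

# Two NWB files for the same session (one with stim, one without):
# session_id ties them together, identifier distinguishes them.
files = [
    {"identifier": str(uuid.uuid4()), "session_id": session_id},  # with stim
    {"identifier": str(uuid.uuid4()), "session_id": session_id},  # no stim
]

assert files[0]["identifier"] != files[1]["identifier"]
assert files[0]["session_id"] == files[1]["session_id"]
# (object_id is generated per object by PyNWB and is not user-settable.)
```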
Jerome Lecoq 10:40 AM You meant "but we do not because we internally rely on object_id being unique"?
Jerome Lecoq 10:41 AM If we are to follow this to the letter, it means updating old files, if I understand correctly
Colleen 10:41 AM So Jerome - what do you think - if I write an amendment script, are you willing to reupload all the files? Another question is - can I amend these things in place or will I break the files?
Jerome Lecoq 10:42 AM I mean, we could download the entire dataset, fix the files and re-upload? 10:42 will that work?
Colleen 10:43 AM well, we would want to use the ones you have locally, to be sure - cause we have a mix of old and new on Dandi -
Jerome Lecoq 10:43 AM ah yes 10:43 lol
Colleen 10:43 AM it's a bit of a mess, but I'm trying to hold onto the details in my head...!
Jerome Lecoq 10:43 AM double brain bow
Colleen 10:44 AM ok - I think we do that? @Ben Dichter any chance I can amend the object ID in the files? (edited)
Jerome Lecoq 10:44 AM perhaps we should just fix the new files? 10:45 does your analysis code use this?
Ben Dichter 10:47 AM I would do it with h5py, not pynwb. It should only be a few lines.
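A sketch of those few lines, assuming the NWB 2.x HDF5 layout (an /identifier dataset at the root, session_id under /general). The function name is hypothetical, and the layout assumption is worth verifying against the actual files before running this on anything precious:

```python
import uuid

import h5py


def amend_identifier(path):
    """Move the old identifier into /general/session_id and replace
    /identifier with a fresh uuid4 string.

    A sketch of the amendment suggested above, done with raw h5py
    rather than pynwb.
    """
    with h5py.File(path, "r+") as f:
        old = f["identifier"][()]
        if isinstance(old, bytes):
            old = old.decode()
        # Variable-length string datasets can't be rewritten in place
        # reliably, so delete and recreate.
        del f["identifier"]
        f.create_dataset("identifier", data=str(uuid.uuid4()))
        general = f.require_group("general")
        if "session_id" not in general:
            general.create_dataset("session_id", data=old)
```

Note this edits files in place, so it should be run on local copies, which also sidesteps the amend-in-place-without-breaking-files worry: only two small string datasets are touched.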
Jerome Lecoq 10:51 AM on it
Colleen 10:52 AM My analysis code barely uses it - very easy to handle 10:52 Shall I write a little script, Jerome? 10:54 I would set session_id for all the files, and the identifier as... {session_id} for the small version and {session_id}_with_stim for the version with stim. 10:55 I prefer it that way, as the small versions are sort of the default/most useful.
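Colleen's proposed naming can be captured in a one-line helper (the function name is hypothetical; note this departs from the uuid4 best practice in favor of human-readable identifiers):

```python
def make_identifier(session_id, with_stim=False):
    """Build a file identifier under the scheme proposed above: the plain
    session_id for the small (default) version, and session_id plus a
    "_with_stim" suffix for the version containing the stimulus data."""
    return f"{session_id}_with_stim" if with_stim else session_id


# e.g. make_identifier("758519303") -> "758519303"
#      make_identifier("758519303", with_stim=True) -> "758519303_with_stim"
```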
Colleen 11:13 AM I've almost got a script - testing it.
This works with dandi organize, though it does not add a _raw suffix to the file names of sessions with raw data. Yarik from the dandi team suggested using a processing label within the file to indicate that, so I am looking into it.
There may be a way to use this to handle #103
@Ahad-Allen will create an issue for this
Tracked with an issue at https://github.com/dandi/dandi-cli/issues/1235
Yarik from the dandi team responded to the issue and seemed to be totally open to adding the acq label from BIDS into the NWB naming convention. This label is intended for custom identifying information between sessions of the same modality, and should be sufficient to add experiment IDs and _raw to file names once it is done.
Reached out to Yarik about a timeline for adding the acq label.
Yarik responded and it seems like there is a way to use the acq label to add the info we need! I will work on this now.
@Ahad-Allen to add a PR to this
@Ahad-Allen will check that files are not corrupted. @rcpeene had some importing issues.
@rcpeene mentioned that the importing issue seems to be gone.
In discussing uploading variants of NWB files, we encountered an issue with identifier duplication. This is described in more depth here: https://nwbinspector.readthedocs.io/en/dev/best_practices/nwbfile_metadata.html#file-identifiers