dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

simpler short-term BIDS metadata extraction #772

Open · satra opened this issue 3 years ago

satra commented 3 years ago

@yarikoptic - i wanted to bring this up to ensure that we can do a bit more with bids metadata extraction before full-fledged bids support (#432).

can we simply extract the metadata from a file path when possible (identifiers for subject, session, sample, modality-via-folder, technique-via-suffix, when available)?

@djarecka - is this something you could take a quick look at? dandisets 000108 and 000026 should provide a fair number of examples, and you can play with the path names of files using the API. this should return some structured metadata for the non-NWB part of the processing of a file. it should look like the session metadata and the subject metadata info we extract from NWB.
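
A minimal sketch of what such path-based extraction could look like (illustrative names only, not dandi-cli code):

```python
import re

# Hypothetical helper: pull BIDS-style key-value entities out of a relative
# file path such as "sub-01/ses-a/microscopy/sub-01_ses-a_sample-x_SPIM.h5".
ENTITY_RE = re.compile(r"(?P<key>[a-zA-Z]+)-(?P<value>[a-zA-Z0-9]+)")

def extract_bids_entities(path: str) -> dict:
    filename = path.split("/")[-1]
    stem = filename.split(".")[0]
    parts = stem.split("_")
    entities = {}
    for part in parts[:-1]:  # the last chunk is the suffix, not a key-value pair
        m = ENTITY_RE.fullmatch(part)
        if m:
            entities[m.group("key")] = m.group("value")
    entities["suffix"] = parts[-1]
    folders = path.split("/")[:-1]
    if folders:  # modality comes from the datatype folder, when present
        entities["datatype"] = folders[-1]
    return entities

print(extract_bids_entities(
    "sub-01/ses-a/microscopy/sub-01_ses-a_sample-x_SPIM.h5"))
# -> {'sub': '01', 'ses': 'a', 'sample': 'x', 'suffix': 'SPIM',
#     'datatype': 'microscopy'}
```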

djarecka commented 3 years ago

@TheChymera - ping

djarecka commented 3 years ago

@satra - is it ok to add pybids as a new requirement?

satra commented 3 years ago

i would stay away from that for now, mostly because it's a very heavy requirement. since you are parsing the filepath of a single file at the moment, a simple function + a dictionary for modality/technique mapping should work for now.

but eventually we will probably need pybids as an optional dependency.
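
The "simple function + dictionary" approach could look roughly like this, taking an entities dict like the one in the sketch above (the term mappings below are made up for illustration; real vocabulary would have to come from dandischema / the BIDS spec):

```python
# Illustrative mappings only - the actual modality/technique terms
# would need to come from dandischema, not this sketch.
DATATYPE_TO_MODALITY = {
    "anat": "Anatomical MRI",
    "func": "Functional MRI",
    "microscopy": "Microscopy",
}
SUFFIX_TO_TECHNIQUE = {
    "bold": "BOLD fMRI",
    "SPIM": "Selective plane illumination microscopy",
}

def modality_and_technique(entities: dict) -> dict:
    """Map a parsed datatype folder and suffix to modality/technique labels."""
    return {
        "modality": DATATYPE_TO_MODALITY.get(entities.get("datatype")),
        "technique": SUFFIX_TO_TECHNIQUE.get(entities.get("suffix")),
    }
```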

djarecka commented 3 years ago

@satra - ok, so I understand that I could add specific parsing and assume that this will not change much with new BIDS versions etc. (basically ignoring the version info)

satra commented 3 years ago

a few more rules: if it detects a bids dataset (i.e. a dataset_description.json file) and a sub- prefix in the filename, then:

so overall things to extract and inject into metadata are:

we should perhaps start subclassing assets in dandischema soonish.

djarecka commented 3 years ago

@satra - I talked to @TheChymera and he will work on extracting this info from the BIDS files, but it is indeed not clear to me where we will add the info. Should we create a new schema first?

satra commented 3 years ago

the current asset schema handles all of these fields.

here is an example nwb asset metadata: https://api.dandiarchive.org/api/dandisets/000008/versions/draft/assets/d387991e-f9c5-4790-9152-75885ddf5572/
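
For reference, the structure of that document can be inspected directly (a sketch, assuming the endpoint serves the metadata as JSON):

```python
import requests

# Fetch the example asset metadata cited above and list its top-level keys;
# path-derived BIDS metadata would need to populate the same fields.
url = ("https://api.dandiarchive.org/api/dandisets/000008/versions/draft/"
       "assets/d387991e-f9c5-4790-9152-75885ddf5572/")
metadata = requests.get(url, timeout=30).json()
print(sorted(metadata))
```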

yarikoptic commented 3 years ago

a few more rules: if it detects a bids dataset (i.e. a dataset_description.json file) and a sub- prefix in the filename, then:

I would say: detect the presence of a dataset_description.json file containing a BIDSVersion field.
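
That check could be as small as walking up the directory tree (a sketch, not dandi-cli code):

```python
import json
from pathlib import Path
from typing import Optional

def find_bids_root(path: Path) -> Optional[Path]:
    """Return the first parent directory holding a dataset_description.json
    with a BIDSVersion field, or None if there is none."""
    for parent in path.resolve().parents:
        description = parent / "dataset_description.json"
        if description.is_file():
            try:
                if "BIDSVersion" in json.loads(description.read_text()):
                    return parent
            except json.JSONDecodeError:
                pass  # malformed file: keep looking further up
    return None
```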

but eventually we will probably need pybids as an optional dependency.

most likely only if it gets lighter -- right now it has a pandas dependency which brings in a lot... there are discussions about some pybids-lite or the like. Also, the "schema" component of the BIDS specification might get its own library to load/parse/validate it (relevant discussion: https://github.com/bids-standard/bids-specification/issues/543), although that might indeed just get implemented within pybids.

djarecka commented 3 years ago

@TheChymera - are you still working on this?

TheChymera commented 3 years ago

@djarecka hi, sorry, no, I got stuck trying to get some sort of MWE (or already working example) to extract metadata from, and derive from there. I think I only just realized, while working on some ephys data, that conversion to NWB is different from DANDI metadata extraction (or at least it's provided by another package there). If you could share some sort of example of what we're working on, I could continue from there. Are we working on dandi register /path/to/bids or dandi validate /path/to/bids, or something else?

TheChymera commented 2 years ago

Currently building an XS regex-based BIDS validator as part of the new BIDS schemacode package. The schema directory is parameterized, so it can be pointed at our own version if we have nonstandard datasets (e.g. DANDI:000108).

https://github.com/bids-standard/bids-specification/issues/543#issuecomment-1007683914
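
The gist of the approach, reduced to a toy example (this is not the schemacode API; the pattern, suffixes, and extensions are illustrative): datatype rules become regexes generated from schema data, so a DANDI-flavored schema directory could emit "microscopy" where upstream BIDS says "micr".

```python
import re

# Toy stand-in for a schema-generated rule: the datatype directory name is
# a parameter, so swapping the schema data swaps what validates.
def microscopy_pattern(datatype: str = "micr") -> re.Pattern:
    return re.compile(
        rf"sub-[a-zA-Z0-9]+/(ses-[a-zA-Z0-9]+/)?{datatype}/"
        rf"sub-[a-zA-Z0-9]+(_ses-[a-zA-Z0-9]+)?(_sample-[a-zA-Z0-9]+)?"
        rf"_(SPIM|TEM|SEM)\.(ome\.tif|png|h5)"
    )

dandi_flavor = microscopy_pattern("microscopy")
print(bool(dandi_flavor.fullmatch(
    "sub-01/ses-a/microscopy/sub-01_ses-a_sample-x_SPIM.h5")))  # True
```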

@satra do you know whether our 000108 dataset follows a schema version which they use, or is it that:

  1. the naming is ad-hoc, and
  2. we know that the dataset is valid,

so we should just edit the schema to fit the dataset?

There is a microscopy PR in BIDS, which looks like it might be merged soon, but our dataset does not, in fact, use it.

@djarecka

satra commented 2 years ago

it should technically correspond to that PR except that it will have ngff files. everything else should be aligned with that PR though. filenames should not be ad hoc even now, except for the h5 extension.

TheChymera commented 2 years ago

@satra sadly it does not; currently nothing validates, as the datatype is listed as micr by the PR, whereas all files use a microscopy directory for the modalities. There may be other issues as well, but for the time being it seems we would need a divergent schema, unless we want to fix the data.

Or we could also comment on the extension PR, but micr seems to make a lot more sense, analogously to anat and func.

satra commented 2 years ago

that should be fixed (but that PR only recently changed from microscopy to micr). i'm pretty sure @LeeKamentsky was going to change it, but i would suggest waiting till we get zarr upload into the CLI, and then rewriting that dataset with the ngff files, which themselves likely need to be updated to the upcoming spec. it's a large dataset and we can wait a bit to do the fixes together.

we will also need additional dandi specific overrides. ngff is not directly supported by the PR as well since that spec is evolving, but we will support it in dandi.

TheChymera commented 2 years ago

@satra it turns out there are quite a few more inconsistencies between BEP031 and DANDI:000108. These are the ones I've picked up on thus far, but there might be even more, particularly with regard to entity order and suffix case: https://github.com/TheChymera/bids-schemadata/commits/dandi

yarikoptic commented 2 years ago

I think some should be fixed (renamed) in the dandiset itself, e.g.

So I think the addition of .h5 should be the only customization for us to carry. @satra, agree?

We will need to coordinate with @LeeKamentsky on renaming: do you have an entire copy of the dandiset with data as it was uploaded?

satra commented 2 years ago

this will be reuploaded as ngff (see https://github.com/dandi/dandi-cli/issues/772#issuecomment-1008487332), so the customization should be for ngff, not .h5. btw, the bids specification doesn't say anything about case sensitivity of the suffix (and there doesn't seem to be consistency between lower and upper case - bold is lowercase even though it is also an acronym). my preference would be to not introduce case sensitivity.

LeeKamentsky commented 2 years ago

I've got both the NGFF and HDF5 stored locally. I can reorganize the file structure (microscopy -> micr) and rewrite both the NGFF and sidecar metadata as part of the update procedure when we get to the point where we switch from HDF5 to NGFF.

yarikoptic commented 2 years ago

That would be great, @LeeKamentsky! So for now we can validate using the "adjusted schema", and then, before the ngff upload, switch to a schema whose only modification is to allow .ngff.

TheChymera commented 2 years ago

@satra regarding case: going by general unix case handling, the case gets parsed as-written from the YAML (both for the online listing currently done in bids-specification and for the validator I'm writing). This means that for it to be case-insensitive, both upper- and lower-case strings would need to be listed. I see no suffix doing this, so if you want it implemented as case-insensitive it would have to be done at the validator level, but that would be nonstandard behaviour, so I'm not sure it's a good idea...
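
In other words, the two options on the table, as a sketch (VFA is only a stand-in suffix here): enumerate both casings in the schema data, or relax matching in the validator.

```python
import re

# Option 1: list both spellings explicitly in the schema-derived pattern.
explicit = re.compile(r"_(VFA|vfa)\.json$")
# Option 2: validator-level fuzziness via IGNORECASE - nonstandard behaviour.
fuzzy = re.compile(r"_vfa\.json$", re.IGNORECASE)

name = "sub-1_VFA.json"
print(bool(explicit.search(name)), bool(fuzzy.search(name)))  # True True
```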

The other thing I can do is write a comment on the BEP asking for it to be made lower-case, which sounds most consistent with BIDS, and include the change ahead of time in our bids-schemadata:DANDI branch. In fact, I think I might do that in any case.

TheChymera commented 2 years ago

Let's see what they think about this: https://github.com/bids-standard/bids-specification/pull/881#issuecomment-1013702512

TheChymera commented 2 years ago

@satra ok, so it appears there's no good reason to make it lowercase, since there are in fact quite a few uppercase suffixes. I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

TheChymera commented 2 years ago

Also, apparently README and CHANGES are required as per BIDS :( I really don't see the point of CHANGES, but that seems to be the situation. Other than these two files (and sessions.tsv), DANDI:000108 seems to validate: https://ppb.chymera.eu/1d5ef8.log

satra commented 2 years ago

CHANGES is indeed a little weird, especially if there are no changes to start with. but we should be supportive of whatever the bids spec entails. could you also test the validator on 000026 (it has a lot of files)?

satra commented 2 years ago

I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

yes. please create the validator with that in mind + support for ngff directories. please let @LeeKamentsky know how best to test the dataset with the validator, perhaps before it gets into dandi cli (or if that is imminent, then we can wait to just execute the validate function).

TheChymera commented 2 years ago

test the validator on 000026

@satra oh my, there are many (extra) divergences from the standard here as well. The one we most need feedback on is the _fa-<index> entity. The dataset seems to use it interchangeably with _flip-<index>, but it could also be something else entirely. See https://github.com/TheChymera/bids-schemadata/commit/bebd54f4d85474971e78323492990b491a38defc for the changes required to make it pass. Detrimentally to standard integrity, flip needs to be downgraded to optional as well... Are you (we?) in charge of updating the data now, or should I be contacting @gmazzamuto directly?

TheChymera commented 2 years ago

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.
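
A quick demonstration of why a strict entity regex catches this (the pattern below is illustrative, not the actual validator rule): labels are alphanumeric and entities are joined by underscores, so the stray dash makes the parse fail.

```python
import re

# Illustrative filename rule: sub and ses entities, optional echo/flip,
# VFA suffix, as in the example above.
pattern = re.compile(
    r"sub-[a-zA-Z0-9]+_ses-[a-zA-Z0-9]+"
    r"(_echo-[0-9]+)?(_flip-[0-9]+)?_VFA\.json"
)

bad = "sub-I46_ses-MRI-echo-2_flip-2_VFA.json"   # dash after ses-MRI
good = "sub-I46_ses-MRI_echo-2_flip-2_VFA.json"  # corrected
print(bool(pattern.fullmatch(bad)), bool(pattern.fullmatch(good)))  # False True
```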

TheChymera commented 2 years ago

Additionally, there seems to be some microscopy data which does not use the microscopy/ datatype subdirectory, e.g. /sub-I46/ses-SPIM/sub-I46_ses-SPIM_sample-*.

satra commented 2 years ago

@TheChymera - we should give the contact person a heads-up, but i would suggest first making sure that the validator is integrated so that they can check. one thing to note is that each site has a different subset of the data: no site has all the data, and it also doesn't make sense to force people to download all of it. so that may be something additional to look into for the validator.

yarikoptic commented 2 years ago

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.

that just shows the power of standards and validators -- the ability to detect such typos ;) yeah, it needs to be fixed in the dataset itself.

LeeKamentsky commented 2 years ago

I can remove sub-MITU01_sessions.tsv - I guess I missed the point where that file disappeared from the spec. I'll add a README and see if I can reconstruct and backfill the CHANGES file.

satra commented 2 years ago

@LeeKamentsky - it hasn't disappeared. it's just an optional file: https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#sessions-file (i wouldn't remove it).

TheChymera commented 2 years ago

I can remove sub-MITU01_sessions.tsv

@LeeKamentsky I don't think it has disappeared from the spec, just that it's not represented in the YAML schema data which we use for the validator. So those files are not the issue. But the underscore typo and the missing microscopy/ subdirectory should still be fixed at some point.

gmazzamuto commented 2 years ago

Hello everyone!

I had a nice chat with @TheChymera yesterday; we'll fix the ses-SPIM directories as soon as we have some time. For the other image modalities I think you can contact

TheChymera commented 2 years ago

@satra do we have individual issue trackers for each dataset? I remember I saw something like that once, but can't find it again. It might be better to discuss updates there since there seem to be a lot of different things going on in this thread. Ideally we could discuss:

  1. Things which should be changed in DANDI:000108
  2. Things which should be changed in DANDI:000026
  3. Things which we fix in the schema, and keep fixing, until our draft fixes are obsoleted by vanilla BIDS.

separately.

satra commented 2 years ago

you can do it in the github.com/dandisets org, where each dandiset has its own repo.

TheChymera commented 2 years ago

@gmazzamuto can you send me the contact details via email, or ping the github accounts (I searched but couldn't find them)?

TheChymera commented 2 years ago

Issues with the datasets are now tracked here: https://github.com/dandisets/000108/issues and https://github.com/dandisets/000026/issues

Thus the only relevant schema changes remain:

TheChymera commented 2 years ago

Ok, so we have a new schema version containing only changes which enhance, but do not diverge from, BIDS. Almost all files in both datasets fail with this schema, pending the issues reported against them: https://github.com/TheChymera/bids-schemadata/tree/dandi_strict