satra opened 3 years ago
@TheChymera - ping
@satra - is it ok to add pybids as a new requirement?
i would stay away from that for the moment, mostly because it's a very heavy requirement. since you are parsing the filepath for a single file, a simple function + a dictionary for modality/technique mapping may work for now.
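A minimal sketch of that "simple function + dictionary" idea, assuming plain regex parsing of a single file path; the mapping entries and helper name are illustrative, not dandi-cli's actual code:

```python
import re

# hypothetical mapping from BIDS datatype directory to imaging technique;
# a real table would be curated, this one is just for shape
DATATYPE_TECHNIQUE_MAP = {
    "anat": "anatomical MRI",
    "func": "functional MRI",
    "micr": "microscopy",
}

def parse_bids_filepath(path):
    """Pull subject/session/sample identifiers and a technique out of one path."""
    meta = {}
    for key in ("sub", "ses", "sample"):
        m = re.search(rf"{key}-([0-9A-Za-z]+)", path)
        if m:
            meta[key] = m.group(1)
    m = re.search(r"/(anat|func|micr)/", path)
    if m:
        meta["technique"] = DATATYPE_TECHNIQUE_MAP[m.group(1)]
    return meta
```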
but eventually we may probably need pybids as an optional dependency.
@satra - ok, so I understand I could add specific parsing and assume that this will not change much with new versions etc. (basically ignoring the version info)
a few more rules: if it detects a bids dataset (i.e. `dataset_description.json`) and detects a `sub-` in filename, then:

- it should require a `participants.tsv` (normal bids doesn't require this)
- if it detects a `sample-` in filename, it should also require a `samples.tsv`

so overall things to extract and inject into metadata are: `age`, `sex`, `species`, and `tissue_type`.
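A hedged sketch of those companion-file rules; the helper name and return value are made up for illustration:

```python
import os

def missing_companion_files(dataset_root, filename):
    """Return required tabular files that are absent from the dataset root."""
    required = []
    if "sub-" in filename:
        required.append("participants.tsv")   # normal bids doesn't require this
    if "sample-" in filename:
        required.append("samples.tsv")
    return [f for f in required
            if not os.path.exists(os.path.join(dataset_root, f))]
```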
we should perhaps start subclassing assets in dandischema soonish.
@satra - I talked to @TheChymera and he will work on extracting this info from the BIDS files, but it is indeed not clear to me where we will add the info. Should we create a new schema first?
the current asset schema handles all of these fields.
here is an example nwb asset metadata: https://api.dandiarchive.org/api/dandisets/000008/versions/draft/assets/d387991e-f9c5-4790-9152-75885ddf5572/
> a few more rules: if it detects a bids dataset (i.e. `dataset_description.json`) and detects a `sub-` in filename, then:

I would say: detect the presence of a `dataset_description.json` file containing a `BIDSVersion` field.
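In code, that refinement could look something like this minimal sketch (the function name is an assumption):

```python
import json
from pathlib import Path

def is_bids_dataset(root):
    """A directory counts as BIDS only if dataset_description.json
    exists *and* carries a BIDSVersion field."""
    dd = Path(root) / "dataset_description.json"
    if not dd.exists():
        return False
    try:
        return "BIDSVersion" in json.loads(dd.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
```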
> but eventually we may probably need pybids as an optional dependency.

most likely only if it gets lighter -- right now it has a pandas dependency which brings in lots... there are discussions about some pybids-lite or alike. Also, the "schema" component of the BIDS specification might get its own library to load/parse/validate (relevant discussion: https://github.com/bids-standard/bids-specification/issues/543), although it might indeed just get implemented within pybids.
@TheChymera - are you still working on this?
@djarecka hi, sorry, no, I got stuck trying to get some sort of MWE (or already working example) to extract metadata from, and derive from there. I think I only just realized while working on some ephys data that conversion to NWB is different from DANDI metadata extraction (or at least it's provided by another package there). If you could share some sort of example of what we're working on, I could continue from there. Are we working on `dandi register /path/to/bids` or `dandi validate /path/to/bids`, or something else?
Currently building an XS regex-based BIDS validator as part of the new BIDS schemacode package. The schema directory is parameterized, so this can be pointed to our own version if we have nonstandard datasets (e.g. DANDI:000108).
https://github.com/bids-standard/bids-specification/issues/543#issuecomment-1007683914
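The rough shape of such a regex-based validator, with the schema directory as a parameter. The file layout (plain-text pattern files) is invented here for illustration -- the real schemacode package derives its patterns from the BIDS YAML schema:

```python
import re
from pathlib import Path

def load_patterns(schema_dir):
    """Collect one compiled regex per non-empty line of every pattern file."""
    patterns = []
    for f in sorted(Path(schema_dir).glob("*.regex")):   # hypothetical layout
        patterns += [re.compile(line)
                     for line in f.read_text().splitlines() if line]
    return patterns

def invalid_paths(paths, schema_dir):
    """A path is valid if any schema-derived pattern matches it in full."""
    patterns = load_patterns(schema_dir)
    return [p for p in paths
            if not any(rx.fullmatch(p) for rx in patterns)]
```

Because the schema directory is just an argument, pointing the same code at a DANDI-customized copy of the schema requires no code change.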
@satra do you know whether our 108 dataset has a schema version which they use, or is it ad hoc? There is a microscopy PR in BIDS, which looks like it might be merged soon, but our dataset does not, in fact, use it. So should we just edit the schema to fit the dataset?
@djarecka
it should technically correspond to that PR except that it will have ngff files. everything else should be aligned with that PR though. filenames should not be ad hoc even now, except for the h5 extension.
@satra sadly it does not; currently nothing validates, as the datatype is listed as `micr` by the PR, whereas all files use a `microscopy` directory for the modalities. There may be other issues as well, but for the time being it seems we would need a divergent schema, unless we want to fix the data. Or we could also comment on the extension PR, but `micr` seems to make a lot more sense, analogously to `anat` and `func`.
that should be fixed (but that PR only recently changed from `microscopy` to `micr`). i'm pretty sure @leekamentsky was going to change it, but i would suggest waiting till we get zarr upload into the CLI, and then rewrite that dataset with the ngff files, which themselves likely need to be updated to the upcoming spec. it's a large dataset and we can wait a bit to do fixes together.
we will also need additional dandi-specific overrides. ngff is not directly supported by the PR either, since that spec is evolving, but we will support it in dandi.
@satra it turns out there are quite a few more inconsistencies between BEP031 and DANDI:000108. These are the ones I've picked up on thus far, but there might be even more, particularly with regard to entity order and suffix case: https://github.com/TheChymera/bids-schemadata/commits/dandi

I think some should be fixed (renamed) in the dandiset itself, e.g.:

- `spim` -> `SPIM`
- `run`: although it sounds sensible to add/recommend to BEP, since it is always `_run-1`, I think we should rename that too
- `micr`, not `microscopy`

So I think only the addition of `.h5` should be the customization for us to have. @satra, agree? We will need to coordinate with @LeeKamentsky on renaming: do you have an entire copy of the dandiset with data as it was uploaded?
this will be reuploaded as `ngff` (see https://github.com/dandi/dandi-cli/issues/772#issuecomment-1008487332), so the customization should be for `ngff`, not `.h5`. btw, the bids specification doesn't say anything about case sensitivity of the suffix (and there doesn't seem to be consistency of lower and upper cases - `bold` is lower, and it is also an acronym). my preference would be to not introduce case sensitivity.
I've got both the NGFF and HDF5 stored locally. I can reorganize the file structure (microscopy -> micr) and rewrite both the NGFF and sidecar metadata as part of the update procedure when we get to the point where we switch from HDF5 to NGFF.
That would be great, @LeeKamentsky! So for now we can validate using the "adjusted schema", and then, before the upload of ngff, switch to the schema which has only the "allow `.ngff`" modification in it.
@satra regarding case, going by general unix case handling, the case gets parsed as-written from the YAML (both for the online listing currently done in bids-specification, and for the validator I'm writing). This means that for it to be case-insensitive, both upper- and lower-case strings would need to be listed. I see no suffix doing this, so if you want it implemented as case-insensitive it would have to be done at the validator level, but that would be nonstandard behaviour, so I'm not sure it's a good idea...
The other thing I can do is write a comment on the BEP for it to be made lower-case, which sounds most consistent with BIDS, and include the change ahead of time in our bids-schemadata:DANDI branch. In fact I think I might do it in any case.
Let's see what they think about this: https://github.com/bids-standard/bids-specification/pull/881#issuecomment-1013702512
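To make the trade-off concrete, this is roughly what "case-insensitive at the validator level" (mentioned above) would amount to -- recompiling the schema-derived patterns with `re.IGNORECASE`; the pattern below is a made-up example, not one from the schema:

```python
import re

strict = re.compile(r"sub-[0-9A-Za-z]+(_ses-[0-9A-Za-z]+)?.*_spim\.(h5|ngff)")
fuzzy = re.compile(strict.pattern, re.IGNORECASE)   # nonstandard fuzziness

assert not strict.fullmatch("sub-01_sample-A_SPIM.h5")   # strict: rejected
assert fuzzy.fullmatch("sub-01_sample-A_SPIM.h5")        # fuzzy: accepted
```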
@satra ok, so it appears there's no good reason to make it lowercase, since there are in fact quite a few uppercase suffixes. I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

Also, apparently `README` and `CHANGES` are required as per BIDS :( I really don't see the point of CHANGES, but that seems to be the situation. Other than these two files (and `sessions.tsv`), DANDI:000108 seems to validate: https://ppb.chymera.eu/1d5ef8.log
`CHANGES` is indeed a little weird, especially if there are no changes to start with. but we should be supportive of whatever the bids spec entails. could you also test the validator on 000026 (it has a lot of files)?
> I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?
yes. please create the validator with that in mind + support for ngff directories. please let @LeeKamentsky know how best to test the dataset with the validator, perhaps before it gets into dandi cli (or if that is imminent, then we can wait to just execute the validate function).
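One wrinkle with ngff support is that an `.ngff` asset is a directory rather than a file, so path collection has to stop at it instead of recursing into its internals. A sketch under that assumption (helper name and traversal strategy are made up):

```python
from pathlib import Path

def collect_assets(root):
    """List validatable assets, treating each .ngff directory as one asset."""
    assets = []
    for p in sorted(Path(root).rglob("*")):
        if any(anc.suffix == ".ngff" for anc in p.parents):
            continue                      # inside an .ngff directory; skip internals
        if p.is_dir() and p.suffix == ".ngff":
            assets.append(p)              # the directory itself is the asset
        elif p.is_file():
            assets.append(p)
    return assets
```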
> test the validator on 000026
@satra oh my, there are many (extra) standard divergences here as well. The one where I think we most need feedback is the `_fa-<index>` entity. It seems this is used interchangeably by the dataset with `_flip-<index>`, but it could also be something else entirely.

See https://github.com/TheChymera/bids-schemadata/commit/bebd54f4d85474971e78323492990b491a38defc for the changes required to make it pass. Detrimentally to standard integrity, `flip` needs to be downgraded to optional as well...

Are you (we?) in charge of updating the data now, or should I be contacting @gmazzamuto directly?
I have also just come across something I cannot really "fix" via schema divergence, namely `/sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json` (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.

Additionally, there seems to be some microscopy data which is not using the `microscopy/` datatype subdirectory, e.g. `/sub-I46/ses-SPIM/sub-I46_ses-SPIM_sample-*`.
@TheChymera - we should give the contact person a heads up, but i would suggest first making sure that the validator is integrated so that they can check. one thing to note is that each site has a different subset of the data. no site has all the data, and it also doesn't make sense to force people to download all the data. so that may be something additional to look into for the validator.
> I have also just come across something I cannot really "fix" via schema divergence, namely `/sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json` (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.
that just shows the power of standards and validators -- ability to detect presence of such typos ;) yeah, needs to be fixed in the dataset itself.
I can remove sub-MITU01_sessions.tsv - I guess I missed noticing the point where that file disappeared from the spec. I'll add a README and see if I can reconstruct and backfill the CHANGES file.
@LeeKamentsky - it hasn't disappeared. it's just an optional file: https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#sessions-file (i wouldn't remove it).
> I can remove sub-MITU01_sessions.tsv
@LeeKamentsky I don't think it has disappeared from the spec, just that it's not represented in the YAML schema data which we use for the validator. So those files are not the issue. But the typo with the underscore and microscopy subdirectory should be fixed at some point.
Hello everyone!
I had a nice chat with @TheChymera yesterday, we'll fix the ses-SPIM directories as soon as we have some time. For the other image modalities I think you can contact
@satra do we have individual issue trackers for each dataset? I remember I saw something like that once, but can't find it again. It might be better to discuss updates there, since there seem to be a lot of different things going on in this thread. Ideally we could discuss:

- DANDI:000108
- DANDI:000026

separately.
you can do it in the github.com/dandisets org where each dandiset has its own repo.
@gmazzamuto can you send me the contact details via email, or ping the github accounts (I searched but don't think I could find them)?
Issues with the datasets are now tracked here: https://github.com/dandisets/000108/issues https://github.com/dandisets/000026/issues
Thus the only relevant schema changes remain:

- `echo` entity in `VFA` suffix files, our schema change, pending a reply from BIDS here.
- `.h5` support for microscopy files, our schema change, not currently discussed with BIDS.
- `.ngff` support for microscopy files, our schema change, not currently discussed with BIDS.

Ok, so we have a new schema version containing only changes which enhance but do not diverge from BIDS. Almost all of both datasets fail with this schema, pending issues reported on them. https://github.com/TheChymera/bids-schemadata/tree/dandi_strict
@yarikoptic - i wanted to bring this up a little bit to ensure that we can do a bit more with bids metadata extraction before full-fledged bids support #432.
can we simply extract the metadata from a file path when possible (identifiers for subject, session, sample, modality-via-folder, technique-via-suffix, when available)?
@djarecka - is this something you could take a quick look at? dandiset 108 and 26 should provide a fair bit of examples and you can play with the path names of files using the API. this should return some structured metadata for the non-NWB part of the processing of a file. it should look like the session metadata and the subject metadata info we extract from NWB.
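A sketch of what that path-only extraction could return, with keys chosen to mirror the subject/session metadata extracted from NWB; the exact field names and helper name here are assumptions, not dandi-cli's actual API:

```python
import re
from pathlib import Path

ENTITY_RX = re.compile(r"(sub|ses|sample)-([0-9A-Za-z]+)")
SUFFIX_RX = re.compile(r"_([0-9A-Za-z]+)\.[^.]+$")

def metadata_from_path(path):
    path = Path(path)
    meta = dict(ENTITY_RX.findall(path.name))      # sub/ses/sample identifiers
    if path.parent.name in ("anat", "func", "micr"):
        meta["modality"] = path.parent.name        # modality via folder
    m = SUFFIX_RX.search(path.name)
    if m:
        meta["technique"] = m.group(1)             # technique via suffix
    return meta

# e.g. metadata_from_path("sub-I46/ses-SPIM/micr/sub-I46_ses-SPIM_sample-A_run-1_SPIM.ngff")
# -> {'sub': 'I46', 'ses': 'SPIM', 'sample': 'A', 'modality': 'micr', 'technique': 'SPIM'}
```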