con / nwb2bids

Reorganize NWB files into a BIDS directory layout.

Non-sanitized session strings in DANDI:000473 #10

Closed TheChymera closed 3 months ago

TheChymera commented 3 months ago

Relevant snippet:

├── sub-156131
│   ├── ses-156131_20191112-probe0
│   │   └── ephys
│   │       ├── sub-156131_channels.tsv
│   │       ├── sub-156131_contacts.tsv
│   │       ├── sub-156131_probes.tsv
│   │       └── sub-156131_ses-156131_20191112-probe0_ephys.nwb
│   ├── sessions.json
│   └── sessions.tsv
```
[deco]~ ❱ tree /mnt/data/.scratch/
/mnt/data/.scratch/
├── participants.json
├── participants.tsv
├── sub-128514
│   ├── sessions.json
│   └── sessions.tsv
├── sub-128515
│   ├── sessions.json
│   └── sessions.tsv
├── sub-128516
│   ├── sessions.json
│   └── sessions.tsv
├── sub-147463
│   ├── sessions.json
│   └── sessions.tsv
├── sub-147465
│   ├── sessions.json
│   └── sessions.tsv
├── sub-152414
│   ├── sessions.json
│   └── sessions.tsv
├── sub-152417
│   ├── sessions.json
│   └── sessions.tsv
├── sub-152419
│   ├── sessions.json
│   └── sessions.tsv
├── sub-156130
│   ├── sessions.json
│   └── sessions.tsv
├── sub-156131
│   ├── ses-156131_20191112-probe0
│   │   └── ephys
│   │       ├── sub-156131_channels.tsv
│   │       ├── sub-156131_contacts.tsv
│   │       ├── sub-156131_probes.tsv
│   │       └── sub-156131_ses-156131_20191112-probe0_ephys.nwb
│   ├── sessions.json
│   └── sessions.tsv
├── sub-216300
│   ├── sessions.json
│   └── sessions.tsv
├── sub-216301
│   ├── ses-216301_20200521-probe0
│   │   └── ephys
│   │       ├── sub-216301_channels.tsv
│   │       ├── sub-216301_contacts.tsv
│   │       ├── sub-216301_probes.tsv
│   │       └── sub-216301_ses-216301_20200521-probe0_ephys.nwb
│   ├── sessions.json
│   └── sessions.tsv
├── sub-225757
│   ├── sessions.json
│   └── sessions.tsv
├── sub-225758
│   ├── sessions.json
│   └── sessions.tsv
├── sub-225759
│   ├── sessions.json
│   └── sessions.tsv
├── sub-258412
│   ├── sessions.json
│   └── sessions.tsv
├── sub-258414
│   ├── sessions.json
│   └── sessions.tsv
├── sub-258416
│   ├── sessions.json
│   └── sessions.tsv
├── sub-258419
│   ├── sessions.json
│   └── sessions.tsv
├── sub-259112
│   ├── sessions.json
│   └── sessions.tsv
├── sub-268947
│   ├── sessions.json
│   └── sessions.tsv
├── sub-268951
│   ├── sessions.json
│   └── sessions.tsv
├── sub-273853
│   ├── sessions.json
│   └── sessions.tsv
├── sub-273855
│   ├── sessions.json
│   └── sessions.tsv
└── sub-273858
    ├── sessions.json
    └── sessions.tsv
```

I'm pretty sure this is what the metadata looks like in the DANDI archive, and it's read here → https://github.com/con/nwb2bids/blob/6faed97d1bdb39b0b6cd07c5c58a42657b1cb383/nwb2bids/base.py#L178

@yarikoptic I can sanitize as you suggested and replace all manner of special characters with X. It's just that you said in the meeting today that DANDI already sanitizes them, but I don't think it did here.

yarikoptic commented 3 months ago

yes, ATM you would need to sanitize any non-alphanumeric to some alphanumeric to become BIDS-compliant. In DANDI's organize we also sanitize but allow for - and +, which isn't BIDS compliant ATM.

back refs on related efforts

yarikoptic commented 3 months ago

right away -- if they are identical, there should be no sessions.json in each sub- folder - just place one at the top level. Moreover

TheChymera commented 3 months ago

@yarikoptic

> In DANDI's organize we also sanitize but allow for - and +, which isn't BIDS compliant ATM.

but the string I got contains an underscore. I know how to fix it, but it seems to contradict the statement about what's already sanitized in DANDI.

If I only need to replace - and +, that would best be done with replace; if I can expect literally anything from _ to ¯, it would be done via a whitelist of characters to keep as they are. Might be safest to do that anyway, since the NWB files don't need to come from DANDI 🤔

yarikoptic commented 3 months ago

Let me repeat

> ATM you would need to sanitize any non-alphanumeric to some alphanumeric to become BIDS-compliant.

so it means that you need to replace the underscore as well... and the whitelist is just "alphanumeric" characters. I don't see what contradicts here... there was no statement that we replace only - and + .
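For illustration, the whitelist approach described above could look roughly like the sketch below. This is not the code from the fixing commit: the function name `sanitize_label` is hypothetical, and using `X` as the replacement character is an assumption taken from the suggestion earlier in this thread. The pattern is restricted to ASCII letters and digits on purpose, since `str.isalnum()` would also accept non-ASCII letters.

```python
import re

# Whitelist: only ASCII letters and digits are kept; everything else
# (underscore, hyphen, plus, non-ASCII symbols, ...) gets replaced.
_NON_ALPHANUMERIC = re.compile(r"[^a-zA-Z0-9]")


def sanitize_label(raw: str, replacement: str = "X") -> str:
    """Replace every non-alphanumeric character in `raw` with `replacement`.

    Hypothetical helper; the name and the default replacement are assumptions.
    """
    return _NON_ALPHANUMERIC.sub(replacement, raw)


# Session string from the snippet at the top of this issue.
print(sanitize_label("156131_20191112-probe0"))  # 156131X20191112Xprobe0
```

A chained `str.replace` for `-` and `+` would leave the underscore in this string untouched, which is why the whitelist is the safer choice for NWB files of unknown origin.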

TheChymera commented 3 months ago

Fixed as of 0034ce4c4d58ae2523e6b9a89b60a2d62139be63