bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International
275 stars 157 forks source link

[ENH] Allow for - (dash) (and/or +) in <label> #1165

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

I thought that we already had an issue filed for this, but may be it was just my comments in the google doc version of the BIDS specification, so filing anew. Please close it referencing original one if one already exists.

ATM <label> is as I formalized in yet to be finished PR is allowed to contain only alpha numerics. That makes it really difficult in some cases to place multiple entries (e.g. details for _acq- or abbreviate details of preprocessing in _desc- of BIDS derivatives). In ReproIn heuristic for heudiconv, where we do allow for - we just replace them with an ugly X as a separator to just stay compatible with BIDS.

I have been suggesting to extend the list of allowed characters in <label> to include at least - and possibly others (e.g., +) which otherwise should not introduce any backward incompatibility with BIDS 1.0 -- all BIDS 1.0 names would be compatible with new syntax allowing - in the <labels>.

The original concern (probably at least a year back) which precluded any action was a possible breakage of parsing BIDS filenames for BIDS-supporting software, and was suggested to be included in coming some time in the future BIDS 2.0.

N.B. I have failed to file and issue outlining plans for BIDS 2.0, so I initiated a new milestone (2.0.0) to which I am adding this new issue.

IMHO (and may be we should outline it somewhere) BIDS supporting applications should be coded adhering to Postel's principle which is driving TCP implementations to "be conservative in what you do, be liberal in what you accept from others.". And most likely, in the case of added -, they already are since they are more likely to be splitting based on _ and then on - to separate key from value instead of full rigid regular expressions limiting to the allowed characters (not so obviously) mandated by the BIDS specification. So the only catch here indeed might be splitting on - without restricting that only a single split should happen. That is where breakage could happen if - is added. The possibility of breakage in case of extending the list of allowed characters with + (or even : which I dislike to suggest due to messing up scp/ssh) unlikely to have negative ramifications.

Please share your thoughts/opinions

effigies commented 5 years ago

I'm personally okay with - or +, and can't really remember the arguments against them.

oesteban commented 5 years ago

I'm okay with +, and not very fond of - but I understand it could be added. I would not allow - to be present in the suffix entity, though.

yarikoptic commented 5 years ago

I would not allow - to be present in the suffix entity, though.

Great point I haven't thought to exercise @oesteban ! -- I would not allow it as well, makes parsing much more difficult etc! The only spot I see where <label> is used not within _key-<label> is

(git)hopa:~/proj/bids/bids-specification[derivatives]git
$> git grep '_<label'
src/04-modality-specific-files/04-intracranial-electroencephalography.md:called `electrical_stimulation_<label>`, with labels chosen by the researcher and

so might need analysis/tune up (or not)

yarikoptic commented 5 years ago

ok, after https://github.com/bids-standard/bids-specification/pull/152 is merged in one way or another we could have a PR extending the list of allowed characters non-ambiguously in a single location ;-)

nicholst commented 5 years ago

Note that the request to include hyphens in key-value labels is currently in the "Suggestions for BIDS 2.0" Google doc:

Allowing hyphens in filename key values

which suggests, as I had recalled, that this was a settled issue. Including hyphens increases the complexity of parsing the filenames, was the reason I recall.

nbeliy commented 5 years ago

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

yarikoptic commented 5 years ago

@nicholst Oh, that is where BIDS 2.0 "extensions" are discussed -- thank you for the link! I have added the link to the 2.0.0 milestone description. I think it might be worthwhile to move them all under issues over here at some point, so elderly folks like us could easily find everything relevant in the single location (here).

I would not say that parsing complexity would increase, but indeed it might require some adjustment.

nicholst commented 5 years ago

@nbeliy,

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

Just reciting the conclusion of a long thread about the issue. Like anything, this issue can be re-raised, but I think there is the basic issue of extra clarity from having the hyphen play exactly one role in file/directory names.

yarikoptic commented 5 years ago

my 1c: This issue by now is probably a few years old (IIRC I have suggested it possibly even during BIDS 1.x initial development/release) . If we acted sooner, it would be no issue but even those years back I have proposed it the main argument against was "we would break the tools" of which there were much fewer. So longer we wait to act, the harder would be to adopt this change.

oesteban commented 5 years ago

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

sappelhoff commented 5 years ago

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

Yes, and on top of that I think what this discussion needs are:

  1. concrete use cases where the proposed changes would significantly enhance BIDS (summaries/links to such cases)
  2. a concrete PR showing the diff of the spec prior and post the proposed changes (what @oesteban said)
  3. more participants to voice their opinions - especially developers

Currently this change seems to me as a minor improvement with the potential for a major confusion ... but I am very happy to be convinced otherwise :-)

nicholst commented 5 years ago

Use cases:

Subject or session labels come from long-established project that has identifiers with dashes.

Related, such a project uses identifiers with underscores (but not dashes), and the underscores need to be replaced; having dashes as an option for replacing is helpful.

robertoostenveld commented 4 years ago

Is UTF8 allowed in BIDS file names, or if not, should it be? If so, there are many dashes to choose from which are not a minus (ASCII decimal 45). For example ‒ – and —. See https://www.w3schools.com/charsets/ref_utf_punctuation.asp

oesteban commented 4 years ago

Using alternatives that look like the proposed addition doesn't help much with readability, makes it harder. Imagine a suffix like _inplane‒T2.json (I used your first UTF8 character). _inplane+T2.json breaks the suffix a bit, but is still readable. The limitation of the suffix is also important in this kind of proposal, IMHO.

I would be fine with the + symbol and I think the dash is not worth the pain. Adding the dash would feel to me as useful as allowing spaces (i.e., error-prone and likely confusing).

yarikoptic commented 4 years ago

I think we should explicitly forbid UTF8 and limit to ASCII to avoid such "tricks" which could potentially cause confusion. So let's move forward with "+"?
I doubt any tool was so strict with regexp that it listed only alpha numerics. and sooner we change, less possible negative effect it would have. This issue already dragged long time (I originally suggested in the times of BIDS being a google doc), and I think it is worth to be accepted to 1.x series without awaiting 2.0 which I am not quite sure would come in any foreseeable future

robertoostenveld commented 4 years ago

Agreed about limiting character sets, I also don’t look forward to multiple types of dashes or emojis in filenames. But the limit should be explicitly mentioned.

Regarding the “+” I am neutral.

Verstuurd vanaf mijn iPhone

Op 2 nov. 2019 om 00:57 heeft Yaroslav Halchenko notifications@github.com het volgende geschreven:

 I think we should explicitly forbid UTF8 and limit to ASCII to avoid such "tricks" which could potentially cause confusion. So let's move forward with "+"? I doubt any tool was so strict with regexp that it listed only alpha numerics. and sooner we change, less possible negative effect it would have. This issue already dragged long time (I originally suggested in the times of BIDS being a google doc), and I think it is worth to be accepted to 1.x series without awaiting 2.0 which I am not quite sure would come in any foreseeable future

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

yarikoptic commented 4 years ago

ok, it seems that the path of the least resistance would be to add + as allowed character. I will look into proposing a PR here and against bids-validator for that (unless someone would beat me to it which I would really appreciate)

yarikoptic commented 4 years ago

oh hoh, it had been a year! Few final remarks and then I will try to find time to compose a PR.

Since I and others mentioned a number of use cases, and IMHO it would be easier to strip away than to add later, I will prepare a PR which would introduce allowing both - and + and then we will see how it goes.

yarikoptic commented 2 years ago

woohoo -- I found you my darling issue -- you have been moved and I thought that I just lost my "search foo" skill!

With the addition of BIDS URIs (https://github.com/bids-standard/bids-specification/pull/918) into BIDS specification (aiming for 1.8.0 release) I really want to make a case for this issue to be given yet another consideration:

Now I will really make an effort to file a PR but might need to prep few "preparatory" PRs first.

effigies commented 2 years ago

I moved this back. Sorry, not sure exactly why it got moved; perhaps it seemed abandoned, or was lumped in with the BIDS-2 proposal linked above because of the - suggestion.

Reading through the thread, I think there's basically:

So I think a reasonable proposal is to change the regex for label to [a-zA-Z0-9+], and associated textual changes. It does not seem worth the time to write for both + and -, given that there's known opposition and - was specifically tabled before.

TheChymera commented 2 years ago

While I have no objection to + other than that it looks ugly, - I think is a very bad idea. I think the main issue is not high-level tools such as validators, but basic tools such as user bash scripts and path legibility at a glance. Minus is a special character, and that is good. I would make an analogy to periods in file names, using periods to delimit words in a filename is a bad idea.

The genealogy of this request is also indicative of problems which BIDS could help solve by encouraging better data practices. This --desire generally crops up because operators will put the experiment name, animal name, animal genotype, session date, etc. in the name field to keep it all in one place. That is a poorly annotated experiment and not a string which is just “too fancy” for BIDS. So I think there is something to be said for explicitly not supporting minuses. If you need a minus you're trying to cram 2 concepts in a field, pretty much by definition.

yarikoptic commented 2 years ago

... using periods to delimit words in a filename is a bad idea

and thus we don't use periods in the middle of the filenames (although could), but we do use .tar.gz and others to compose extensions together. In other words, "allowed != always a good idea". But on contrary "disallowed == need to come up with uglier workaround" (.tgz?).

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Another analogy -- BIDS doesn't restrict the length of the <label> but nobody does super long labels. The same sanity I expect to be applied to use of - by the user.

oesteban commented 2 years ago

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Are we not talking about the 20% in the 80/20 principle? How many datasets would need such flexibility to specify acq-? This means, all other entities are the same so you can't disambiguate ?

I tend to agree with Chris' assessment, and, as I mentioned before, once + is allowed, it could be given tokenization semantics. That would turn - redundant as its sole proposed purpose is label tokenization.

neurorepro commented 1 year ago

I agree with the usefulness of +, is there any update on this issue ?

effigies commented 1 year ago

No update. Rereading this, I think my assessment remains the same as before: We can do this. Somebody needs to make a concrete proposal in the form of a PR. I have some new questions that I would like to see addressed in a proposal (possibly they came up and I skimmed too briefly):

1) Can all labels include +, or should we have labels ([a-zA-Z0-9]+) and multi-labels ([a-zA-Z0-9+]+)? 2) Is + allowed in index ([0-9]+) or just label? 3) Can + be the first or last character of a label(/index)? (If no, the regexes will get much uglier.) 4) Is this generally available or only in derivatives? 5) Do we define any semantics? e.g., if you combine run-01 and run-02, is run-01+02 now acceptable? Previously derivatives said to remove the run entity. In the other direction, does run-01+02 imply any relationship to run-01 or run-02 per the standard? This will have implications on tooling: if I query layout.get(run=1), should I get run-01 and run-01+02? 6) What does this mean for suffixes? These aren't "entities", so we could say this is just labels.

I would suggest the following answers, but I'm not 100% set:

1) Split labels and multi-labels. Subject shouldn't have a multi-label IMO. Possibly others shouldn't as well. 2) Just label. Indexes have numerical meaning and a list of numbers is less clear how to work with. 3) Weak no. I could imagine a label like B1+ might be useful enough to justify permitting it. 4) General. 5) No semantics. It's just another character in a label for human readability. Tooling should not count on consumers to interpret multi-labels as compositions of discrete meanings. If detailed metadata needs to be communicated, it should be in JSON. 6) Leave suffixes as a controlled vocabulary. On a case-by-case basis a suffix with a + can be considered.

neurorepro commented 1 year ago

Thanks @effigies , I'm happy to help on the PR

yarikoptic commented 1 year ago

Are you @effigies to introduce some new semantic behind milti-label - Ie would have some meaning beyond just being a label?

effigies commented 1 year ago

Hmm. Not intentionally (see 5). Possibly that argues against my position 1. I do think the end result should be logically consistent, however these questions are answered.

TheChymera commented 1 year ago

The same sanity I expect to be applied to use of - by the user.

Ambitious hope. I think the lack of minus support is usually the point where people are introduced to metadata separation. I'm still explaining to one of our colleagues why the name of his mouse is not part of the session name.

I continue to think this makes BIDS less conducive to good metadata practices. If you really really want to allow something like this, maybe it should be made sufficiently prohibitive so as to give the user pause as to whether it's worth it. We could support some other character. Perhaps the cdot, , it's actually very aesthetic and distinctive from the minus (and very obviously not a period either). This is similar to an older suggestion by @robertoostenveld , except the cdot is visually distinct enough from the dash/minus family (em-dash , en-dash , and minus -) to not cause visual confusion.

The prohibitiveness of comes from most people not knowing where it is on their keyboards.

oesteban commented 1 year ago

@effigies could you confirm that here https://github.com/bids-standard/bids-specification/issues/1165#issuecomment-1549699835 you are referring to only + ? I believe one way of pushing this forward faster is dropping - from proposals now, see + added and if someone is really interested in also permitting - then iterate on a new PR.

Another weak idea regarding 3 (and trying to get regexes even uglier) is: should two + be allowed? Something like: acq-O++O.

Overall, I totally agree with @effigies' assessment, but also agree with @yarikoptic on why this should get added semantics (e.g., allow for subject labels).

effigies commented 1 year ago

Yes, just +. Let's consider - as dropped from this proposal.

yarikoptic commented 4 months ago

Given the absent concrete action (from me) on this, I am thinking that we should bundle this one into BIDS 2.0. Hence I will announce this closed and let's continue on

tsalo commented 2 weeks ago

The BIDS maintainers have discussed this issue and I think we've converged on allowing +, but not -.

yarikoptic commented 1 week ago

FWIW, as #1926 shows, change is "substantial" enough to also allow for -, I would even might have postponed it until BIDS 2.0 frankly: any BIDS compliant tool might need to adjust the way they parse filenames, especially if they do not use schema yet. IMHO - is dominantly already used in various use cases for separation of words in the labels, and + is more of "something which is not -" but without really a big reason: I think BIDS should encourage consistent parsing of file names based on simple algorithm of "split by _ into entities + suffix; then split each entity into name and the value by splitting on the first -". That would make all tools robust, and it would make them work with any regex we adopt as allowed for the BIDS: tools IMHO should not rely on any specific regex, it should be just a job of the validator to ensure compliant values.