Standardize .bidsignore

bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification

https://bids-specification.readthedocs.io/

Creative Commons Attribution 4.0 International

280 stars 165 forks source link

Standardize .bidsignore #131

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

ATM bids-validator supports .bidsignore file and describes it as

.bidsignore

Optionally one can include a .bidsignore file in the root of the dataset. This file lists patterns (compatible with the .gitignore syntax) defining files that should be ignored by the validator. This option is useful when the validated dataset includes file types not yet supported by BIDS specification.

*_not_bids.txt
extra_data/

while also hardcoding some additional ignores:

    .add('.*')
    .add('!*.icloud')
    .add('/derivatives')
    .add('/sourcedata')
    .add('/code')

In the Python land of pybids there is a constant battle to hardcode various additional ignores, see e.g. https://github.com/bids-standard/pybids/pull/277#issuecomment-456493679 and a common consensus is that supporting .bidsignore would be the way to go forward. For that to happen properly, and to not have varying/possibly conflicting support of .bidsignore file across software toolkits, it is desired to describe that in BIDS specification document somewhere, e.g. a new 99-appendices/09-software-support.html, stating our expectations for that file syntax. I see two choices

unification with .gitsignore - sounds like a nice idea, but I am afraid it might be also a fragile one unless we start use git itself, i.e. git-check-ignore (with hard-dependency on git, which I am not) since through time Git might introduce/change .gitignore specification. ATM (if I got it right) bids-validator relies on https://www.npmjs.com/package/ignore to provide JS implementation. For pybids we could probably find some Python implementation, but so far the ones I found are too adhoc and/or old/abandoned (e.g. https://github.com/snark/ignorance), so calling out to git-check-ignore sounds like a best idea
prescribed subset of .gitignore - we clearly describe what patterns are supported etc.

What do you think the @bids-standard/everyone ?

nicholst commented 5 years ago

Seems sensible... like the use of .gitignore syntax and not unifying with .gitignore itself.

sappelhoff commented 5 years ago

it is desired to describe that in BIDS specification document somewhere, e.g. a new 99-appendices/09-software-support.html, stating our expectations for that file syntax.

though as long as we describe our ignore syntax explicitly it is IMO not strictly necessary to have it as a subset of .gitignore, that is, a few additional ignore rules should be fine even if they are not supported by .gitignore (all given that there are good reasons)

robertoostenveld commented 5 years ago

You write

support of .bidsignore file across software toolkits

I have a conceptual issue there... but perhaps I am just stating the obvious.

with .gitignore I know that it is git (the application) that is reading it and subsequently ignoring certain files. With .bidsignore it is not explicit which application should ignore the files. I expect bids-validator to ignore it (and know that expectation to be met). But if have a .bidsignore in my own BIDS dataset, I don't expect my own pipelines to be ignoring the listed files. Most likely I put the to-be-ignored datafiles there just because my application/pipeline needs them.

For me .bidsignore conceptually corresponds to the "diff" between the file list that the one application (bids-validator) expects, and the file list that the other application (my pipeline) expects. That works in the specific case, but fails to scale if three applications are involved. I also don't see how it would scale to arbitrary applications.

Can we agree on .bidsignore only to be read and interpreted by the bids-validator? Or more precisely, that it is the list of files (and directories) to ignore to make the dataset bids compliant. Compliance testing is currently implemented in the nodejs bids-validator app, but in the future might be checked by other toolkits.

Also note that if I were to release my application/pipeline as "myapp", and if it would need to ignore files, I might want those be listed in a .myappignore. As such it would fall outside the bids-specification (since future apps cannot be part of the current specification). The solution to keep the dataset bids-compliant would then be to include .myappignore in the .bidsignore.

edickie commented 5 years ago

My assumption would be that .bidsignore would be used by the bids-validator and pybids (the python package that many python-based bids-apps are using for parsing the BIDS folder structure). Are there any others?

On Wed, Jan 23, 2019 at 4:52 AM Robert Oostenveld notifications@github.com wrote:

You write

support of .bidsignore file across software toolkits

I have a conceptual issue there... but perhaps I am just stating the obvious.

with .gitignore I know that it is git (the application) that is reading it and subsequently ignoring certain files. With .bidsignore it is not explicit which application should ignore the files. I expect bids-validator to ignore it (and know that expectation to be met). But if have a .bidsignore in my own BIDS dataset, I don't expect my own pipelines to be ignoring the listed files. Most likely I put the to-be-ignored datafiles there just because my application/pipeline needs them.

For me .bidsignore conceptually corresponds to the "diff" between the file list that the one application (bids-validator) expects, and the file list that the other application (my pipeline) expects. That works in the specific case, but fails to scale if three applications are involved. I also don't see how it would scale to arbitrary applications.

Can we agree on .bidsignore only to be read and interpreted by the bids-validator? Or more precisely, that it is the list of files (and directories) to ignore to make the dataset bids compliant. Compliance testing is currently implemented in the nodejs bids-validator app, but in the future might be checked by other toolkits.

Also note that if I were to release my application/pipeline as "myapp", and if it would need to ignore files, I might want those be listed in a .myappignore. As such it would fall outside the bids-specification (since future apps cannot be part of the current specification). The solution to keep the dataset bids-compliant would then be to include .myappignore in the .bidsignore.

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-specification/issues/131#issuecomment-456738330, or mute the thread https://github.com/notifications/unsubscribe-auth/AMl59qyQkdVQDLa-rw9ROFQ14_xgt6C1ks5vGDDbgaJpZM4aNYRV .

jasmainak commented 5 years ago

I agree with @robertoostenveld . However, regarding this:

Compliance testing is currently implemented in the nodejs bids-validator app, but in the future might be checked by other toolkits.

Do you mean that these toolkits are completely equivalent implementations? That is, for the same input they produce the same validation output. Otherwise, I can imagine a scenario where one dataset would pass on one validator but not on the other. Thus, I think there can be only one validator ...

yarikoptic commented 5 years ago

Can we agree on .bidsignore only to be read and interpreted by the bids-validator?

I would say - "No", because it is is not a .bids-validatorignore. But indeed software-specifics is a valid concern, but it is not for the "dataset producer" to code up all the ignores for all possible software, so I would say there should be a way to provide generic ignores (e.g. supplementals, environments etc, which are present but not part of BIDS and/or just shouldn't be considered -- may be even some bad sub-* directories) and then software-specific ignores. But if it is app specific -- it is for the app to take care about ignoring some to start with. E.g. bids-validator already hard-codes application specific ignore - derivatives since it doesn't support them (for now at least). And that is why indeed generally speaking derivatives/ shouldn't be in the .bidsignore. If there is some other use case where one tool (flexibly) should ignore some directories which are generally available in the BIDS (any specific usecase @robertoostenveld?), I see two ways to proceed;

if we decide to go for a "custom" format for .bidsignore -- then we could allow for annotation of entries with a list of software to take them into account (somewhat messy, don't like it)
in addition ot .bidsignore -- add that .myappignore or .bidsignore-myapp (myapp could be bids-validator or pybids) and corresponding software would read it (pybids would probably then just allow to pass some "software" name instead of pybids since pybids largely is a library so it might be for a higher sitting software to make decisions on what to ignore)

jasmainak commented 5 years ago

wouldn't you say that this requires a fix in the validator rather than bending the standard to make the validator happy? Perhaps a command line flag like --ignore-derivatives or so might do the trick.

Then you can have .pybidsignore for files to be ignored by pybids. For me .bidsignore should mean what you call .bids-validatorignore because you have no standard without the validator.

effigies commented 5 years ago

This is starting to get complicated, IMO, which worries me. I think .bidsignore is a decent idea, and that its interpretation is "I know about these non-BIDS-conformant files. Don't throw errors." It is then up to the app developer what else to do about these files.

With bids-validator, that's easy: Its whole purpose in life is to throw errors.

With pybids, it's a little less straightforward, as I may mean "Do not index" or "Index, but don't yell about non-standard entities". I think we have strictness modes in bids.BIDSLayout already, so I think that's fine. If an app developer needs some other interpretation of .bidsignore, that can be handled in pybids. Further, pybids allows custom entities to be defined by app developers, which is going to be necessary as derivatives cannot encompass all possible data descriptions.

But I really don't like the idea of a .pybidsignore vs .bids-validatorignore, selecting different subsets of the dataset to make invisible to different tools. It's either valid BIDS or it's not, and then how to work with non-standard files is a separate question. It's also no longer describing the data but a dataset-dependent config file for an app. That's a bit concerning, and to the extent that this is desirable, I think it is more appropriately handled by the as-yet-nonexistent BIDS Execution configuration.

That said, there's nothing stopping you from adding .*ignore to your .bidsignore and making 20 different .XYZignore files.

satra commented 5 years ago

@yarikoptic - but .bidsignore was introduced to the bids spec to support validation for bids. it was not introduced to support any other application other than to say "ignore these files when determining if this a bids dataset".

and to reduce output noise from the validator just as .git and others do to reduce noise on their output.

so it's tied to the spec and hence to validation of the dataset.

yarikoptic commented 5 years ago

@yarikoptic - but .bidsignore was introduced to the bids spec ...

Was it?

yarikoptic commented 5 years ago

@effigies et al. I agree that we should stop talking about separate application specifics here and concentrate on the .bidsignore to get described in the spec as to prescribe files/directories which must not be considered as far as we are talking about BIDS (even if conform bids, possibly in a subdirectory structure) - queries, manipulations, etc. So mentally really the same what .gitignore does for git.

effigies commented 5 years ago

Just to return to this, I want to copy in the .gitignore conventions so we can directly discuss them. I think they're generally reasonable, though I don't see much value in the ! rules. The rest can be mostly summarized as fnmatch + ** globs.

And my overall position is I would rather have explicit rules than a dependency on git, which could potentially (though is unlikely to) be a moving target.

PATTERN FORMAT

A blank line matches no files, so it can serve as a separator for readability.

A line starting with # serves as a comment. Put a backslash ("\") in front of the first hash for patterns that begin with a hash.

Trailing spaces are ignored unless they are quoted with backslash ("\").

An optional prefix "!" which negates the pattern; any matching file excluded by a previous pattern will become included again. It is not possible to re-include a file if a parent directory of that file is excluded. Git doesn’t list excluded directories for performance reasons, so any patterns on contained files have no effect, no matter where they are defined. Put a backslash ("\") in front of the first "!" for patterns that begin with a literal "!", for example, "\!important!.txt".

If the pattern ends with a slash, it is removed for the purpose of the following description, but it would only find a match with a directory. In other words, foo/ will match a directory foo and paths underneath it, but will not match a regular file or a symbolic link foo (this is consistent with the way how pathspec works in general in Git).

If the pattern does not contain a slash /, Git treats it as a shell glob pattern and checks for a match against the pathname relative to the location of the .gitignore file (relative to the toplevel of the work tree if not from a .gitignore file).

Otherwise, Git treats the pattern as a shell glob: "*" matches anything except "/", "?" matches any one character except "/" and "[]" matches one character in a selected range. See fnmatch(3) and the FNM_PATHNAME flag for a more detailed description.

I copied that description here:

The name fnmatch() is intended to imply filename match, rather than pathname match. The default action of this function is to match filenames, rather than pathnames, since it gives no special significance to the slash character. With the FNM_PATHNAME flag, fnmatch() does match pathnames, but without tilde expansion, parameter expansion, or special treatment for a period at the beginning of a filename.

A leading slash matches the beginning of the pathname. For example, "/*.c" matches "cat-file.c" but not "mozilla-sha1/sha1.c".

Two consecutive asterisks ("**") in patterns matched against full pathname may have special meaning:

A leading "**" followed by a slash means match in all directories. For example, "**/foo" matches file or directory "foo" anywhere, the same as pattern "foo". "**/foo/bar" matches file or directory "bar" anywhere that is directly under directory "foo".

A trailing "/**" matches everything inside. For example, "abc/**" matches all files inside directory "abc", relative to the location of the .gitignore file, with infinite depth.

A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, "a/**/b" matches "a/b", "a/x/b", "a/x/y/b" and so on.

Other consecutive asterisks are considered regular asterisks and will match according to the previous rules.

tyarkoni commented 5 years ago

I thought I commented on this earlier, but I guess not. Just to add my two cents: I feel pretty strongly that .bidsignore handling should live in BIDS-Validator and not in PyBIDS. I'm happy to pass the path to the .bidsignore file (or even its contents) to the validator at initialization (probably necessary, since the validator currently operates only on a file-by-file basis), but there's no reason PyBIDS should have to be aware of .bidsignore at all. Putting handling in the validator ensures that other Python packages can guarantee proper .bidsignore processing without having to reimplement it themselves.

yarikoptic commented 5 years ago

@tyarkoni Could you please elaborate a bit more since I am not sure I am following.

I'm happy to pass the path to the .bidsignore file (or even its contents) to the validator

Are you suggesting that PyBIDS could/should use bids-validator (the alien nodejs utility) to provide it (PyBIDS) with the list of ignored (or the opposite - legit) files, so that PyBIDS could then use that information to filter its results?

probably necessary, since the validator currently operates only on a file-by-file basis

it validates consistency of lengths for the same func/ file across subjects/runs - so clearly not just independently validates files in isolation.

but there's no reason PyBIDS should have to be aware of .bidsignore at all.

E.g. if I have a file (or a subtree of directories, e.g. myworkdir/) which is not a conventional BIDS, to pacify bids-validator I will add an entry to .bidsignore. It would stop whining and I get happy. Then I start using PyBIDS, and it would return me all the hits for the files, even the ones I .bidsignore'd (e.g. under myworkdir/). Is that what is intended?

Putting handling in the validator ensures that other Python packages can guarantee proper .bidsignore processing without having to reimplement it themselves.

How would that happen? validator would just ignore them, while Python packages would have no clue they were ignored (unless the idea for them to talk to validator)?

tyarkoni commented 5 years ago

Are you suggesting that PyBIDS could/should use bids-validator (the alien nodejs utility) to provide it (PyBIDS) with the list of ignored (or the opposite - legit) files, so that PyBIDS could then use that information to filter its results?

Each BIDSLayout internally initializes a BIDSValidator. Thereafter, file validation is done by calling that validator. I'm proposing doing something like BIDSValidator(bidsignore='/path/to/.bidsignore', ...). The BIDSValidator would be in charge of parsing that file and updating its internal filters for subsequent is_bids() calls.

it validates consistency of lengths for the same func/ file across subjects/runs - so clearly not just independently validates files in isolation.

Well, the JavaScript validator may be doing this... I don't believe the Python validator is (which is a problem that probably needs to be addressed separately).

E.g. if I have a file (or a subtree of directories, e.g. myworkdir/) which is not a conventional BIDS, to pacify bids-validator I will add an entry to .bidsignore. It would stop whining and I get happy. Then I start using PyBIDS, and it would return me all the hits for the files, even the ones I .bidsignore'd (e.g. under myworkdir/). Is that what is intended?

The BIDSLayout has ignore and force_index arguments that take precedence over everything else. So you can always make pybids do what you like. But I definitely don't think we should have separate .bidsignore and .pybidsignore files!

How would that happen? validator would just ignore them, while Python packages would have no clue they were ignored (unless the idea for them to talk to validator)?

On this proposal, the BIDSValidator would take a bidsignore argument. The BIDSValidator is currently unaware of BIDS projects (as I said earlier, it receives files one at a time). So it would fall to each tool to decide whether or not to pass the .bidsignore file to the validator.

tyarkoni commented 5 years ago

On re-read, @yarikoptic, I think what you may be missing is that the Python validator doesn't work like the JS validator. It doesn't do a single pass through the dataset and return problems; it currently validates each file on an ad-hoc basis. We definitely need to implement layout-level checking too (and that should probably live in the BIDSValidator as a method that takes a BIDSLayout as an argument), but otherwise the way it's set up is pretty helpful in terms of interaction with other packages (including pybids).