restrict exclude regexes even more

yarikoptic commented 6 years ago

ATM in https://github.com/INCF/pybids/blob/master/bids/grabbids/config/bids.json#L3 my guess is that it is likely that if I have BIDS-compliant dataset with either sub-derivatives or ses-models (unlikely but still legit), those directories and their files would be excluded. Why not to restrict on having a path separator right before? i.e. ^(.*[/\\])?derivatives$ . Actually with https://github.com/grabbles/grabbit/pull/48 , if you would like to restrict only at the top level, then it could be as simple as ^derivatives$ I believe

tyarkoni commented 6 years ago

As of #135, no subdirectories are excluded from indexing by default. There are new include and exclude arguments that globally filter files (as well as corresponding directives in individual JSON config files). So you can do something like layout = BIDSLayout('.', exclude=['^(.*[/\\])?derivatives$']) and set your own exclusions. Does that work?

Relatedly, the config argument now accepts 2-tuples as a valid input, where the first element is the name of a built-in config, and the value can be a partial dictionary used to update the defaults. So, e.g., you can also do something like layout = BIDSLayout('.', config=[('bids', {"exclude": "derivatives/"}), which provides much more flexibility.

If this covers your needs, can we close the issue?

yarikoptic commented 6 years ago

hm, so it will be up to the user to establish excludes? since sourcedata/ and derivatives/ are nohow prescribed in BIDS, I thought that it was actually a good thing to be "stringent" and exclude what not follows the spec. my initial concern was "not to exclude too much", but now I think I have cursed myself into the opposite ;)

tyarkoni commented 6 years ago

The problem is that, as far as I can tell, the spec doesn't say anything about what can't be included in a BIDS project. And for reasons discussed in #135, having default exclusions turns out to be problematic. So this is probably the least of all the evils...

yarikoptic commented 6 years ago

I hear you. But it does describe what can be included in the tree and what does not necessarily follows BIDS specification. So I do not see a reason why not to exclude those known suspects by default (as you did in the past, just somewhat too permissively). What I am afraid of (and I could just be wrong) is that with this change dropping all excludes, by default, pybids would now start returning files believed to be associated with subjects even though they would not follow BIDS specification, as they would have been located under derivatives/ and/or sourcedata/. First of all it would be big change in behavior to users who already use pybids and rely on those being excluded.

tyarkoni commented 6 years ago

Fair point. Now I'm on the fence again. I wonder if the best way to handle this is to treat it as a special case, and have something like a spec_paths_only flag that, if True, restricts to only "core" BIDS paths (i.e., those explicitly mentioned in the spec). @chrisfilo @satra @effigies, thoughts? Sorry to keep bringing up the same issue in different flavors, but I do think this is likely to be one of the most common pain points for users, so it's worth trying to get it right up front.

tyarkoni commented 6 years ago

Let me bring up another concern, which is that if we want to stick to only what's defined in the core spec by default, it doesn't make sense to list exclusions... I think that would argue for defining inclusions instead. Otherwise, anything the user drops into the project that isn't explicitly excluded (e.g., derivatives, sourcedata, code, models, etc.) is going to get scanned. And I think that kind of behavior will really throw users for a loop (e.g., "why is pybids picking up files in scratch/, but not in derivatives/?"). But coming up with a list of all the valid folders/files in the BIDS spec is not something I necessarily want to build into bids.json, and it seems likely that we will miss things that should probably be included.

(Not to mention that it's not really clear what constitutes a "core" part of the spec in any case. The spec document explicitly mentions code/, derivatives/, stimuli/, and sourcedata/ as optional folders, but it doesn't seem to say anywhere--and I'm not sure it should--that these are second-class citizens that one shouldn't automatically expect to have access to if they exist.)

tyarkoni commented 6 years ago

@yarikoptic, what use case do you have in mind where exclusions would actually be important? It's clearly true that having extra directories to scan will slow things down somewhat, but if that becomes an issue at some point, we could refactor grabbit internally to speed things up (e.g., to store things internally in a relational database instead of Python objects).

The only other major concern I can see is that the user might get extra files they don't want returned by queries. But I'd argue that that's a user issue. E.g., if you have a whole pile of junk in derivatives/, and the first time you call layout.get_events(), you get back additional event files you don't want, that's the situation in which you should then initialize the Layout with include or exclude directives specified. My guess is that this is not going to be that common an occurrence. And it might even be a subtle way to encourage users to structure their BIDS projects in sensible ways, as they will find it easier to exclude derivatives/ than to exclude a list of scattered directories they've allowed to accumulate.

chrisgorgo commented 6 years ago

I don't think it would be wise for pybids to return info about files that are not part of the spec.

It will incur a performance penalty. We already experienced that with the validator and dataset ds000030 - see https://github.com/INCF/bids-validator/issues/344.
Requiring developers to specify exclude patterns will make it really hard or even impossible to write robust applications that grab the right files. This might lead to hard to debug errors.

effigies commented 6 years ago

My opinion that I'm extremely ready to be argued out of is that there should be three modes of inclusion:

1) Official spec. All and only entities and paths that are legal in the official spec will be included by default. As BEPs get merged, this will expand. So MEG/EEG should soon be auto-loaded.

2) Draft spec. These config files should be maintained along with the drafts, and be considered unstable, but they should be include-able by name. Derivatives, models and PET are good examples. On finalization, these become defaults, but should still (as in the current #135 proposal) be callable by datasets in isolation.

3) App/dataset defined. Apps can provide their own configurations to be included with a similar logic as draft spec configurations. Similarly, a dataset can provide its own configuration in a base directory. The main difference between these and draft spec configurations is where the config is relative to the pybids library, the pybids-using app and the pybids-indexed dataset.

This leads to two additional thoughts:

1) For "unknown" directories (e.g. sourcedata/), you'll need to peek in to the top level see if there's a dataset configuration. If it's not there, then ignore as expected.

2) We should probably attempt to standardize these configuration files, so that datasets can define how grabbids crawls them and BIDS apps can at least in theory not have to use pybids. Then, as @satra suggested, the app can save the configuration it used at the time of writing, which will help derivatives datasets survive any changes in the draft spec.

satra commented 6 years ago

@tyarkoni, @yarikoptic - this is almost like a matrix of paths x configs, with the unknown being - what is in the dataset. it could be "raw data" [sourcedata + bids], it could be bids layout only, it could be derivatives of any kind and quantity. the design of layout should not try to assume what's in the dataset. and i think currently it doesn't.

the question one can ask is whether to include all data underneath a root path or not or force the user to specify which ones. the api should allow both imo, however, which one is set to default can be discussed. my preference would be that if a root path is provided, everything is included (whether known or unknown from the spec). if a "user/developer" wants to restrict it the api should offer a way of excluding known keys or paths. i'm assuming things are relative to root.

BIDSLayout('/root_dataset', exclude=['sourcedata', 'scratch', 'derivatives/bad_deriv'])

one could also have something that works with the BIDS validator:

BIDSLayout('/root_dataset', exclude=['sourcedata', 'scratch', 'derivatives/bad_deriv', '.heudiconv'],
            exclude_invalid=True)

and then for anything included configs can be added/redefined as presented before.

in some ways this is similar to the .bidsignore work as well (i think anything in .bidsignore should also be ignored by the layout reader).

correspondingly, perhaps there should be an overriding keyword which allows everything to be included.

tyarkoni commented 6 years ago

Thanks all, this is helpful. @satra, everything currently works exactly as you propose (including a validate argument that filters based on passing the BIDSValidator, though the validator itself still needs some work, I believe). You can actually pass either include or exclude arguments (but not both!), and these work either globally (if passed to init) or on a config-specific basis (if included in the JSON files). My thinking to this point was basically in line with yours re: expected default behavior as well as what the API allows you to do.

That said, I can see @chrisfilo's point about it being hard to write code that behaves predictably if you don't know what else could be sitting in the user's project. One way or another, it seems likely that stipulating a "restrictive" mode of access may be necessary to assure deterministic behavior. And then the question becomes whether an app developer should have to write the inclusion rules themselves, or whether there should be a default setting that provides a reasonable guess.

I think @effigies' suggestion is a nice compromise that fits with my suggestion to have something like a spec_paths_only argument. If False (default), everything is scanned. If True, only "core" files/folders are scanned. Beyond that, the user can exercise fine-grained control via all of the other channels (include and exclude arguments, custom JSON config files, etc.). Does that work for everyone?

If so, @chrisfilo, could you elaborate on what you think of as the set of files defined by the spec? Is it co-extensional with what's considered valid by the BIDSValidator? If so, that would make my life much easier--but it will likely require someone to invest the time making sure the validator is working and up to date, because I'm pretty sure it has some bugs/missing coverage at the moment, and I don't have the time to take it on.

chrisgorgo commented 6 years ago

Yes - the BIDSValidator is the way to go (actually current set of restrictions have been working fine for MRIQC and FMRIPREP so far). We should probably think of a way to share regular expressions used in JS bids-validator and this project.

tyarkoni commented 6 years ago

Excellent, that works for me. Unless anyone has objections to the above proposal, I think we've converged.

Actually, this makes life super easy, since if we're okay relying on the BIDSValidator, we don't even need to add an argument--we already have validate built-in for this. So I think we can just go with #135 in its current form.

tyarkoni commented 6 years ago

+1 for sharing regular expressions. I think having a master JSON list that looks like:

[
  {
    "name": "is_cont",
    "pattern": "^\\/(sub-[a-zA-Z0-9]+)",
    "description": "Verifies that the file is..."
  },
  ...
]

...would probably be sufficient. Then we could dynamically build the entire validator (or most of it at least) directly from the JSON file, and still provide informative messages about why a failure occurred if needed.

bids-standard / pybids

restrict exclude regexes even more #131