bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License
221 stars 122 forks source link

Q: Flatten metadata keys? #694

Open tyarkoni opened 3 years ago

tyarkoni commented 3 years ago

An issue that's cropped up a couple of times, and is the underlying root cause of the current testing failures (see peripherally related discussion in #682) is that we have no good way of dealing with JSON metadata that have collections as values.

Currently, these are naively stored in the DB as dictionaries or lists, but that can cause problems throughout the codebase, as it's not really reasonable to check in a whole bunch of places to make sure that Tag values are hashable and respond accordingly. Plus, it's unpredictable from the user's perspective.

The proposal is to either (a) ignore nested metadata; (b) automatically flatten them (e.g., {"KeyA": {"KeyB": "myvalue"}} becomes {"KeyAKeyB": "myvalue"}; or (c) do both and let the user control that with an optional flag passed at indexer initialization (default value being to ignore). We could even have a third mode that does nothing (i.e., current behavior), allowing the user to access values as collections, but with the expectation that failures might occur unpredictably.

Thoughts? I incline towards (c) (without a "do nothing" option).

tyarkoni commented 3 years ago

@oesteban @effigies note that automatic flattening would also solve the problem in #682 for free, as "run" would now be "runDescription", and hence would never be scanned in the offending code inside __repr__ (and, more generally, we would never get conflicts for .tsv files, as the spec explicitly says values for column metadata have to be objects with keys like "Description").

effigies commented 3 years ago

What about KeyA.KeyB to make it clear that it's going into an object? You could imagine a conflicting runDescription field, but run.Description seems less likely to have an accidental clash.

I guess it does mean that get_run.Descriptions() isn't possible. Do we care that much about those convenience functions?

tyarkoni commented 3 years ago

What about KeyA.KeyB to make it clear that it's going into an object?

Fine with me. I agree that adds clarity.

Do we care that much about those convenience functions?

I don't, and TBH I feel like we should maybe scale back on a lot of the aliases and helper functionality in the next major release. We should probably save a list of proposed breaking changes somewhere for 1.0. I would also add removal of the id, filename, and dir return types as a convenience that is trivial to replace on the user end and adds unnecessary maintenance burden.

oesteban commented 3 years ago

I believe this test https://github.com/bids-standard/pybids/pull/684/files#diff-9562e9d99fe83db8bacdd0aa90936114ad0a87d40bdb12a470381e19e5e8774cR460 would fail currently with master (while #683 is not merged). And that test is not featuring examples of invalid BIDS metadata. Let me resync #683 with master and replicate the problem.

Then, if BIDS-Derivatives starts making use of nested dictionaries, there's potential that we will need to backtrack the flattening.

Finally, it seems that querying is working fine - as you said in the other thread, we really need to fix the test, not the tool :D. And a test that breaks things is a valuable test, perhaps invalids from the BIDS spec perspective, but very representative of how people may stretch pybids.

In favor of dropping features that don't really offer great value (e.g., return types). But I'm not fully confident that flattening dictionaries is going to make a dent in the maintenance burden. I would keep the idea and wait for BIDS-derivatives to evolve and see how complex metadata can get.

oesteban commented 3 years ago

Actually, this problem would happen again if some metadata file overshadows potential target of get() with return_type='id'.

So, if the user has a BIDS directory with 2 different run ids under func/ and an anat/sub-01_T1w.nii.gz with metadata:

{
    "run": ["unrelated", "values"]
}

Then this flattening would fail to avoid the clobbering, and so:

>>> set(layout.get(return_type='id', target='run'))

would break unless #684 is merged, after which:

>>> set(layout.get(return_type='id', target='run'))
{("unrelated", "values"), 1, 2}

which IMHO is the most appropriate outcome for someone doing this stupidity.

tyarkoni commented 3 years ago

Finally, it seems that querying is working fine - as you said in the other thread, we really need to fix the test, not the tool :D.

The current tests failures are due to internal problems with handling collections (instead of basic dtypes), so your patch won't fix them. E.g., one of the failures is arising because we assign the metadata value inside pandas DFs, and it doesn't like when the value is a dictionary. This can be worked around on a case-by-case basis, but I would prefer not to do that—I don't like the idea of dicts being a legal value for essentially any entity. I would rather force the user to (a) drop nested values, (b) flatten them, or (c) accept that things might break because they insist on passing around collections internally.

Basically the problem is that if we don't do something to handle dict/list metadata values, PyBIDS will break in certain cases. So our options are to either adopt something like the proposal above, or sweep through the codebase and handle a bunch of special cases (the most common class probably being when converting between the SQLAlchemy representation and pandas structures, which happens a lot), add another abstraction layer, etc. My preference would be for the former, but I could be convinced to do the latter, I suppose.

tyarkoni commented 3 years ago

Okay, on revisiting the various threads, I guess we're still kind of stuck. @oesteban's patch in #684 fixes the specific problem that spawned this whole discussion (in #681), but it doesn't address the general problem of dealing with nested collections, which could theoretically arise in many places throughout the codebase. Also, it makes returned collections immutable, which is on the one hand a pretty sane way to deal with the problem, but on the other hand is counterintuitive for users, and also adds a non-standard type (frozendict).

By contrast, the approach I proposed in the first comment of this thread is more principled, but it only works for dictionaries—we can't really flatten lists. (I don't think we want to represent "run": ["unrelated", "values"] as "run_1": "unrelated" and "run_2": "values").

I think maybe the best way forward here is to combine the two approaches by (1) treating dictionaries per my proposal, and (2) making lists hashable by converting them to tuples, as per Oscar's patch in #684.

The only case this leaves unhandled is where a value in a JSON sidecar is a list containing collections as elements. In that case, an exception would still be raised. But that latter scenario seems to me pretty unusual, so I incline to ignore it for the time being.

Thoughts, @effigies, @oesteban?

effigies commented 3 years ago

Sorry for not getting to this more quickly. I think it makes sense, and can't readily think of a better approach.

adelavega commented 3 years ago

It seems like part of the problem is that we are mashing together meta-data with entities. Entities are a strict set of reserved key words, which are defined as having a string or scalar value. Conceptually that seems odd to me.

Otherwise, I would be fine with flattening meta-data keys, that would make it relatively easy to use nested meta-data in filters and in practice solve most problems.

Also, are there that many places in the code base where non-hashable types will cause problems? Perhaps we could also just ignore non-hashable types in those places (versus making them hashable).

oesteban commented 3 years ago

I would rather say that BIDS doesn't disallow or even discourage defining new metadata fields with three name of an existing entity. While this is relatively obvious, it will easily become a rabbit hole with derivatives.

Flattening solves the DB face of things. A more robust solution would be that entity regexps were applied to filter all these shorthand queries.

adelavega commented 3 years ago

Agreed its a spec problem, but can you elaborate what you mean by "shorthand queries"?

Without flattening, if somebody said .get(return_type='id', target='nested_metadata_field'), that would still cause problems right?

oesteban commented 3 years ago

Agreed its a spec problem, but can you elaborate what you mean by "shorthand queries"?

Sorry, I mean get_*s - like get_sessions.

Without flattening, if somebody said .get(return_type='id', target='nested_metadata_field'), that would still cause problems right?

Probably not, because there's nothing to set() to filter out repetitions - so no need for hashing anything.

adelavega commented 3 years ago

Ah, sounds good to me. More generally, we could say that if the target= a reserved BIDS entity keyword, it will filter results to only those that match the corresponding regex (and excludes non-hashable types).

About return_type='id'; this return unique ids, so it applies set, and would crash if there are non-hashable types returned. This is a useful query for non-BIDS entities too; for example what if you want to know what the unique TRs used in a dataset are. I think it's fair to simply exclude non-hashable types for return_type='id' or make them hashable as with your PR.

adelavega commented 3 years ago

Question: Would it ever make sense to specify a BIDS entity (i.e. run) only in a .json sidecar? Wouldn't that be illegal since it should necessarily already be specified in the filename of the corresponding file? If it's an actual entity of that file.

If so, an easier solution is that if target= an entity that should come from the file name, and not meta-data, we filter to only include non meta-data in get().

I'll push a PR with that.