Open yarikoptic opened 5 years ago
Here is a reproducible demo to develop a test:
datalad rev-create 3055
datalad install -d . -s https://github.com/datalad/testrepo--minimalds.git sub
datalad aggregate-metadata --update-mode target -r
# fails to access
datalad metadata --reporton datasets sub
# remove subds in question
datalad uninstall sub
# works now
datalad metadata --reporton datasets sub
The core issue is found here: https://github.com/datalad/datalad/blob/master/datalad/metadata/metadata.py#L953-L978
A superdataset will not be queried for metadata of a subdataset or its content when the subdataset is present.
But I think it is actually worse, as the flow in the query procedure has a somewhat limited ability to deal with this.
Datasets are processed in the order in which path annotation finds them, which limits the possibilities to change this behavior (if this should be kept -- it was a requested thing at some point).
Possible fixes:
--dataset
) dataset
The second and third approaches have their benefits, but the third is likely the viable path forward, as it avoids requiring aggregation into superdatasets for querying hierarchies of datasets. However, it is messier than the second approach, as only the second is guaranteed to give identical results regardless of subdataset presence.
I could be totally wrong (my understanding of the flow/logic in this code is not yet superb), but couldn't it be just a matter of extending the check here to also test whether the dataset contains any aggregated metadata, i.e. something like:
if op.lexists(op.join(ap['path'], agginfo_relpath)):
    to_query = ap['path']
else:
    lgr.info("Dataset %(path)s is installed, but lacks aggregated metadata. Querying superdataset", ap)
    to_query = ap['parentds']
More completely, change:
if ap.get('state', None) == 'absent' or \
        ap.get('type', 'dataset') != 'dataset':
    # this is a lonely absent dataset/file or content in a present dataset
    # -> query through parent
    # there must be a parent, otherwise this would be a non-dataset path
    # and would have errored during annotation
    to_query = ap['parentds']
else:
    to_query = ap['path']
to
if ap.get('state', None) == 'absent' or \
        ap.get('type', 'dataset') != 'dataset':
    # this is a lonely absent dataset/file or content in a present dataset
    # -> query through parent
    # there must be a parent, otherwise this would be a non-dataset path
    # and would have errored during annotation
    lgr.info("Dataset %(path)s is not installed. Querying superdataset", ap)
    to_query = ap['parentds']
elif ap.get('type', 'dataset') == 'dataset' and \
        not op.lexists(op.join(ap['path'], agginfo_relpath)):
    # present dataset without its own aggregated metadata -> ask the parent
    lgr.info("Dataset %(path)s is installed, but lacks aggregated metadata. Querying superdataset", ap)
    to_query = ap['parentds']
else:
    to_query = ap['path']
This would fix the immediate issue demo'ed above, but there is no guarantee that the 'parentds' has the metadata and not the parent of the parent or even further away.
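To illustrate the "parent of the parent" concern, here is a rough sketch of walking upwards until some directory actually holds aggregated metadata. It is not what metadata() does: the function name find_dataset_with_agginfo is made up for this example, the value of AGGINFO_RELPATH is my assumption about where the aggregate info file lives, and the walk follows plain directories rather than real dataset boundaries.

```python
import os.path as op

# Assumption: location of the aggregate-info file relative to a dataset root.
AGGINFO_RELPATH = op.join('.datalad', 'metadata', 'aggregate_v1.json')


def find_dataset_with_agginfo(start, agginfo_relpath=AGGINFO_RELPATH):
    """Return `start` or its nearest ancestor directory holding aggregate info.

    Hypothetical helper: returns None if the filesystem root is reached
    without finding any aggregate-info file.
    """
    current = op.abspath(start)
    while True:
        if op.lexists(op.join(current, agginfo_relpath)):
            return current
        parent = op.dirname(current)
        if parent == current:
            # reached the filesystem root, nothing found
            return None
        current = parent
```

With something like this, to_query could be set to the returned path instead of blindly using ap['parentds'], at the cost of results depending on which datasets along the way happen to carry aggregated metadata.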
I was digging into this, although probably useless in light of https://github.com/datalad/datalad-revolution/pull/84, I just hoped that there would be a quick fix for me... The major blow (and possibly a "hint on workaround") is actually the difference between the calls "metadata subds" and "metadata -d . subds". In the realm of the discussion in datalad/datalad#3230: in the former (no -d) case the "context" of path annotation completely switches over into subds, so there is no information about parentds in the ap record, and thus it is not possible to query it.
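A small sketch of why the missing parentds matters for the fallback idea. The function choose_query_dataset and the has_agginfo callable are hypothetical, and the sample records only mimic the keys used in the snippets above ('path', 'type', 'state', 'parentds'); real annotated path records carry more fields.

```python
def choose_query_dataset(ap, has_agginfo):
    """Pick the dataset to query for one annotated path record `ap`.

    `has_agginfo` is a hypothetical callable that tells whether the dataset
    at a given path holds aggregated metadata.
    """
    if ap.get('state') == 'absent' or ap.get('type', 'dataset') != 'dataset':
        # absent dataset/file or content in a present dataset
        # -> query through parent, which must be known in this case
        return ap['parentds']
    parent = ap.get('parentds')
    if not has_agginfo(ap['path']) and parent is not None:
        # installed, but no aggregated metadata of its own, and a parent is known
        return parent
    # either the dataset has its own metadata, or there is no parent to fall back to
    return ap['path']


# Illustrative records: with -d the parent is known, without -d it is not.
with_parent = {'path': '/super/sub', 'type': 'dataset', 'parentds': '/super'}
without_parent = {'path': '/super/sub', 'type': 'dataset'}
```

For with_parent the fallback to the superdataset is possible, while for without_parent the function can only return the subdataset itself, which is exactly the situation of the call without -d.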
FTR: After some thinking in https://github.com/datalad/datalad-revolution/pull/84 I conclude that there is no single "best" way to decide which metadata to report without knowledge about the context of such a query. Consequently, metadata() is abandoned and replaced by two dedicated commands:
extract_metadata() - reports metadata from dataset content by running extractors. Metadata is not placed into a dataset, and no potentially existing aggregated metadata is investigated.
query_metadata() - reports based on aggregated metadata from a given (or current) dataset. No extractors are invoked, no searching for "better" metadata in any subdataset is performed.
Why couldn't plain metadata just correspond to query_metadata?
It could, but the current metadata() does something that is not easily verbalizable, and I have no idea which aspects have any real purpose. IMHO, metadata() can simply be dropped, or used to implement whatever more complex or automagic behavior is desirable.
For the upcoming ///openneuro I am aggregating metadata into that superdataset from subdatasets (https://github.com/datalad/datalad-crawler/pull/28/files#diff-8e8fc59a503a8bdc5f90e33e16d020b7R146), which seems to work lovely. But then I discovered that I cannot query metadata within a subdataset whenever it is installed, whereas it works fine as soon as I uninstall it.