datalad / datalad-metalad

Next generation metadata handling
Other
13 stars 11 forks source link

Superds fails to find subds metadata whenever subds doesn't have it aggregated and is installed #173

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

For upcoming ///openneuro I am aggregating metadata into that superdataset from subdatasets (https://github.com/datalad/datalad-crawler/pull/28/files#diff-8e8fc59a503a8bdc5f90e33e16d020b7R146), which seems to work lovely. But then I have discovered that I cannot query metadata within subdataset whenever it is installed:

$> datalad ls ds000008 
ds000008   [annex]  master  ✗ 2018-07-13/21:41:18  ✓

$> datalad -f json_pp metadata --reporton all ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz
[WARNING] Found no aggregated metadata info file /mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/.datalad/metadata/aggregate_v1.json. Found following info files, which might have been generated with newer version(s) of datalad: .datalad/metadata/aggregate_v1.json. You will likely need to either update the dataset from its original location,, upgrade datalad or reaggregate metadata locally. 
[WARNING] Dataset at . contains no aggregated metadata on this path [metadata(/mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz)] 
{
  "action": "metadata", 
  "path": "/mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz", 
  "status": "impossible", 
  "type": "file"
}

whenever it works fine as soon as I uninstall it

$> datalad uninstall ds000008 
uninstall(ok): /mnt/btrfs/scrap/datalad/openneuro-samples/ds000008 (dataset)
action summary:
  drop (notneeded: 1)
  uninstall (ok: 1)

$> datalad -f json_pp metadata --reporton all ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz | head -n 10
{
  "action": "metadata", 
  "dsid": "ada6363e-8706-11e8-b2f9-0242ac12001e", 
  "metadata": {
    "@context": {
      "@vocab": "http://docs.datalad.org/schema_v2.0.json"
    }, 
    "annex": {
      "key": "MD5E-s592833--8d37e298f8afe55cfd04a19870399198.nii.gz"
    }, 
...
mih commented 5 years ago

Here is a reproducible demo to develop a test:

datalad rev-create 3055
datalad install -d . -s https://github.com/datalad/testrepo--minimalds.git sub
datalad aggregate-metadata --update-mode target -r
# fails to access
datalad metadata --reporton datasets sub
# remove subds in question
datalad uninstall sub
# works now
datalad metadata --reporton datasets sub
mih commented 5 years ago

The core issue is found here: https://github.com/datalad/datalad/blob/master/datalad/metadata/metadata.py#L953-L978

A superdataset will not be queried for metadata of a subdataset or its content when the subdataset is present.

But I think it is actually worse, as the flow in the query procedure has a somewhat limited ability to deal with this.

  1. Paths are sorted under locally present datasets
  2. For each discovered dataset the closest dataset (upwards) is asked for metadata, if that doesn't have it, the query fails

Datasets are processed in the order in which path annotation finds them, which limits possibilties to change this behavior (if this should be kept -- it was a requested thing at some point).

Possible fixes:

The second and third approach have their benefits, but likely the third is the viable path forward, as it avoid required aggregation into superdatasets for querying hierarchies of datasets. However, it is more messy then the second approach, as only this one is guaranteed to give identical results regardless of subdataset presence.

yarikoptic commented 5 years ago

I could be totally wrong (my understanding of the flow/logic in this code is not yet superb), but couldn't it be just a matter of extending the check here with either that dataset contains any aggregated metadata (i.e. smth like

if op.lexists(op.join(ap['path], agginfo_relpath)):
    to_query = ap['path']
else:
    lgr.info("Dataset %(path)s is installed, but lacks aggregated metadata. Querying superdataset", ap)
    to_query = ap['parentds']
yarikoptic commented 5 years ago

More complete:

            if ap.get('state', None) == 'absent' or \
                    ap.get('type', 'dataset') != 'dataset':
                # this is a lonely absent dataset/file or content in a present dataset
                # -> query through parent
                # there must be a parent, otherwise this would be a non-dataset path
                # and would have errored during annotation
                to_query = ap['parentds']
            else:
                to_query = ap['path']

to

            if ap.get('state', None) == 'absent' or \
                    ap.get('type', 'dataset') != 'dataset':
                # this is a lonely absent dataset/file or content in a present dataset
                # -> query through parent
                # there must be a parent, otherwise this would be a non-dataset path
                # and would have errored during annotation
                lgr.info("Dataset %(path)s is not installed. Querying superdataset", ap)
                to_query = ap['parentds']
            elif ap.get('type', 'dataset') == 'dataset' and not op.lexists(op.join(ap['path], agginfo_relpath)):
               lgr.info("Dataset %(path)s is installed, but lacks aggregated metadata. Querying superdataset", ap)
               to_query = ap['parentds']
            else:
                to_query = ap['path']
mih commented 5 years ago

This would fix the immediate issue demo'ed above, but there is no guarantee that the 'parentds' has the metadata and not the parent of the parent or even further away.

yarikoptic commented 5 years ago

I was digging into this, although probably useless in light of https://github.com/datalad/datalad-revolution/pull/84 , I just hoped that there would be a quick fix for me... The major blow (and possibly a "hint on workaround") is actually the difference between calls metadata subds and metadata -d . subds. In the realm of discussion datalad/datalad#3230 - in the former (no -d) case the "context" of path annotation completely switches over into subds, so there is no information about parentds in ap record, and thus it is not possible to query it.

mih commented 5 years ago

FTR: After some thinking in https://github.com/datalad/datalad-revolution/pull/84 I conclude that there is no single "best" way to decide which metadata to report without knowledge about the context of such a query. Consequently, metadata() is abandoned and replaced by two dedicated commands:

  1. extract_metadata() - reports metadata from dataset content by running extractors. Metadata is not placed into a dataset, and no potentially existing aggregated metadata is investigated.
  2. query_metadata() - reports based on aggregated metadata from a given (or current) dataset. No extractors are invoked, no searching for "better" metadata in any subdataset is performed.
yarikoptic commented 5 years ago

Why plain metadata couldn't just correspond to query_metadata?

mih commented 5 years ago

It could, but current metadata() does something that is not easily verbalizable, and I have no idea which aspects have any real purpose. IMHO, metadata() can simply be dropped, or used to implement whatever more complex or automagic behavior is desirable.