"directory" level for Metadata?

yarikoptic commented 6 years ago

What is the problem?

Some datasets may not get split into proper subdatasets, for one reason (distributed as a single tarball with all those subdirectories) or another ("didn't think about it"). E.g. we have http://datasets.datalad.org/?dir=/dicoms/rosetta/ where, in theory, subdatasets could sit at the 2nd level of subdirectories, each corresponding to a sample from a particular scanner. When we do queries/reporting, we can only report per file or per dataset, not per directory.
I wondered whether some directories could be allowed to provide their own level of aggregation (indicated e.g. by having a .datalad/ subdir, possibly with a .datalad/config that would prescribe extractors). Note that a dataset is also a directory ;)

mih commented 6 years ago

If I get this right, the issue is reporting of directory-level metadata. Any search implementation is free to build partial metadata representations and index documents of types other than file or dataset. On the aggregation side of things, adding a directory layer seems to lead to additional duplication of information, beyond the duplication that is already happening in the flattened file metadata view of the dataset.

For this particular case here, I wonder if the ImageSeries report of the dicom extractor isn't already providing a meaningful summary?

yarikoptic commented 6 years ago

Indeed, explicit aggregation "per directory" would result in duplication, and in the necessity to specify that things should be aggregated at that particular directory or level in the path.
Re search implementations for other types: although they could, we currently have only the dataset and file report types hardcoded:
```
--report-type {dataset,file}
```

Maybe there is a way to generalize this in the "default" search implementation? Or maybe there could be a way to group documents within a query and/or its results?

- search at the dataset level could be considered (if we forget about possible multiple installations of datasets) just a search across documents, grouping by the datalad dataset UUID
- search at the subject level is grouping based on the (dataset-uuid, subject-id) pair
- search across DICOMs could be asked to group by any of the InstanceUIDs (Study, Series), depending on what is desired. E.g. in the context of https://github.com/datalad/example-dicom-structural/issues/1 I wanted to see how SeriesDescription and ProtocolName differ across available data. Reporting it "per file" is "suboptimal" (huge number of files etc).

Grouping for querying would probably be tricky, since it would most likely need custom index creation or even custom aggregation. But grouping of results should be quite doable.
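
To illustrate what grouping of results could look like, here is a minimal sketch in plain Python. It is not datalad's search API: the record layout (`dsid`, `path`, `metadata` keys), the sample values, and the helpers `group_results`, `by_directory` and `distinct_values` are all hypothetical, made up for this example. The idea is only that any key function (dataset id, a (dataset, subject) pair, a DICOM UID, or a directory at some depth) can define the grouping after the per-file results are in.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_results(results, key):
    """Group per-file result records by whatever key(record) returns."""
    groups = defaultdict(list)
    for rec in results:
        groups[key(rec)].append(rec)
    return dict(groups)


def by_directory(depth):
    """Return a key function that maps a record to the first `depth`
    directory components of its (relative, POSIX-style) path."""
    def key(rec):
        parts = PurePosixPath(rec["path"]).parent.parts
        return "/".join(parts[:depth])
    return key


def distinct_values(records, field):
    """Collect the distinct values of one metadata field within a group."""
    return sorted({rec["metadata"][field]
                   for rec in records if field in rec["metadata"]})


# Hypothetical per-file records; the field names are assumptions.
results = [
    {"dsid": "ds-1", "path": "siteA/scanner1/0001.dcm",
     "metadata": {"SeriesDescription": "T1w", "ProtocolName": "anat"}},
    {"dsid": "ds-1", "path": "siteA/scanner1/0002.dcm",
     "metadata": {"SeriesDescription": "T1w", "ProtocolName": "anat"}},
    {"dsid": "ds-1", "path": "siteA/scanner2/0001.dcm",
     "metadata": {"SeriesDescription": "T2w", "ProtocolName": "anat"}},
    {"dsid": "ds-2", "path": "siteB/scanner1/0001.dcm",
     "metadata": {"SeriesDescription": "bold", "ProtocolName": "func"}},
]

# Group at the 2nd directory level, then summarize each group.
per_dir = group_results(results, by_directory(2))
for directory, recs in sorted(per_dir.items()):
    print(directory, len(recs), distinct_values(recs, "SeriesDescription"))

# The same helper groups by dataset id instead of by directory.
per_dataset = group_results(results, lambda rec: rec["dsid"])
```

Swapping the key function for something like `lambda rec: rec["metadata"]["SeriesInstanceUID"]` would give per-series grouping in the same way, assuming such a field is present in the records.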

datalad / datalad-metalad

"directory" level for Metadata? #168
