datalad / datalad-metalad

Next generation metadata handling

"directory" level for Metadata? #168

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

What is the problem?

Some datasets potentially aren't getting split into proper subdatasets for one reason (distributed as a single tarball with all those subdirectories) or another ("didn't think about it"). E.g. we have http://datasets.datalad.org/?dir=/dicoms/rosetta/ where theoretically subdatasets could be at the 2nd level of subdirectories, each corresponding to a sample from a particular scanner. When we do queries/reporting, we can only report on a file or a dataset, not per directory.
I wondered if it could somehow be allowed for some directories to provide their own level of aggregation (indicated e.g. by having a .datalad/ subdir, possibly with a .datalad/config which would prescribe extractors). Note that a dataset is also a directory ;)

mih commented 6 years ago

If I get this right, the issue is reporting of directory-level metadata. Any search implementation is free to build partial metadata representations and index documents of types other than file or dataset. On the aggregation side of things, adding a directory layer seems to lead to additional duplication of information, beyond the duplication that is already happening in the flattened file metadata view of the dataset.

For this particular case here, I wonder if the ImageSeries report of the dicom extractor isn't already providing a meaningful summary?

yarikoptic commented 6 years ago

Maybe there is a way to generalize this in the "default" search implementation? Or maybe there could be a way to group documents within the query and/or the results somehow?

Grouping for querying would probably be tricky, since it would most likely need custom index creation or even custom aggregation. But grouping of results should be quite doable.
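
To illustrate the results-side grouping, here is a minimal sketch in plain Python. It assumes each result is a dict with a relative POSIX-style "path" key; the record layout, the directory and file names, and the `group_by_directory` helper are all made up for illustration and are not part of any DataLad API.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_directory(results, depth=2):
    """Group result records by the first `depth` directory components of their path."""
    groups = defaultdict(list)
    for record in results:
        # Keep directory components only; the file name itself is dropped.
        parents = PurePosixPath(record["path"]).parent.parts
        # Files at the top level have no parent components; file them under ".".
        key = "/".join(parents[:depth]) or "."
        groups[key].append(record)
    return dict(groups)


# Hypothetical records; names are invented for this example.
results = [
    {"path": "vendorA/scanner1/series1/0001.dcm", "type": "file"},
    {"path": "vendorA/scanner1/series2/0001.dcm", "type": "file"},
    {"path": "vendorB/scanner7/series1/0001.dcm", "type": "file"},
    {"path": "README", "type": "file"},
]

grouped = group_by_directory(results, depth=2)
for directory, records in sorted(grouped.items()):
    print(directory, len(records))
```

With depth=2 this would produce one group per second-level subdirectory, which is the level at which the rosetta samples could have been subdatasets. It only post-processes results, so it needs no change to the index or to aggregation.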