Open yarikoptic opened 6 years ago
If I get this right, the issue is about reporting directory-level metadata. Any search implementation is free to build partial metadata representations and index documents of types other than `file` or `dataset`. On the aggregation side of things, adding a directory layer seems to lead to additional duplication of information, beyond the duplication that is already happening in the flattened file metadata view of the dataset.
For this particular case here, I wonder if the ImageSeries report of the dicom extractor isn't already providing a meaningful summary?
The `dataset` and `file` notions of report types are hardcoded:
--report-type {dataset,file}
Maybe there is a way to generalize this in the "default" search implementation? Or maybe we could just establish a way to group documents within the query and/or the results somehow?
E.g. by a `(dataset-uuid, subject-id)` pair. Grouping for querying would probably be tricky, since it would most likely need custom index creation or even custom aggregation. But grouping of results should be quite doable.
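To make the "grouping of results" idea a bit more concrete, here is a rough sketch of post-hoc grouping over flat result records. The record layout and the key names `dataset_uuid` and `subject_id` are assumptions made up for illustration; they are not the field names of any existing search implementation.

```python
from collections import defaultdict


def group_results(results, keys=("dataset_uuid", "subject_id")):
    """Group flat result records by the tuple of values found under `keys`.

    `results` is assumed to be an iterable of dicts (hypothetical layout).
    """
    groups = defaultdict(list)
    for rec in results:
        # A record lacking one of the keys gets None in that position.
        group_key = tuple(rec.get(k) for k in keys)
        groups[group_key].append(rec)
    return dict(groups)


# Hypothetical per-file results, as a search might return them
results = [
    {"dataset_uuid": "ds-1", "subject_id": "sub-01", "path": "a/1.dcm"},
    {"dataset_uuid": "ds-1", "subject_id": "sub-01", "path": "a/2.dcm"},
    {"dataset_uuid": "ds-1", "subject_id": "sub-02", "path": "b/1.dcm"},
]
grouped = group_results(results)
for key, recs in grouped.items():
    print(key, [r["path"] for r in recs])
```

Since this only touches the results, it needs no changes to the index; grouping at query time would be a separate (and harder) problem.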
What is the problem?
Some datasets potentially aren't getting split into proper subdatasets for one reason (distributed as a single tarball with all those subdirectories) or another ("didn't think about it"). E.g. we have http://datasets.datalad.org/?dir=/dicoms/rosetta/ where theoretically subdatasets could be at the 2nd level of subdirectories, each corresponding to a sample from a particular scanner. When we do queries/reporting, we can only report per file or per dataset, not per directory.
I wondered if it could somehow be allowed for some directories to provide their own level of aggregation (indicated e.g. by having a `.datalad/` subdir, possibly with a `.datalad/config` which would prescribe extractors). Note that a dataset is also a directory ;)