datalad / datalad-metalad

Next generation metadata handling

aggregate-metadata --incremental? (AKA --inner-since) #166

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

What is the problem?

Relevant:

We talked about it before (and in the recent proposal), but I think there is no dedicated issue for the discussion. Now that I am fetching a few TBs of data for datasets whose content I had dropped (we cannot keep all of it locally all the time; that wouldn't scale), I see the need for metadata aggregation to become "incremental" in its default mode of operation. Not only could we use --since to skip re-aggregation of datasets which didn't change since the last aggregation, but a similar analysis should be done on the per-file level, considering the diff between revisions. In general this could end up being part of the --since operation within the dataset (hence --inner-since in the title ;-)), but I wanted to summarize it in a separate issue since its implementation is trickier.

So it would be nice if extractors could be aware of whether all of the data needs to be re-aggregated (an extractor-specific version within the extracted metadata file? ;-) ), and skip those files which were already aggregated and whose content might no longer be locally available. We could add a config setting or an option to fetch the content needed to get metadata extracted/updated (most likely that content would still be locally available, since re-aggregation would typically happen shortly after new content is added).
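A minimal sketch (not datalad code) of what such a per-file "--inner-since" analysis could look like, assuming the revision of the last aggregation is recorded somewhere and plain git diff is used to find extraction candidates:

```python
import subprocess


def files_needing_reextraction(repo_path, last_aggregated_rev, current_rev="HEAD"):
    """Return paths added or modified between the last aggregated revision
    and the current revision -- only these would need (re)extraction."""
    out = subprocess.run(
        ["git", "-C", repo_path, "diff", "--name-status",
         last_aggregated_rev, current_rev],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = []
    for line in out.splitlines():
        status, _, path = line.partition("\t")
        # 'A' = added, 'M' = modified; deletions ('D') would instead require
        # removing the corresponding metadata records.
        if status and status[0] in ("A", "M"):
            changed.append(path)
    return changed
```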

Support for incremental operation would be trivial for some extractors (all the file-based ones, nifti1 etc.) and not as trivial for those which consider the file layout to be part of the metadata (bids).

That could potentially make (re)aggregation feasible with each commit.

Additional points which came to mind:

mih commented 6 years ago

re "additional point": we currently rely on the timestamp of the last change of any annex metadata record. this should be sufficient from my POV. Please check aggregate.py:_extract_metadata() for the procedure on how the metadata object IDs are built.

mih commented 6 years ago

> a similar analysis should be done on the per-file level, considering the diff between revisions

I cannot say that I am against the notion that it "should" happen, but in practice this will be a nightmare. Individual extractors use the joint file-based metadata to build dataset-level metadata (e.g. dicom), and we automatically produce 'unique' metadata summaries across all metadata. A file-based incremental aggregation would require sophisticated analysis of whether individual files contributed to those summaries prior to re-aggregation. Alternatively, we would be forced to keep the file-based metadata around for performing such an analysis. I don't see myself implementing such a feature.
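To illustrate the asymmetry (a toy example, not datalad code): adding a file to a 'unique values' summary only needs the new file's metadata, but removing or changing a file forces a rebuild from the per-file records, which therefore have to be kept around or re-extracted.

```python
def add_file(summary, file_metadata):
    # Cheap: the union of observed values can be updated from the new file alone.
    for key, value in file_metadata.items():
        summary.setdefault(key, set()).add(value)
    return summary


def remove_file(all_file_metadata, removed_path):
    # Expensive: the summary must be rebuilt from all remaining per-file
    # records, since we cannot tell which values only the removed file contributed.
    summary = {}
    for path, file_metadata in all_file_metadata.items():
        if path == removed_path:
            continue
        for key, value in file_metadata.items():
            summary.setdefault(key, set()).add(value)
    return summary
```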

From my POV none of these changes would "make (re)aggregation feasible with each commit". It would always imply that the initial cost of a full aggregation has to be paid (immediately). Many (most) of the datasets that I work with on a daily basis have no aggregated metadata, and likely never will, because I use them as a VCS and not as a database. Even for a use case in a DB setting, I'd argue that any aggregation cost is better paid prior to publication, not each time anything changes.

mih commented 6 years ago

Lastly: --since vs --skip-aggregated

I think the latter is more useful than the former. Assuming constant extractor behavior, we can already objectively determine whether there is any chance of the metadata having changed (we track the object IDs), independent of whether the dataset as a whole changed (i.e. --since). Moreover, I think --skip-aggregated should be the default, and a --force flag should override this default behavior.
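A rough sketch of what such a default skipping behavior could amount to (the record layout and names are hypothetical, not the actual aggregate implementation):

```python
def datasets_to_reaggregate(aggregation_records, current_object_ids, force=False):
    """Select datasets for re-aggregation under a hypothetical
    '--skip-aggregated' default.

    aggregation_records: {dataset_path: recorded metadata object ID}
    current_object_ids:  {dataset_path: freshly computed metadata object ID}
    """
    if force:
        # --force overrides the skipping and re-aggregates everything.
        return sorted(current_object_ids)
    # Otherwise only datasets whose tracked object ID changed are touched.
    return sorted(
        path for path, obj_id in current_object_ids.items()
        if aggregation_records.get(path) != obj_id
    )
```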

This leaves the issue of metadata extractor versions. Current master includes the global datalad version in the record. We have to keep in mind that individual aggregation records can come from different versions of datalad, even within a single dataset hierarchy. I think keeping track of that is important.

Beyond that, I think practical constraints limit the utility of additional version info. Many (most) of our extractors are frontends for 3rd-party software. For example, I don't know whether tracking the version of exempi is sufficient to guarantee constant output of the xmp extractor. Hence, I see little benefit in including extractor versions in addition to the datalad version.
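For illustration only, a minimal sketch of what tracking just the global datalad version in an aggregation record could look like (the record layout is hypothetical; it assumes the installed datalad package exposes __version__):

```python
import datalad


def make_aggregation_record(extracted_metadata):
    # Hypothetical record layout: only the global datalad version is tracked,
    # not per-extractor or 3rd-party library versions.
    return {
        "datalad_version": datalad.__version__,
        "metadata": extracted_metadata,
    }


def extracted_with_current_version(record):
    # Records within one dataset hierarchy may originate from different
    # datalad versions, so a consumer has to check this per record.
    return record.get("datalad_version") == datalad.__version__
```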

mih commented 6 years ago

Please see datalad/datalad#2326 for a related RF.

christian-monch commented 2 years ago

With the new metalad implementation, the issue of incremental updates of stored metadata has come into focus again. A discussion was started in the context of metadata extraction for the datalad-debian project. See the post that kicked off the discussion:

https://github.com/psychoinformatics-de/datalad-debian/issues/30#issue-1242862383

And the following posts:

https://github.com/psychoinformatics-de/datalad-debian/issues/30#issuecomment-1132711284
https://github.com/psychoinformatics-de/datalad-debian/issues/30#issuecomment-1132737440
https://github.com/psychoinformatics-de/datalad-debian/issues/30#issuecomment-1132754906
https://github.com/psychoinformatics-de/datalad-debian/issues/30#issuecomment-1132763963

Short summary of the posts:

Problem

Let us assume we have dataset-level metadata that is extracted by visiting all files of a dataset (here: metadata for datalad-debian datasets). If such dataset-level metadata is present and the dataset is modified, e.g. by adding or modifying one file, how can we update the existing dataset-level metadata without re-extracting all of it?

Possible solutions

My suggestion was to read the existing metadata via meta-dump, add the metadata for the new file, and store the metadata again, i.e. a read-modify-write cycle. This could be done in one of the following ways (a rough sketch follows the list):

  1. The extractor has recorded the last state in the metadata and is able to determine which files were added or changed. In this case, it just reads the existing metadata, extracts metadata from the added files, merges that into the previously read metadata, and stores the result again with an updated last-extraction-state.

  2. The extractor has an optional parameter that allows specifying the set of files that should be extracted. This requires an external entity that determines which files should be considered by the extractor. Again, the extractor reads the current metadata, modifies it with the newly extracted metadata, and writes it back. It should be noted that this parameter should not appear in the result of the get_state()-method. Otherwise, multiple metadata records might be stored and it might be difficult to determine which ones are relevant, i.e. recent.
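A rough sketch of the read-modify-write cycle behind option 1, driving the datalad CLI from Python. The exact meta-dump/meta-add invocations and the record layout ("extracted_metadata", "files", "last_extraction_state") are assumptions for illustration; consult the metalad command help for the actual interface.

```python
import json
import subprocess


def current_rev(dataset_path):
    return subprocess.run(
        ["git", "-C", dataset_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


def update_dataset_level_metadata(dataset_path, new_file_metadata):
    # Read: dump the currently stored dataset-level record
    # (assuming one JSON record per output line).
    dumped = subprocess.run(
        ["datalad", "meta-dump", "-d", dataset_path],
        capture_output=True, text=True, check=True,
    ).stdout
    record = json.loads(dumped.splitlines()[0])

    # Modify: merge in metadata for the newly added/changed files and
    # advance the recorded last-extraction state.
    extracted = record.setdefault("extracted_metadata", {})
    extracted.setdefault("files", {}).update(new_file_metadata)
    extracted["last_extraction_state"] = current_rev(dataset_path)

    # Write: store the updated record again
    # (assuming meta-add accepts a record on stdin via "-").
    subprocess.run(
        ["datalad", "meta-add", "-d", dataset_path, "-"],
        input=json.dumps(record), text=True, check=True,
    )
```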

It should be pointed out that this discussion is mostly relevant to updating dataset-level metadata. File-level metadata can be manipulated individually: if a metadata object is created by inspecting only a single file, it can easily be updated in isolation when that file changes.