Use git-notes instead of per-branch metadata?

mih commented 4 years ago

ATM we put metadata in a (or each) branch. This might have been an implicit choice, because for a while there were only dataset with a single main consumption branch. Now we have datasets like the UKB one that have multiple equally relevant branches.

We should make a conscious decision whether or not to move all metadata into a separate (git-annex like) branch.

yarikoptic commented 4 years ago

I think we should indeed support having metadata on some dedicated branch(es?). Another use case would be - a super dataset with heavy tree of subdatasets. Users who are not even interested in metadata would need to populate their file tree with all the dangling symlinks etc.
"packing" strategy (which is also being discussed/WiP) of metadata files also might be what could mitigate this issue. But now I wonder if "custom packing" is needed at all in case if we rely on "git tree" of that metadata branch to "pack" things up neatly for us, while we would still just keep metadata per dataset in separate files. git annex get on keys corresponding to metadata files (in the branch) would be used to get them -- so we would just need to provide a thin shim to "get content for a file (might be in git or annex) in that metadata branch".

Some immediate thoughts:

"merge" of the metadata branch needs to be implemented by datalad update
- I think we can keep .datalad/metadata/aggregate_v1.json (or whatever that index is) within the "original" branch (as is now) and then have metadata branch just contain those .datalad/metadata/objects/. Then upon merge we would need to re-aggregate metadata anyways I guess since new committish would be different etc.
git annex unused would need to be used with more caution to not drop referenced metadata

mih commented 4 years ago

There is an argument to be made for a full move of all metadata related content to a dedicated branch: In that case, a metadata update is transparent to a super/sub dataset relationship. If there would not be a dedicated git-annex branch that can change at any time, we would go nuts updating submodule records. The same is doomed to happen with metadata, unless with switch to a dedicated branch and keep everthing in it.

bpoldrack commented 4 years ago

Sounds reasonable, but needs at least internal helpers to commit something into a not checked out branch. We don't want to switch checkouts to save a new metadata aggregation, I think. It can be very expensive and it's a mess if something goes wrong and we leave the user with the "wrong" checkout.

yarikoptic commented 4 years ago

yeap, I think we might benefit from helpers also for

.get(path, branch=BRANCH) - i.e. teaching our get to operate on paths in the branch (get the locked or unlocked key and pass it into annex get?)
git show BRANCH:path - to dump content of the file in a branch.

christian-monch commented 4 years ago

I think it is a great idea.

mih commented 4 years ago

So merge conflict will be an issue. But not for individual metadata objects. The are already stored with the own content hash as filename. So if, and only if, metadata are identicial, they will end of in the same file.

But the equivalent of aggregate.json also needs to move into such a branch, and track all metadata from all branches, and must not conflict on merge.

I think this needs a really clever idea re structure, and it is unlikely to be JSON

yarikoptic commented 4 years ago

But the equivalent of aggregate.json also needs to move into such a branch, and track all metadata from all branches ...

why "move"? or be present there at all? IMHO handling of those metadata objects should be similar to how git annex unused works, so at some point someone does maintenance and prunes those are no longer used by any branch of interest.

As for merge conflicts -- I think it should be a job of datalad update to "merge" the aggregates from multiple branches. The naive merge would be -- drop them all and issue log message that metadata needs to be re-extracted/aggregated. Smartish one - at least could pick up correct ones per subdataset where update did not cause any change of the corresponding subdataset. In case of any actual merge in subdataset happening -- metadata would need to be re-extracted/aggregated again anyways I guess.

mih commented 4 years ago

But the equivalent of aggregate.json also needs to move into such a branch, and track all metadata from all branches ...

why "move"?

It has to move, because keeping information that potentially changes with each aggregation in a user-branch, invalidates the advantage of not having to update superdatasets.

or be present there at all? IMHO handling of those metadata objects should be similar to how git annex unused works, so at some point someone does maintenance and prunes those are no longer used by any branch of interest.

I am not talking about the metadata objects themselves, for those a mechanism like "unused" already exists. I am talking about the knowledge base that this unused mechanism operates on. If we employ the analog "meta data extractor" == "special remote" for a second: git-annex keeps information on the special remote parameterization in the git-annex branch, and so we will have to do as well. ATM this is all in aggregate.json in the user-branch, and it cannot stay there for obvious reasons, but it also cannot be abandoned.

As for merge conflicts -- I think it should be a job of datalad update to "merge" the aggregates from multiple branches. The naive merge would be -- drop them all and issue log message that metadata needs to be re-extracted/aggregated. Smartish one - at least could pick up correct ones per subdataset where update did not cause any change of the corresponding subdataset. In case of any actual merge in subdataset happening -- metadata would need to be re-extracted/aggregated again anyways I guess.

Again, my concern is not about the metadata objects, but about the linkage between user-branch, extractor information, and metadata objects. it is conceivable that an update would pull in metadata objects generate by different extractor configuration. Suggesting re-extraction is simple, but impossible to implement in practice in virtually all but a "I have the world's data locally" scenario. I am not saying that it is impossible, but it is not trivial.

After some more thinking, I believe the key question to answer is the following: If all metadata lives in a single manages branch, how will we support the constrained publication of datasets. ATM I can have a branch with private information and a public branch, and the only thing that leaks when I publish the public one, but not the private one, are the annex keys of the metadata objects. Pushing a subset of metadata objects is easy by pushing just files in a single branch. How would that work with a single metadata branch?

christian-monch commented 4 years ago

Just my 2 cents w.r.t. merge-conflicts resolution for the top-level index, i.e. currently in "aggregate_v1.json", which basically consists of records containing dataset-paths, extractor-info, generic dataset-info, and one pointer to a dataset-level metadata file and one pointer to a file-level metadata file.

We assume that metadata files do not collide. That means that adding different metadata files, will not affect existing metadata files. Therefore we do not expect conflicts on the metadata file strata and can ignore those for now.

If I understand correctly, we expect the following conflict types for the index:

record deleted in our commit, modified in your commit
record modified in our commit, deleted in your commit
record modified in both commits (with different results)

Modifications are not random, but would typically entail changes in the extractor configuration or in the generic dataset info, as @mih pointed out, and as a result probably a change in the metadata file names.

I think one possible way to resolve those conflicts automatically would be:

1 Use a line based index-format

Use the union-merge strategy
Run a post-merge process that combines duplicated (identified by identical dataset-paths) entries.

One problem in step 3 is to decide which extract result to use if one extractor is employed in different versions or with different parameters. A technical solution would be to keep both descriptions and both results around, for example, by associating them through an index in the combined record, e.g. "metalad_core[0]" and "metalad_core[1]". Although that should work, we would end up with two different metadata sets from one extractor-type, and we would have to determine which metadata set should be kept, or whether both should be kept. This requires additional information and we could emit a warning and leave it to the user to decide.

christian-monch commented 4 years ago

[...] e.g. "metalad_core[0]" and "metalad_core[1]" [...]

Or maybe use a hash over a unified representation of the configuration, e.g. metalad_core.923efa[...]98a

mih commented 4 years ago

I think switching the index to line-based makes sense. It should likely be accompanied by a uniform move of all pieces (file-based metadata already is JSON lines, dataset-level metadata should become that, and the index too).

mih commented 4 years ago

I think I am also not convinced that a branch is a useful tool in this context. We are not interested in the history of metadata (latest is almost always greatest), and especially in large datasets each large changeset will be an issue (even if it is just annex keys changing all the time).

Maybe a set of git entities (blobs? commits?) that is rebuilt and replaced on merge is a better concept here.

Need to read about "notes" again. It has many or all pieces we might need (merges, file tree management, garbage collection when the annotated entities are gone, etc.), Q is where the blobs end up, whether we can prevent unconditional fetches, etc. Another big Q is where to attach notes on subdatasets (when aggregating upwards). Simplest would be to put EVERTHING associated with a subdataset and its content into a single note, but I doubt this would scale. But maybe we can have a git-native representation of an indirect 2nd+ level subdataset somehow, and attach it there. Or we explore the possibility to maintain a dedicated notes-ref for each subdataset (including indirect ones).

Edit: I played with notes a bit. Some oddities (cannot get them to prune, as I think they should), but in general the overlap with our requirements is massive. Worth exploring in full.

Notes are not fetched by default. But they can be (no surprise) fetched ref by ref.

mih commented 4 years ago

But maybe we can have a git-native representation of an indirect 2nd+ level subdataset somehow

Following the example in the docs:

$ cc *.c
$ blob=$(git hash-object -w a.out)
$ git notes --ref=built add --allow-empty -C "$blob" HEAD

We could write (what is now) the aggregate_v1.json record of a subdataset into the target dataset's object database, and attach the actual metadata as a note to it.

This approach would limit our custom tracking to these objects, that we can rediscover by hash. If I am not missing something this would make a full implementation pretty lean. We could split the notes refs by depth, so we can fine-tune the fetching.

yarikoptic commented 4 years ago

Some oddities (cannot get them to prune, as I think they should)

As (was) with submodules, I think notes feature is not that popular, so I won't be surprised it to have more bugs than core git functionality - might be worth checking with git people when running into something unexpected.

As for not needing history - benefit of history on text files is diff between states and thus efficient objects store. I think that generally we do want to make it possible to access metadata for any/previous releases/states of the dataset.

mih commented 4 years ago

As for not needing history - benefit of history on text files is diff between states and thus efficient objects store. I think that generally we do want to make it possible to access metadata for any/previous releases/states of the dataset.

This probably refers to my thoughts before git-notes, with git-notes metadata would of course stay attached as long as the object it describes still exists. However, the big difference is that we can distinguish between a change in metadata due to the extractor output, and a change in metadata due to a change in the stuff tracked by git (even if only at the level of the primary datasets).

mih commented 4 years ago

Long discussion on a branch-less, potentially notes-based metadata representation. Summary to follow. But a brief post-hoc realization to record (courtesy of @bpoldrack):

When we no longer have an unambiguous association of metadata to data anymore, we loose reproducibility for any process that involves metadata queries. This situation isn't new. For example, git-annex metadata can change without a change in a user-facing branch, right now.

To compensate for that, why might need to consider to (optionally) annotate run-commits with the specific state of metadata objects.

christian-monch commented 3 years ago

Commonly agreed development has moved away from notes.

datalad / datalad-metalad

Use git-notes instead of per-branch metadata? #60