meta-aggregate: Unnecessary forced extraction

bpoldrack commented 5 years ago

Discovered via https://github.com/psychoinformatics-de/datalad-hirni/issues/117

I have a subdataset that already has up-to-date metadata aggregated in itself. The superdataset has no metadata yet whatsoever. Now, datalad meta-aggregate sub/ --into top on the superdataset, leads to forced re-extraction of metadata in the subdataset, instead of just getting the already available metadata.

Working on providing a test (and hopefully a fix).

bpoldrack commented 5 years ago

So, after reproducing the issue with a test in PR #21 , I'm still trying to fix it. Turns out to not be easy to get a grasp on how aggregate works, mostly because a lot of its logic is kind of implicit.

I tried to iron that out by starting to refactor _do_top_aggregation. The aim is to make the logic more explicit as well as basically rewriting the important part to cover those cases as well. As of now, I failed (breaking other tests as soon as I get the new pone to pass). There are several issues in understanding implications and logic around that code block.

Generally AFAICT there wasn't any code path to do, what we ant to see in that case. It's not just some incomplete conditions. Determination of whether or not extraction is needed was solely based on the state of things as they recorded in the top dataset. The state of the source dataset wasn't considered at all. This is why I introduced to also diff against the recorded state in source dataset and check whether it has not up-to-date extractor config, both of which was done for top dataset only. So, my understanding is that the logic should first check whether top's record is somehow outdated. If so, we know we need to update it but not yet where from. In that case figure whether source dataset has a record and whether that one is outdated as well. Then we know, whether we need to extract or we can use the record in source.

The first thing is to understand what exactly extract_from_ds is representing. In particular when exactly the values contain a Dataset and when a Path instance. There are comments about that in __call__ itself, but I don't quite get it. This is mostly about the cleaning of it and its special treatment if aggregate was called with --recursive. This is closely related to the determination of use_self_aggregate, which can only ever be true, if it was recursive. It seems a bit like a special case treatment in a recursion where only an intermediate dataset is actually available, but I think this notion is close but not yet quite right. I can't include that special case in a different way as long as I don't exactly know what that special case is. Also it's not entirely clear what its effect is, since it's only used to not do things it remains not entirely clear what it really does. This includes its naming. What is "self"? top, src, anything?

After possible extraction of metadata, comments talk about "aggsubjs", which is not a variable, and when to treat them how. This is not entirely clear to me as well. There are at least two aspects to it: a) the path matching against what is in top's agginfo. Why would you throw away what's not yet in top? So, when looking at that with Michael, it seemed to make sense to match against src instead. However, that didn't work (in that it caused other tests to fail). So, I think it has to be either top or src depending on ... well, what exactly? ;-)

Last crucial bit: What I mostly broke with several trials to solve it was the removing of obsolete things from top's metadata. Something with the logic about that evades me. Might well be that understanding this is actually the only crucial piece missing for a fix.

Please also note, that there is a temporary commit to look at (https://github.com/datalad/datalad-metalad/pull/21/commits/26a56551c665380362f13022030cacdd602fc7d8)

adswa commented 4 years ago

I think I found a solution. Will issue a PR on top of yours shortly.

datalad / datalad-metalad

meta-aggregate: Unnecessary forced extraction #20