Make "populate" more efficient and not consider datasets already "populated"

yarikoptic commented 2 years ago

As we now have hundreds of dandisets and thousands of zarrs, it is wasteful, and might eventually become prohibitive, to run the annex move command in each one of them even when nothing has changed since the last run. I see two possible approaches:

1. Centralize knowledge on candidates for populate{,-zarr}: whenever any of them is fully populated (i.e. the move command ran and found nothing to move), add its path to a registry. The populate* commands would consult the registry to skip datasets known to be fully populated. The update command would remove a dataset from the registry if it incurred any change. Cons: needs some centralized DB guarded by an interprocess/thread lock; possible race condition (that is why I propose adding an entry only when there was "nothing to move" and not when a "move has just completed", to minimize the raciness).
   - centralize within .git/dandi/state.yaml of the superdataset, with records for dandisets and dandizarrs; no need for that file to be under git VCS
2. Rely on some fscacher-like fingerprinting of .git/heads, so if we already moved and there are no changes to .git/heads, there is no need to move again. Cons: would still need some persistent storage, would need some gc over it, and would overall be slower since it requires filesystem traversal through all candidate datalad datasets.
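
To make approach 1 more concrete, here is a minimal sketch of such a registry, not an actual implementation. Assumptions: it stores JSON rather than YAML to stay within the standard library, it uses `fcntl.flock` (POSIX only) for the interprocess lock, and the names `locked_registry`, `mark_populated`, `mark_changed` and `needs_populate` are hypothetical.

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_registry(state_file: Path):
    """Yield the set of fully populated dataset paths under an exclusive lock.

    The set is written back to disk only if the caller's block finishes
    without raising.
    """
    state_file.parent.mkdir(parents=True, exist_ok=True)
    lock_path = state_file.with_name(state_file.name + ".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            if state_file.exists():
                populated = set(json.loads(state_file.read_text()))
            else:
                populated = set()
            yield populated
            state_file.write_text(json.dumps(sorted(populated)))
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


def mark_populated(state_file: Path, dataset: str) -> None:
    """Record a dataset only after a move run found nothing left to move."""
    with locked_registry(state_file) as populated:
        populated.add(dataset)


def mark_changed(state_file: Path, dataset: str) -> None:
    """Called by the update step: the dataset must be considered again."""
    with locked_registry(state_file) as populated:
        populated.discard(dataset)


def needs_populate(state_file: Path, dataset: str) -> bool:
    """True if the populate commands should still visit this dataset."""
    with locked_registry(state_file) as populated:
        return dataset not in populated
```

In this sketch, populate would call `needs_populate` before running the move and `mark_populated` only when the move reported nothing to do, while update would call `mark_changed` for every dataset it modifies.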

~~@jwodder -- any other ideas? I am leaning toward the 1.~~ Let's proceed with 1.

jwodder commented 1 year ago

@yarikoptic Couldn't we use dataset-specific git config to store information on what's been fully populated and what hasn't?

yarikoptic commented 1 year ago

Sure, especially since we already do that for stats. It would be a bit slower to go through all of them just to decide that e.g. there is nothing to be done, but I think it would not be that expensive.
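
A minimal sketch of what such a per-dataset flag could look like, assuming a hypothetical `dandi.populated` key in each dataset's local git config and hypothetical helper names; the key actually used for stats is not shown here:

```python
import subprocess
from pathlib import Path

# Hypothetical key name, chosen for illustration only.
POPULATED_KEY = "dandi.populated"


def set_populated(ds: Path, value: bool) -> None:
    """Store the flag in the dataset's own .git/config."""
    subprocess.run(
        ["git", "config", "--local", "--bool", POPULATED_KEY, str(value).lower()],
        cwd=ds,
        check=True,
    )


def is_populated(ds: Path) -> bool:
    """Return True only if the flag is set to true; an unset key counts as False."""
    result = subprocess.run(
        ["git", "config", "--local", "--bool", "--get", POPULATED_KEY],
        cwd=ds,
        capture_output=True,
        text=True,
    )
    # git config exits with status 1 when the key is not set
    return result.returncode == 0 and result.stdout.strip() == "true"
```

Here populate would skip datasets for which `is_populated` returns True and set the flag once a move finds nothing to move, and update would reset it to false after changing a dataset. Each check costs one `git config` call per dataset, which is the "a bit slower" part mentioned above.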

Make "populate" more efficient and not consider datasets already "populated" #255