yarikoptic closed 1 year ago
@yarikoptic Couldn't we use dataset-specific git config to store information on what's been fully populated and what hasn't?
Sure -- especially since we already do that for stats. It would be a bit slower to go through all of them just to decide that, e.g., there is nothing to be done, but I think it would not be that expensive.
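A minimal sketch of the per-dataset git config idea, assuming a hypothetical key name `dandi.populated` (the actual key names used for stats are not given in this thread) and that the `git` executable is on `PATH`. The helper names are illustrative only:

```python
import subprocess
from pathlib import Path

# Hypothetical key name, chosen for illustration only.
KEY = "dandi.populated"


def set_populated(ds: Path, populated: bool) -> None:
    """Record in the dataset's local git config whether it is fully populated."""
    value = "true" if populated else "false"
    subprocess.run(
        ["git", "-C", str(ds), "config", "--local", KEY, value],
        check=True,
    )


def is_populated(ds: Path) -> bool:
    """Return True only if the dataset's local git config marks it as populated."""
    r = subprocess.run(
        ["git", "-C", str(ds), "config", "--local", "--get", KEY],
        capture_output=True,
        text=True,
    )
    # `git config --get` exits non-zero when the key is not set.
    return r.returncode == 0 and r.stdout.strip() == "true"
```

With something like this, a populate run would still have to visit every dataset to read its config, which is the "bit slower" cost mentioned above, but it could skip `annex move` wherever `is_populated` returns true.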
As we now have hundreds of dandisets and thousands of zarrs, it is wasteful, and might eventually become prohibitive, to run the `annex move` command in each one of them even when they have had no change since the last time. I see possible approaches:

1. Centralize knowledge on candidates for `populate{,-zarr}`, so whenever any of them is `populated` fully (i.e. the `move` command ran and found nothing to be `move`d), add that path to the registry.
   - `populate*` commands would consult it to skip those which are known to be fully populated.
   - `update` command would remove a dataset from that registry if it incurred any change.
   - Cons: needs some centralized DB guarded by an interprocess/threaded lock. Possible race condition (that is why I described adding only whenever "nothing to move" and not when "move has just completed", to minimize the raciness).
   - Could be `.git/dandi/state.yaml` of the superdataset with records for dandisets and dandizarrs -- no need for that file to be under git VCS.
2. Rely on some fscacher-like fingerprinting of `.git/heads`, so if `moved` and no changes to `.git/heads` -- no need to move again. Cons: still would need some persistent storage, would need some `gc` over it, and would overall be slower since it would need filesystem traversal through all candidate datalad datasets.

@jwodder -- any other ideas? I am leaning toward 1.

Let's proceed with 1.
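
For illustration, approach 1 could be sketched roughly as follows. This is not the actual implementation: it assumes a Unix system (it relies on `fcntl.flock`), stores the registry as JSON rather than the `state.yaml` proposed above so that it needs only the standard library, and all function names are hypothetical.

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_state(state_file: Path):
    """Yield the registry dict while holding an exclusive lock; save it on clean exit."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    lock_path = state_file.with_name(state_file.name + ".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            state = json.loads(state_file.read_text()) if state_file.exists() else {}
            yield state
            # Reached only if the caller's block did not raise.
            state_file.write_text(json.dumps(state, indent=2, sort_keys=True))
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


def mark_populated(state_file: Path, dataset: str) -> None:
    """Call only when `move` ran and found nothing left to move."""
    with locked_state(state_file) as state:
        state["populated"] = sorted(set(state.get("populated", [])) | {dataset})


def mark_changed(state_file: Path, dataset: str) -> None:
    """Call from `update` whenever the dataset incurred any change."""
    with locked_state(state_file) as state:
        state["populated"] = [p for p in state.get("populated", []) if p != dataset]


def needs_populate(state_file: Path, dataset: str) -> bool:
    """Return False for datasets known to be fully populated."""
    with locked_state(state_file) as state:
        return dataset not in state.get("populated", [])
```

A populate run would call `needs_populate` for each candidate and skip the `annex move` invocation when it returns false. Because entries are added only after a no-op move, a concurrent `update` can at worst cause one redundant move rather than a missed one.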