Closed by yarikoptic 1 year ago
@yarikoptic Note that the code for retrieving `dandi.stats` currently falls back to recomputing the stats if they're out of date. Should that behavior apply for this as well? If not, what should happen if a Dandiset lacks `dandi.stats` (say, because the stats are being collected by a run on 000108 while a separate run on all other Dandisets just started work on a new, never-before-backed-up Dandiset)?
> @yarikoptic Note that the code for retrieving `dandi.stats` currently falls back to recomputing the stats if they're out of date. Should that behavior apply for this as well?
I think making it a "read-only" operation would be best, since that would avoid possible race conditions, etc. If it's not too hard, please make it avoid triggering recomputation and just retrieve the stats when updating information within the "super dataset".
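A minimal sketch of such a read-only retrieval, assuming (hypothetically) that each Dandiset caches its stats as JSON under `.dandi/stats.json` — the actual storage location in the backup scripts may differ. The point is simply to return `None` instead of recomputing when the cache is absent:

```python
import json
from pathlib import Path


def get_cached_stats(dandiset_path):
    """Read previously computed stats without triggering recomputation.

    Hypothetical cache layout: `<dandiset>/.dandi/stats.json`.
    Returns None when no cached stats exist; never recomputes.
    """
    stats_file = Path(dandiset_path) / ".dandi" / "stats.json"
    if not stats_file.exists():
        return None  # read-only: no fallback to recomputation
    with stats_file.open() as f:
        return json.load(f)
```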
> If not, what should happen if a Dandiset lacks `dandi.stats` (say, because the stats are being collected by a run on 000108 while a separate run on all other Dandisets just started work on a new, never-before-backed-up Dandiset)?
I would say: no stats, no accounting for it.
@yarikoptic Should the description for the main dandisets repo always be updated on every run (regardless of whether it's operating on just 000108 or all non-000108 Dandisets), or should there be a command-line option to turn this on/off?
I feel like "on every run", regardless of which Dandiset it operates on. If it's easy to add a CLI option like `--no-stats-update` to disable that, it might come in handy.
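The suggested flag could be wired up along these lines — a sketch using `argparse`, with the option name taken from the discussion (`--no-stats-update`) and the program name assumed for illustration:

```python
import argparse


def build_parser():
    """Sketch of the backup script's CLI with the suggested opt-out flag.

    `--no-stats-update` would disable refreshing the main dandisets
    repo description after a run; by default the update happens.
    """
    parser = argparse.ArgumentParser(prog="backup-dandisets")
    parser.add_argument(
        "--no-stats-update",
        action="store_true",
        help="do not update the dandisets repo description after the run",
    )
    return parser
```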
000108, and possibly others, will need to keep running in their individual runs. Because of that, the description of https://github.com/dandi/dandisets/ does not incorporate the 000108 size.
I think we should just make total-stats collection not rely on the data structures returned for the run (if that is what it does ATM), but rather read the `dandi.stats` for each Dandiset (regardless of which ones it worked on in that run) and update the GitHub description with the total.
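The aggregation step could look like the following sketch, again assuming the hypothetical `<dandiset>/.dandi/stats.json` cache layout and stat keys (`files`, `size`). Dandisets without a stats file are simply skipped ("no stats, no accounting"), and the resulting totals would then be pushed to GitHub as the repo description (e.g. via the GitHub REST API's repository-update endpoint, not shown here):

```python
import json
from pathlib import Path


def total_stats(superdataset):
    """Sum per-Dandiset cached stats across the whole superdataset.

    Reads every `<dandiset>/.dandi/stats.json` (hypothetical layout)
    regardless of which Dandisets the current run worked on; Dandisets
    lacking a stats file contribute nothing to the totals.
    """
    totals = {"files": 0, "size": 0}
    for stats_file in sorted(Path(superdataset).glob("*/.dandi/stats.json")):
        stats = json.loads(stats_file.read_text())
        totals["files"] += stats.get("files", 0)
        totals["size"] += stats.get("size", 0)
    return totals
```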