dandi / dandisets

755 Dandisets, 815.8 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

collect stats from .git/config for all dandisets, not only the ones given for update #331

Closed by yarikoptic 1 year ago

yarikoptic commented 1 year ago

000108, and possibly others, will need to keep running in their own individual runs. Because of that, the description of https://github.com/dandi/dandisets/ does not incorporate the size of 000108.

I think we should make the total stats collection not rely on the data structures returned for the run (if that is what it does ATM), but rather read dandi.stats for each Dandiset (regardless of which ones were worked on in that run) and update the GitHub description with the total.
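A minimal sketch of what that could look like, not the project's actual code: read dandi.stats from each Dandiset's .git/config with `git config --file ... --get`, then build the description from the sum. The helper names (`read_dandi_stats`, `parse_size`, `describe`) are hypothetical, and the layout of the stored value (comma-separated, size in bytes as the last field) and the use of decimal terabytes are assumptions.

```python
import subprocess
from pathlib import Path
from typing import Optional


def read_dandi_stats(repo: Path) -> Optional[str]:
    """Return the raw dandi.stats value from repo/.git/config, or None if unset."""
    result = subprocess.run(
        ["git", "config", "--file", str(repo / ".git" / "config"),
         "--get", "dandi.stats"],
        capture_output=True,
        text=True,
    )
    # `git config --get` exits non-zero when the key (or the file) is missing.
    return result.stdout.strip() if result.returncode == 0 else None


def parse_size(raw: str) -> int:
    # ASSUMPTION: the value is comma-separated and its last field is the
    # size in bytes; the real layout of dandi.stats may differ.
    return int(raw.split(",")[-1])


def describe(sizes: list[int]) -> str:
    # ASSUMPTION: decimal terabytes (10**12 bytes), modelled on the existing
    # description text "755 Dandisets, 815.8 TB total."
    total_tb = sum(sizes) / 10**12
    return f"{len(sizes)} Dandisets, {total_tb:.1f} TB total."
```

The point of the sketch is that the total depends only on what is stored on disk for every Dandiset, not on which Dandisets the current run happened to process.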

jwodder commented 1 year ago

@yarikoptic Note that the code for retrieving dandi.stats currently falls back to recomputing the stats if they're out of date. Should that behavior apply for this as well? If not, what should happen if a Dandiset lacks dandi.stats (say, because the stats are being collected by a run on 000108 while a separate run on all other Dandisets just started work on a new, never-before-backed-up Dandiset)?

yarikoptic commented 1 year ago

> @yarikoptic Note that the code for retrieving dandi.stats currently falls back to recomputing the stats if they're out of date. Should that behavior apply for this as well?

I think making it a "read-only" operation would be best, since that would avoid possible races between runs, etc. If it is not too hard, please make it avoid triggering recomputation and just retrieve the stored stats when updating information within the "super dataset".

> If not, what should happen if a Dandiset lacks dandi.stats (say, because the stats are being collected by a run on 000108 while a separate run on all other Dandisets just started work on a new, never-before-backed-up Dandiset)?

I would say: no stats, no accounting for it.
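The policy agreed on here can be sketched as a pure aggregation step. This is an illustration under assumptions rather than the actual implementation: `Stats` and `total_stored_stats` are hypothetical names, and the input is assumed to map each Dandiset ID to its stored stats, or to `None` when dandi.stats is absent. Stored values are taken as they are, with no recomputation, and Dandisets without stats are simply left out of the total.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Stats:
    files: int
    size: int  # bytes


def total_stored_stats(
    stored: Mapping[str, Optional[Stats]],
) -> tuple[Stats, list[str]]:
    """Sum only Dandisets that already have stored stats; never recompute.

    Returns the total plus the IDs that were skipped for lack of stats.
    """
    files = 0
    size = 0
    skipped: list[str] = []
    for dandiset_id, stats in sorted(stored.items()):
        if stats is None:
            # "No stats, no accounting": leave this Dandiset out of the total.
            skipped.append(dandiset_id)
            continue
        files += stats.files
        size += stats.size
    return Stats(files, size), skipped
```

Returning the skipped IDs is a design choice of this sketch, made so that a caller could log which Dandisets were not counted.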

jwodder commented 1 year ago

@yarikoptic Should the description for the main dandisets repo always be updated on every run (regardless of whether it's operating on just 000108 or all non-000108 Dandisets), or should there be a command-line option to turn this on/off?

yarikoptic commented 1 year ago

I feel like it should be "on every run", regardless of which Dandisets the run operates on. If it is easy to add a CLI option such as --no-stats-update (or similar) to disable that, it might come in handy.
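One way such a switch could behave, sketched with the standard library's `argparse` as a stand-in (the thread does not say which CLI framework the tool uses). The program name `update-backup` and the helper names are hypothetical. Only the flag name `--no-stats-update` comes from the suggestion above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical program name, used only for the help text.
    parser = argparse.ArgumentParser(prog="update-backup")
    parser.add_argument(
        "dandisets",
        nargs="*",
        help="Dandiset IDs to operate on (all if omitted)",
    )
    # store_false makes the default True: the description is updated on
    # every run unless the flag is given.
    parser.add_argument(
        "--no-stats-update",
        dest="stats_update",
        action="store_false",
        help="do not update the super-dataset description with total stats",
    )
    return parser


def should_update_description(argv: list[str]) -> bool:
    return build_parser().parse_args(argv).stats_update
```

With this shape, a run limited to 000108 and a run over everything else both refresh the description by default, and either can opt out explicitly.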