dandi / dandisets

736 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

Simple queue/collision avoidance between jobs on the same dandiset #159

Closed by yarikoptic 2 years ago

yarikoptic commented 2 years ago

https://github.com/dandi/dandisets/blob/draft/tools/backups2datalad-cron ATM goes serially through `update-from-backup`, `populate dandi-dandisets-dropbox`, and `populate-zarrs`. All those jobs should be parallelizable, in particular across different dandisets. ATM one long-running or stuck job delays running any next one.

I think we can have separate cron jobs and run them in parallel.

But we must ensure that two jobs working on the same dandiset do not interfere. So let's add locking within each dandiset so that the next job is not started until the previous one has finished. If a lock is already present for a dandiset, wait e.g. up to 5 minutes, and then skip that dandiset on that run.

Ideally: establish timestamp reporting (stored e.g. in `.git/dandi/jobs.json`) recording when each of those (related) jobs was last run; then, if unable to lock and the prior run was e.g. 1 week ago, issue an error.
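A minimal sketch of that flow, assuming a flock-based lock file per dandiset; the function name, lock directory, and the `last_run` value (which the caller would read from `jobs.json`) are illustrative, not the actual backups2datalad code:

```python
import fcntl
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

LOCK_WAIT = timedelta(minutes=5)    # how long to wait for a busy dandiset
STALE_AFTER = timedelta(weeks=1)    # complain if we could not update for this long


def try_lock_dandiset(lock_dir: Path, dandiset_id: str, last_run: Optional[datetime]):
    """Try to take the per-dandiset lock, polling for up to LOCK_WAIT.

    Returns an open file object holding the lock, or None if this dandiset
    should be skipped on this run.  Raises if the lock cannot be taken and
    the last recorded run is older than STALE_AFTER.
    """
    fp = open(lock_dir / f"{dandiset_id}.lock", "w")
    deadline = time.monotonic() + LOCK_WAIT.total_seconds()
    while True:
        try:
            fcntl.flock(fp, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fp  # lock acquired; the caller closes fp to release it
        except BlockingIOError:
            if time.monotonic() >= deadline:
                break
            time.sleep(5)
    fp.close()
    # Could not lock within 5 minutes: normally just skip this dandiset,
    # but error out if we have not managed to process it for ~a week.
    if last_run is not None and datetime.now(timezone.utc) - last_run > STALE_AFTER:
        raise RuntimeError(
            f"Dandiset {dandiset_id} is locked and was last processed at {last_run}"
        )
    return None
```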

jwodder commented 2 years ago

@yarikoptic If there's a new Dandiset on the server that hasn't been backed up yet, it's possible that concurrent cronjobs would try to initialize a new dataset for it at the same time. Can DataLad handle that intelligently, or is some extra locking needed?

jwodder commented 2 years ago

@yarikoptic

> if unable to lock and prior run was e.g. 1 week ago, issue an error.

Under what sort of circumstances do you expect this to ever happen? I can only see this happening if a Dandiset takes a week to back up, but is that error-worthy?

yarikoptic commented 2 years ago

> @yarikoptic If there's a new Dandiset on the server that hasn't been backed up yet, it's possible that concurrent cronjobs would try to initialize a new dataset for it at the same time. Can DataLad handle that intelligently, or is some extra locking needed?

DataLad doesn't do locking for this. You can use the superdataset's .git folder for per-dandiset locks - it would be a common location for all jobs, and we know it already exists.
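For example (an assumed layout, not necessarily what was implemented), the lock files could live in a subdirectory of the superdataset's .git:

```python
from pathlib import Path


def dandiset_lockfile(superds: Path, dandiset_id: str) -> Path:
    """Per-dandiset lock file inside the superdataset's .git directory,
    which is shared by all jobs and already exists; the "dandi/locks"
    subdirectory name is illustrative."""
    lockdir = superds / ".git" / "dandi" / "locks"
    lockdir.mkdir(parents=True, exist_ok=True)
    return lockdir / f"{dandiset_id}.lock"
```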

yarikoptic commented 2 years ago

> if unable to lock and prior run was e.g. 1 week ago, issue an error.

> Under what sort of circumstances do you expect this to ever happen?

E.g. annex get (or some other command) locks up, as it did just now: https://git-annex.branchable.com/bugs/get_is_busy_doing_nothing/ . That seems to be avoidable at the git-annex level but needs configuration yet to be established. And who knows what else might cause a lock-up.

> I can only see this happening if a Dandiset takes a week to back up, but is that error-worthy?

I would say "if there is an unlucky star arrangement": besides the above, I do not see an immediate scenario where it could happen, but with all the parallelization there is the possibility of not looking at some dandiset for a long period of time because of some odd coincidence (e.g. all parallel jobs start at the same time and hit the same first dandiset at the same time), leaving us without needed updates.

jwodder commented 2 years ago

@yarikoptic

> if unable to lock and prior run was e.g. 1 week ago, issue an error.

By "issue an error", do you mean "log an ERROR-level log message (and continue with further Dandisets)" or "raise an exception (and stop)"?

Also, should the jobs.json file record the start time or the end time of each operation?

Also, currently the code saves all modified datasets to the superdataset at the end of update-from-backup at once; could running in parallel cause problems here, necessitating creating a separate commit in the superdataset for each modified dataset?

yarikoptic commented 2 years ago

I would say -- exception. I want to detect/handle those manually for now (to see how we should react to them, etc.).

> Also, should the jobs.json file record the start time or the end time of each operation?

I do not see an immediate use for the start time, but it would not hurt either and could come in handy for getting some stats. So it is a good idea to record both times.
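A sketch of what recording both timestamps could look like; the schema of jobs.json here is an assumption, not the actual format used:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_run(jobs_file: Path, job: str, started: datetime, finished: datetime) -> None:
    """Record the start and end time of `job` in a dandiset's jobs.json,
    e.g. {"populate": {"start": "...", "end": "..."}}."""
    data = json.loads(jobs_file.read_text()) if jobs_file.exists() else {}
    data[job] = {
        "start": started.astimezone(timezone.utc).isoformat(),
        "end": finished.astimezone(timezone.utc).isoformat(),
    }
    jobs_file.parent.mkdir(parents=True, exist_ok=True)
    jobs_file.write_text(json.dumps(data, indent=2) + "\n")
```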

Top-level save -- I would just add locking there at the level of the superdataset to avoid two parallel save invocations.
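A sketch of such a superdataset-level lock around the top-level save; the lock path, placeholder path, and commit message are assumptions, while `datalad save -d ... -m ...` is the existing CLI invocation:

```python
import fcntl
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def superdataset_save_lock(superds: Path):
    """Serialize top-level `datalad save` invocations across parallel jobs."""
    lockfile = superds / ".git" / "dandi" / "save.lock"
    lockfile.parent.mkdir(parents=True, exist_ok=True)
    with open(lockfile, "w") as fp:
        fcntl.flock(fp, fcntl.LOCK_EX)  # blocks until any other save finishes
        yield
        # the lock is released when fp is closed on leaving the `with` block


# Usage: commit all modified subdatasets in one serialized top-level save.
# superds = Path("/path/to/dandisets-superdataset")  # placeholder path
# with superdataset_save_lock(superds):
#     subprocess.run(
#         ["datalad", "save", "-d", str(superds), "-m", "Update Dandisets"],
#         check=True,
#     )
```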

NB I feel like we might add a generic interface at DataLad's Interface level to add such locking "on request"... but later.

jwodder commented 2 years ago

@yarikoptic How should locking of Zarr datasets be handled? Should a Zarr ever be locked while its corresponding Dandiset is not locked? It might be easiest to merge populate-zarrs into populate and have the latter populate all Zarr submodules in a given Dandiset after populating blob assets while keeping the Dandiset locked, which would eliminate the need to lock Zarrs at all.

yarikoptic commented 2 years ago

Why don't we lock just within each zarr dataset as we get to a specific zarr, either in populate-zarrs (which just transfers data IIRC) or in update-from-backup, which might create/update them? I do not quite see how melding populate-zarrs and populate would help, as they feel like independent jobs (one get/copy for dandisets, another get/copy for dandizarrs) -- where am I confused?

jwodder commented 2 years ago

@yarikoptic If we don't lock the corresponding Dandiset while populating a Zarr, then the backup command could try to back up a Dandiset whose Zarrs are currently being populated, and if this backup reaches a Zarr that is currently locked due to being populated and the population takes more than five minutes ... what should happen?

yarikoptic commented 2 years ago

Thinking about it more: in principle, git-annex should be just fine with the get/move of the populate jobs being run in parallel with all those registerurl etc. calls of the main backup command. AFAIK git-annex journals all those changes and manages to commit them into the git-annex branch without causing a conflict, etc... so most likely those populate commands should be just fine running in parallel with update-from-backup, which could alter the tree/commit. The only gotcha IMHO might be the "push", but that is also done separately and likely shouldn't interfere, since it modifies only remote references. The only problematic part is then the "pull" in the superdataset before running "push", which could interfere with the update-from-backup process.

So, overall, I think maybe we should just try to parallelize all three (with pull/push after update-from-backup only) first and see if any problems arise? ;) WDYT @jwodder -- maybe you foresee some conflicts/race conditions I have missed?

jwodder commented 2 years ago

@yarikoptic I can't think of any conflicts.

yarikoptic commented 2 years ago

Ok, I will then just split the cron job script into multiple ones.

yarikoptic commented 2 years ago

We split it into multiple scripts; so far it seems to be OK.