Proposal for backing up 000108

jwodder commented 2 years ago

Given the massive size of 000108 and the repeated problems encountered while trying to back it up, I propose we do it in pieces instead of trying to get everything to work all at once. Specifically:

We should get a full, working backup of 000108 without Zarrs
- Add an option to update-from-backup for turning off Zarr syncing, and use it when backing up 000108
We back up each Zarr in 000108 one at a time, and once a Zarr is successfully backed up, we add it as a submodule to the 000108 dataset and commit it.
- Add a subcommand to backups2datalad that takes a Dandiset ID and backs up just the Zarrs and commits them to the Dandiset dataset one by one. This command should skip any Zarrs already backed up, so we just keep rerunning it until they're all done.

yarikoptic commented 2 years ago

don't we need zarrs when establishing the history of the dandiset?
it is ok for be to have a command to first create zarrs so update would be able to use pretreated ones. The problem though I think that Zarr fetching was crashing too, so would we be able to ever finish?

jwodder commented 2 years ago

@yarikoptic

don't we need zarrs when establishing the history of the dandiset?

What do you mean? We need to get commit timestamps in the right order in case 108 is ever published, but other than that, I don't see a problem with the dataset temporarily lacking some assets. Is it really necessary that the commit history reflect the Dandiset's update history?

The problem though I think that Zarr fetching was crashing too, so would we be able to ever finish?

A number of Zarrs have managed to finish, at least, such as (randomly selected sample):

I'm still not sure why the most recent backup attempt crashed, but I don't expect it to happen every time.

yarikoptic commented 2 years ago

I'm still not sure why the most recent backup attempt crashed, but I don't expect it to happen every time.

well, that was my experience so far that it would die one way or another so far so we never got a complete pass (#192, #203), so I would not NOT expect it to happen every time until we catch and address all the issues which we encounter leading it to a crash.

don't we need zarrs when establishing the history of the dandiset?

What do you mean? We need to get commit timestamps in the right order in case 108 is ever published, but other than that, I don't see a problem with the dataset temporarily lacking some assets. Is it really necessary that the commit history reflect the Dandiset's update history?

not strictly required but would be nice I guess. If we disable zarrs in that dataset, I am not sure how useful it would be since all the hdf5's probably would be gone too. So I think we could just postpone until all zarrs are produced first, and then we run the 'update' one to "install" them into that dataset.

jwodder commented 2 years ago

@yarikoptic How do I properly clone a dataset into a superdataset so that either (a) the clone operation does not result in an immediate commit, or (b) the commit message & timestamp are set to my liking? Using datalad.api.clone(source=..., path=...) does not update the superdataset's .gitmodules (except it magically does update it when cloning during a Dandiset sync; not sure why), while using ds.clone(source=..., path=...) results in an automatic commit, with no clear way to configure the commit message and timestamp.

yarikoptic commented 2 years ago

Use that datalad.api.clone(source=..., path=...) and then straight git submodule add and commit with dates/message? Or if we can tune that for da.save then that one could be used instead and it should add that to . gitmodules

jwodder commented 2 years ago

@yarikoptic I think I found why cloning subdatasets was working in one case but not the other, and it looks like a bug in DataLad.

Setup:

datalad create foo
cd foo
echo 'This is a file.' > file.txt
datalad save -m 'Add a file'
cd ..

If you clone and then save without specifying a path, the save fills in .gitmodules:

datalad create super
datalad clone foo super/foo
cd super
datalad save -m 'Add a subdataset'
# foo is now a submodule

but if you do "datalad save foo" instead:

datalad create super
datalad clone foo super/foo
cd super
datalad save -m 'Add a subdataset' foo

then nothing is saved!

Is this a bug in DataLad, or is there some secret way to tell "save" to convert just foo to a subdataset and save it?

jwodder commented 2 years ago

@yarikoptic

Use that datalad.api.clone(source=..., path=...) and then straight git submodule add and commit with dates/message?

Doing git submodule add $source_url $path after cloning doesn't set the datalad-id field in .gitmodules.

Or if we can tune that for da.save then that one could be used instead and it should add that to . gitmodules

I can't tell what you're trying to say here.

jwodder commented 2 years ago

@yarikoptic I've managed to get cloning to work by doing git submodule add followed by directly extracting the dataset ID and adding it to .gitmodules. I've filed the "save" behavior I described above as a bug in DataLad: https://github.com/datalad/datalad/issues/6775

jwodder commented 2 years ago

@yarikoptic I've finished running prepare-partial-zarrs.sh successfully. The next step is to actually run the backup-zarrs command, which can be done with the backups2datalad-backup-zarrs-108 script in tools/.

yarikoptic commented 2 years ago

Great! Please run it in screen.

yarikoptic commented 2 years ago

I believe this proposal was implemented (ATM we are fighting new daemons due to size/etc)

dandi / dandisets

Proposal for backing up 000108 #213