jwodder closed this issue 2 years ago
@yarikoptic
> don't we need zarrs when establishing the history of the dandiset?
What do you mean? We need to get commit timestamps in the right order in case 108 is ever published, but other than that, I don't see a problem with the dataset temporarily lacking some assets. Is it really necessary that the commit history reflect the Dandiset's update history?
The problem though I think that Zarr fetching was crashing too, so would we be able to ever finish?
A number of Zarrs have managed to finish, at least, such as (randomly selected sample):
I'm still not sure why the most recent backup attempt crashed, but I don't expect it to happen every time.
Well, that has been my experience so far: it would die one way or another, so we never got a complete pass (#192, #203). I would expect it to keep happening until we catch and address all the issues that lead to a crash.
> don't we need zarrs when establishing the history of the dandiset?

> What do you mean? We need to get commit timestamps in the right order in case 108 is ever published, but other than that, I don't see a problem with the dataset temporarily lacking some assets. Is it really necessary that the commit history reflect the Dandiset's update history?
Not strictly required, but it would be nice, I guess. If we disable zarrs in that dataset, I am not sure how useful it would be, since all the hdf5's would probably be gone too. So I think we could just postpone until all zarrs are produced first, and then run the 'update' one to "install" them into that dataset.
@yarikoptic How do I properly clone a dataset into a superdataset so that either (a) the clone operation does not result in an immediate commit, or (b) the commit message & timestamp are set to my liking? Using `datalad.api.clone(source=..., path=...)` does not update the superdataset's `.gitmodules` (except it magically does update it when cloning during a Dandiset sync; not sure why), while using `ds.clone(source=..., path=...)` results in an automatic commit, with no clear way to configure the commit message and timestamp.
Use that `datalad.api.clone(source=..., path=...)` and then a straight `git submodule add` and commit with the desired dates/message? Or, if we can tune that for `da.save`, then that one could be used instead, and it should add the entry to `.gitmodules`.
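The plain-git half of this suggestion can be sketched as follows. This is a minimal, self-contained illustration (scratch repos stand in for the real datasets; the date and paths are made up): `git submodule add` fills in `.gitmodules` without committing, and git's standard `GIT_AUTHOR_DATE`/`GIT_COMMITTER_DATE` environment variables control the commit timestamp.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=test GIT_AUTHOR_EMAIL=test@example.com
export GIT_COMMITTER_NAME=test GIT_COMMITTER_EMAIL=test@example.com

# stand-in for the already-cloned subdataset (a real DataLad dataset in practice)
git init -q foo
git -C foo commit -q --allow-empty -m 'init'

# the superdataset
git init -q super
git -C super commit -q --allow-empty -m 'init'

# register foo as a submodule; this updates .gitmodules but does not commit
# (protocol.file.allow is needed on newer git for local-path submodule clones)
git -C super -c protocol.file.allow=always submodule --quiet add ../foo foo

# commit with an explicit message and an explicit timestamp of our choosing
GIT_AUTHOR_DATE='2022-06-01 12:00:00 +0000' \
GIT_COMMITTER_DATE='2022-06-01 12:00:00 +0000' \
git -C super commit -q -m 'Add subdataset foo'

git -C super log -1 --format='%aI %s'   # shows the chosen date and message
```

Note that this alone does not record the DataLad dataset ID in `.gitmodules`, which is the gap discussed below.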
@yarikoptic I think I found why cloning subdatasets was working in one case but not the other, and it looks like a bug in DataLad.
Setup:
```shell
datalad create foo
cd foo
echo 'This is a file.' > file.txt
datalad save -m 'Add a file'
cd ..
```
If you clone and then save without specifying a path, the save fills in `.gitmodules`:
```shell
datalad create super
datalad clone foo super/foo
cd super
datalad save -m 'Add a subdataset'
# foo is now a submodule
```
but if you do `datalad save foo` instead:
```shell
datalad create super
datalad clone foo super/foo
cd super
datalad save -m 'Add a subdataset' foo
```
then nothing is saved!
Is this a bug in DataLad, or is there some secret way to tell `save` to convert just `foo` to a subdataset and save it?
@yarikoptic
> Use that `datalad.api.clone(source=..., path=...)` and then straight `git submodule add` and commit with dates/message?
Doing `git submodule add $source_url $path` after cloning doesn't set the `datalad-id` field in `.gitmodules`.
> Or if we can tune that for da.save then that one could be used instead and it should add that to `.gitmodules`
I can't tell what you're trying to say here.
@yarikoptic I've managed to get cloning to work by doing `git submodule add` followed by directly extracting the dataset ID and adding it to `.gitmodules`. I've filed the "save" behavior I described above as a bug in DataLad: https://github.com/datalad/datalad/issues/6775
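The workaround described here can be sketched roughly as follows. This is an illustration, not the actual backups2datalad code: it assumes the subdataset's ID lives in `foo/.datalad/config` under `datalad.dataset.id` (where `datalad create` records it), and the UUID and `.gitmodules` entries below are fabricated stand-ins for what `git submodule add` would have produced.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# stand-in for the cloned subdataset's DataLad config; a real dataset
# gets its UUID from `datalad create` (the value below is made up)
mkdir -p foo/.datalad
git config -f foo/.datalad/config datalad.dataset.id \
    '6d4a9f36-0000-1111-2222-333344445555'

# .gitmodules entries as `git submodule add ./foo foo` would have left them
git config -f .gitmodules submodule.foo.path foo
git config -f .gitmodules submodule.foo.url ./foo

# extract the subdataset's ID and record it under the datalad-id key,
# matching the entry DataLad itself writes for registered subdatasets
id=$(git config -f foo/.datalad/config datalad.dataset.id)
git config -f .gitmodules submodule.foo.datalad-id "$id"

cat .gitmodules
```

Using `git config -f` keeps the `.gitmodules` syntax valid without hand-editing the file.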
@yarikoptic I've finished running `prepare-partial-zarrs.sh` successfully. The next step is to actually run the `backup-zarrs` command, which can be done with the `backups2datalad-backup-zarrs-108` script in `tools/`.
Great! Please run it in screen.
I believe this proposal was implemented (ATM we are fighting new daemons due to size/etc)
> Given the massive size of 000108 and the repeated problems encountered while trying to back it up, I propose we do it in pieces instead of trying to get everything to work all at once. Specifically:
> - `update-from-backup` for turning off Zarr syncing, and use it when backing up 000108