dandi / dandisets

737 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

backup script of 000108 doing smth not clear what #269

Closed yarikoptic closed 1 year ago

yarikoptic commented 1 year ago

The last log message (visible at INFO, or logged at DEBUG) is from over an hour ago:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad$ tail -f 2022.09.26.16.05.04Z.log
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: Dandiset 000108: Finished running `git gc`
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: Running: git remote [cwd=/mnt/backup/dandi/dandisets/000108]
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: GitHub remote already exists for 000108
2022-09-27T12:59:58-0400 [INFO    ] backups2datalad: Dandiset 000108: Tagging releases for Dandiset
2022-09-27T12:59:58-0400 [DEBUG   ] httpx._client: HTTP Request: GET https://api.dandiarchive.org/api/dandisets/000108/versions/ "HTTP/1.1 200 OK"
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: Opening pipe to `git ls-tree -lrz HEAD` [cwd=/mnt/backup/dandi/dandisets/000108]
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: Command `git ls-tree -lrz HEAD` [cwd=/mnt/backup/dandi/dandisets/000108] exited with return code 0
2022-09-27T12:59:58-0400 [DEBUG   ] backups2datalad: Opening pipe to `git-annex find '--include=*' --json` [cwd=/mnt/backup/dandi/dandisets/000108]
2022-09-27T12:59:59-0400 [DEBUG   ] backups2datalad: Waiting for `git-annex find '--include=*' --json` [cwd=/mnt/backup/dandi/dandisets/000108] to terminate
2022-09-27T12:59:59-0400 [DEBUG   ] backups2datalad: Command `git-annex find '--include=*' --json` [cwd=/mnt/backup/dandi/dandisets/000108] exited with return code 0

Nothing else has been shown or logged since, but I do see it starting `ls-files` from time to time:

dandi    1138091 87.5  2.3 3700268 1539192 pts/17 Sl+ Sep26 1370:56       python -m tools.backups2datalad -l INFO --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 1 000108
dandi     784496  9.0  0.0  18812 11508 pts/17   S+   14:11   0:00         git -c diff.ignoreSubmodules=none ls-files --stage -z

@jwodder -- any idea what it is doing, and whether there is hope for it to finish any time in the future? ;)

jwodder commented 1 year ago

@yarikoptic Presumably, it's calculating the total size of the 000108 backup (putting it in this loop), though I'm not sure what exactly would be taking that long.
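For context, the log above shows that loop piping `git ls-tree -lrz HEAD` and `git-annex find --include='*' --json`. A minimal sketch of summing sizes from that JSON stream (a hypothetical helper, not the actual backups2datalad code; it assumes each line is a JSON object with a `"bytesize"` field, which git-annex emits as a string):

```python
import json

def annex_total_size(json_lines):
    """Sum byte sizes from `git-annex find --json` output lines.

    Hypothetical helper for illustration only: each line is assumed
    to be one JSON record with a string "bytesize" field, which may
    be empty for keys of unknown size.
    """
    total = 0
    for line in json_lines:
        rec = json.loads(line)
        size = rec.get("bytesize", "")
        if size.isdigit():
            total += int(size)
    return total
```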

yarikoptic commented 1 year ago

I see!

first I looked at `py-spy top` -- hard to grasp due to async:

```shell
 %Own   %Total  OwnTime  TotalTime  Function (filename)
13.00%  13.00%  12.58s   12.58s     process (datalad/runner/utils.py)
 4.00%   6.00%   1.68s    1.86s     parse_parts (pathlib.py)
10.00%  10.00%   1.62s    1.62s     read (datalad/runner/runnerthreads.py)
 8.00%  10.00%   1.18s    1.71s     _execute_child (subprocess.py)
 9.00%  43.00%   0.820s   6.06s     _parse_gitmodules (datalad/support/gitrepo.py)
 0.00%   0.00%   0.800s   2.20s     _get_content_info_line_helper (datalad/support/gitrepo.py)
 2.00%   2.00%   0.660s   0.660s    _worker (concurrent/futures/thread.py)
 2.00%   2.00%   0.530s   0.530s    _gitcfg_rec_to_keyvalue (datalad/config.py)
 1.00%   7.00%   0.410s   2.27s     _parse_args (pathlib.py)
 4.00%   6.00%   0.350s   0.880s    parse_gitconfig_dump (datalad/config.py)
 2.00%   2.00%   0.260s   0.260s    (datalad/support/gitrepo.py)
 1.00%   6.00%   0.260s   1.09s     process_queue (datalad/runner/nonasyncrunner.py)
 0.00%   0.00%   0.250s   0.320s    _from_parsed_parts (pathlib.py)
 3.00%   4.00%   0.240s   0.610s    get (queue.py)
 1.00%   1.00%   0.220s   0.320s    notify (threading.py)
 0.00%  57.00%   0.200s  22.62s     get_submodules_ (datalad/support/gitrepo.py)
 0.00%   2.00%   0.190s   0.520s    _close_pipe_fds (subprocess.py)
 2.00%   2.00%   0.180s   0.180s    splitroot (pathlib.py)
 1.00%   1.00%   0.170s   0.170s    _cparts (pathlib.py)
 0.00%   1.00%   0.170s   0.440s    put (queue.py)
 0.00%   1.00%   0.170s   0.320s    __hash__ (pathlib.py)
 1.00%  30.00%   0.150s  15.73s     _generator_call_git (datalad/dataset/gitrepo.py)
 0.00%   3.00%   0.140s   1.39s     _from_parts (pathlib.py)
 0.00%   0.00%   0.140s   0.140s    decode (datalad/runner/utils.py)
 1.00%   5.00%   0.140s   1.16s     _make_child (pathlib.py)
 0.00%   6.00%   0.120s   1.21s     send (datalad/runner/nonasyncrunner.py)
 0.00%   0.00%   0.120s   0.120s    __enter__ (threading.py)
 0.00%   0.00%   0.110s   0.110s    __exit__ (threading.py)
 0.00%   0.00%   0.100s   0.100s    _is_owned (threading.py)
 0.00%   1.00%   0.090s   0.230s    callback (contextlib.py)
 1.00%   1.00%   0.090s   0.090s    __exit__ (contextlib.py)
 0.00%   0.00%   0.090s   0.090s    _init (pathlib.py)
 0.00%   0.00%   0.090s   0.460s    relative_to (pathlib.py)
 1.00%   2.00%   0.080s   0.520s    write (datalad/runner/runnerthreads.py)
 0.00%  30.00%   0.080s  15.82s     call_git_items_ (datalad/dataset/gitrepo.py)
 0.00%   0.00%   0.070s   0.070s    _create_cb_wrapper (contextlib.py)
 0.00%  30.00%   0.070s  15.89s     call_git (datalad/dataset/gitrepo.py)
 1.00%   1.00%   0.070s   0.070s    _push_exit_callback (contextlib.py)
 1.00%   1.00%   0.070s   0.220s    pipe_data_received (datalad/dataset/gitrepo.py)
 1.00%   4.00%   0.060s   1.45s     __new__ (pathlib.py)
 0.00%   0.00%   0.060s   0.060s    _format_parsed_parts (pathlib.py)
 0.00%  30.00%   0.060s  19.08s     (datalad/support/gitrepo.py)
 0.00%  57.00%   0.050s  22.67s     _parse_git_submodules (datalad/local/subdatasets.py)
 0.00%   0.00%   0.050s   0.050s    _try_wait (subprocess.py)
 0.00%   0.00%   0.040s   0.100s    __str__ (pathlib.py)
 0.00%   0.00%   0.040s   0.070s    _get_parent_paths_check (datalad/support/path.py)
 0.00%   0.00%   0.040s   0.060s    __eq__ (pathlib.py)
 0.00%   0.00%   0.040s   0.040s    _qsize (queue.py)
 0.00%   5.00%   0.040s   0.640s    __truediv__ (pathlib.py)
 0.00%   0.00%   0.030s   0.590s    joinpath (pathlib.py)
 0.00%   0.00%   0.030s   0.030s    get_jsonhooks_from_config (datalad/core/local/resulthooks.py)
 0.00%   6.00%   0.030s   1.24s     __next__ (_collections_abc.py)
 1.00%   1.00%   0.030s   0.030s    _acquire_restore (threading.py)
 0.00%  12.00%   0.020s   2.21s     run (datalad/runner/runnerthreads.py)
 0.00%   0.00%   0.020s   0.020s    _put (queue.py)
 0.00%   0.00%   0.020s   0.020s    join_parsed_parts (pathlib.py)
 0.00%   0.00%   0.020s   0.020s    __init__ (contextlib.py)
 0.00%   0.00%   0.020s   0.030s    isabs (posixpath.py)
 0.00%   0.00%   0.010s   0.010s    casefold_parts (pathlib.py)
 0.00%  30.00%   0.010s  19.62s     get_content_info (datalad/support/gitrepo.py)
 0.00%   0.00%   0.010s   0.010s    _get_sep (posixpath.py)
 0.00%   0.00%   0.010s   0.010s    join (posixpath.py)
 0.00%   0.00%   0.010s   0.010s    getpreferredencoding (locale.py)
 0.00%   0.00%   0.010s   0.080s    get_parent_paths (datalad/support/path.py)
 0.00%   0.00%   0.010s   0.010s    send_result (datalad/runner/protocol.py)
 0.00%  10.00%   0.010s   1.74s     _locked_run (datalad/runner/nonasyncrunner.py)
 0.00%  57.00%   0.010s  22.72s     eval_func (datalad/interface/utils.py)
 0.00%   0.00%   0.010s   0.010s    __init__ (datalad/runner/utils.py)
 0.00%   0.00%   0.010s   0.010s    release_on_behalf_of (anyio/_backends/_asyncio.py)
 0.00%   0.00%   0.010s   0.010s    add (_weakrefset.py)
 0.00%   0.00%   0.010s   0.020s    __init__ (threading.py)
 0.00%   0.00%   0.010s   0.030s    get_stats (backups2datalad/adataset.py)
 0.00%   0.00%   0.010s   0.010s    __exit__ (anyio/_backends/_asyncio.py)
 0.00%   0.00%   0.010s   0.010s    __call__ (datalad/support/constraints.py)
 0.00%   2.00%   0.000s   0.660s    run (threading.py)
 0.00%   0.00%   0.000s   0.410s    run_on_filelist_chunks_items_ (datalad/runner/gitrunner.py)
```

and then at the loop you pointed to. I think the slow performance is due to calling `self.get_subdatasets(path=path)` for each of those thousands of paths; trying it out in an interactive Python shell, each call takes about 1-2 seconds. So OK -- it should finish soon.
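For what it's worth, that estimate is consistent with the observed stall; a back-of-the-envelope check (round numbers assumed, not measured precisely):

```python
# ~4k submodule paths, each paying a ~1-2 s get_subdatasets() call
n_paths = 4000
per_call_s = 1.5  # midpoint of the observed 1-2 s per call
total_minutes = n_paths * per_call_s / 60
print(total_minutes)  # 100.0 -- consistent with "over an hour" of silence
```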

yarikoptic commented 1 year ago

OK, it completed, and the runtime roughly matched the estimate of 1-2 s × ~4k submodules. #270 should make that faster AFAIK.
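I don't know the exact shape of the #270 change, but the general fix for this pattern is to compute the set of submodule roots once and classify paths in memory, instead of spawning a `get_subdatasets(path=path)` subprocess per path. A hypothetical sketch (names are illustrative, not the actual backups2datalad or DataLad API):

```python
from pathlib import PurePosixPath

def classify_paths(paths, submodule_roots):
    """Map each path to the submodule root containing it (or None).

    Illustrative only: submodule_roots is assumed to be computed once
    up front (e.g. from a single `git ls-files --stage` pass), turning
    O(n) subprocess calls into one call plus in-memory lookups.
    """
    # check deeper roots first so the closest enclosing submodule wins
    roots = sorted((PurePosixPath(r) for r in submodule_roots),
                   key=lambda r: len(r.parts), reverse=True)
    result = {}
    for p in paths:
        pp = PurePosixPath(p)
        result[p] = next(
            (str(r) for r in roots if r == pp or r in pp.parents), None
        )
    return result
```

With 4k paths this replaces ~100 minutes of subprocess round-trips with a single repository scan and dictionary lookups.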