dandi / dandisets

760 Dandisets, 817.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

"group" or limit registerurl calls? #225

Closed: yarikoptic closed this issue 1 year ago

yarikoptic commented 2 years ago

Looking at what processes are busy on smaug, I see that `git-annex registerurl --batch --json --json-error-messages` is one of the busiest in terms of IO. I thought that it might be due to the "flat" journal, so I inquired with @joeyh at https://git-annex.branchable.com/todo/more___34__filesystem_efficient__34___journalling__63___/ but he thinks everything is fine on that end.
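For reference, this is roughly how those processes are driven; a minimal sketch with a hypothetical input file (`registerurl --batch` reads one `key url` pair per line on stdin):

```shell
# hypothetical input: one "key url" pair per line; the url shown is made up
cat > key-url-pairs.txt <<'EOF'
MD5E-s560--82188063b1988362cc3050918f493320 https://example-bucket.s3.amazonaws.com/zarr/some/chunk?versionId=abc123
EOF

# each line registers one url for one key; the journal under
# .git/annex/journal/ gets updated per registration
git annex registerurl --batch --json --json-error-messages < key-url-pairs.txt
```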

I strace'd one such process for a few seconds. Full log here. Some observations:

```shell
$> grep MD5E-s560--82188063b1988362cc3050918f493320 registerurl-3644019.log
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 16
3644019 openat(AT_FDCWD, ".git/annex/othertmp/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666 <unfinished ...>
3644019 rename(".git/annex/othertmp/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web") = 0
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 16
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 16
3644019 openat(AT_FDCWD, ".git/annex/othertmp/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666 <unfinished ...>
3644019 rename(".git/annex/othertmp/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web") = 0
...
```

So each `registerurl` call for this key re-reads the whole `.log.web`, writes a full updated copy under `.git/annex/othertmp/`, and renames it over the journal file. That file grew to over 1 MB in size by the end:

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandizarrs/0176fece-87bd-4a63-acb6-c9f57e3c53e6/.git/annex/journal$ git ls-tree --long -r git-annex | grep MD5E-s560--82188063b1988362cc3050918f493320
100644 blob 4a6ff8dc16ab9615e182cb45f82b130659749ebe      58    5a4/81f/MD5E-s560--82188063b1988362cc3050918f493320.log
100644 blob b914169e63289583f52fa0035da4e7acfac985f3 1181365    5a4/81f/MD5E-s560--82188063b1988362cc3050918f493320.log.web
```
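A rough back-of-envelope shows why this pattern hurts: rewriting the whole file on every addition makes total journal I/O quadratic in the number of urls per key. The 110 bytes/url figure below is an assumed average, not measured:

```shell
# rough estimate only; assumes ~110 bytes per url entry in the .log.web
awk 'BEGIN {
    final = 1181365                # final .log.web size from git ls-tree above
    per   = 110                    # assumed average bytes per url line
    n     = int(final / per)       # approximate number of urls for this key
    total = per * n * (n + 1) / 2  # sum of file sizes across all rewrites
    printf "~%d urls; ~%.1f GB written to rewrite this one journal file\n",
           n, total / 1e9
}'
```

Under those assumptions, that single 1.2 MB journal file would have cost on the order of 6 GB of writes.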

Any other ideas @jwodder?

yarikoptic commented 2 years ago

Echoing https://git-annex.branchable.com/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/#comment-85131966a882ff24431566d146b5634b: can we reorder `registerurl` calls so that we do all additions of urls to a key before proceeding to the next key? I thought so, since we are probably fetching all the information about the keys first, so we can reorder any way we want, right @jwodder?
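For illustration, a hypothetical way to do such grouping if all pairs were available up front (input file name assumed): sorting the batch input by key makes consecutive `registerurl` calls hit the same journal file:

```shell
# hypothetical: if every (key, url) pair were known in advance, sorting by
# key would group all urls for a key into consecutive registerurl calls
sort -k1,1 key-url-pairs.txt \
    | git annex registerurl --batch --json --json-error-messages
```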

jwodder commented 2 years ago

@yarikoptic

> I thought so, since we are probably fetching all the information about the keys first

If by "key" you mean "git-annex key" (not to be confused with "S3 key"), we are not. We fetch data about the entries in S3 ordered by path in paginated batches. Fetching everything before doing anything would not be efficient.
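For context, a sketch of that access pattern (bucket name and prefix here are placeholders, not the actual ones used): pages arrive ordered by S3 path, so urls belonging to the same git-annex key are interleaved across the stream rather than grouped:

```shell
# placeholder bucket/prefix; the listing is returned page by page, ordered
# by object path, so entries for one git-annex key are not adjacent
aws s3api list-object-versions \
    --bucket example-bucket \
    --prefix zarr/0176fece-87bd-4a63-acb6-c9f57e3c53e6/ \
    --page-size 1000
```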

yarikoptic commented 2 years ago

Yes -- annex key. Well, I now know that we have some very inefficient behavior in git-annex which, if not resolved, might require us to sacrifice some efficiency in walking S3, I guess.

joeyh commented 2 years ago

To fix this, get git-annex version 10.20220724 (to be released on Monday) or current master.

Then set `annex.alwayscompact=false` when running this `registerurl`, or anything else that needs to write bulk data to the git-annex branch.
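A minimal sketch of applying it (input file name hypothetical); `git -c` scopes the setting to a single invocation, while `git config` persists it for the repository:

```shell
# one-off: only this invocation sees the setting
git -c annex.alwayscompact=false annex registerurl --batch --json \
    --json-error-messages < key-url-pairs.txt

# or persist it for this repository
git config annex.alwayscompact false
```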

Note that it's not entirely safe to set that if an older version of git-annex can still be used in the repository. How to avoid that I have to leave up to you.
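One possible guard, as a sketch (the `sort -V` comparison is an assumption about how one might check this; git-annex itself provides no such guard):

```shell
# only persist the setting when the installed git-annex is new enough
ver=$(git annex version --raw)   # e.g. 10.20220724-ge30d846
if printf '%s\n' 10.20220724 "$ver" | sort -V -C; then
    git config annex.alwayscompact false
fi
```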

Speedup should be massive BTW.

yarikoptic commented 2 years ago

I think I can confirm the speedup (will rerun once again) on the original sample case:

With `annex.alwayscompact=false`: 49:33.03 elapsed; without: 1:34:20 elapsed, so ~50 min vs ~1 h 34 min.

```shell
(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ HOME=$PWD/HOME/ PYTHONPATH=/mnt/backup/dandi/dandisets/tools:$PYTHONPATH /usr/bin/time python sync_zarr-v1.py 000108 sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr test2-alwayscompact=false 2>&1 | tee test2-alwayscompact=false.out
Backing up 000108:sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr into test2-alwayscompact=false
cd
create(ok): /mnt/backup/dandi/tmp/bm-git-annex/test2-alwayscompact=false/0176fece-87bd-4a63-acb6-c9f57e3c53e6 (dataset)
add(ok): .dandi/.gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
whereis: 106817 failed
add(ok): .dandi/s3sync.json (file)
add(ok): .dandi/zarr-checksum (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
1508.20user 293.12system 49:33.03elapsed 60%CPU (0avgtext+0avgdata 263844maxresident)k
11760inputs+13400480outputs (16563major+32622629minor)pagefaults 0swaps

(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ PYTHONPATH=/mnt/backup/dandi/dandisets/tools:$PYTHONPATH /usr/bin/time python sync_zarr-v1.py 000108 sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr test2-default 2>&1 | tee test2-default.out
Backing up 000108:sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr into test2-default
create(ok): /mnt/backup/dandi/tmp/bm-git-annex/test2-default/0176fece-87bd-4a63-acb6-c9f57e3c53e6 (dataset)
add(ok): .dandi/.gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
whereis: 106817 failed
add(ok): .dandi/s3sync.json (file)
add(ok): .dandi/zarr-checksum (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
1938.18user 825.74system 1:34:20elapsed 48%CPU (0avgtext+0avgdata 262872maxresident)k
278568inputs+31755512outputs (16101major+42406561minor)pagefaults 0swaps

(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ diff -Naur ~/.gitconfig $PWD/HOME/.gitconfig
--- /home/dandi/.gitconfig      2022-07-29 08:44:54.377969020 -0400
+++ /mnt/backup/dandi/tmp/bm-git-annex/HOME/.gitconfig  2022-08-09 13:51:58.992612662 -0400
@@ -6,9 +6,11 @@
     oauthtoken = BIGSECRET!
 [annex]
     stalldetection = 1KB/120s
+    alwayscompact = false
 # v9 feature to avoid smudge filters in some cases and speed up operation
 # originally applied to 000026 but then noted that zarr commits are slow
 # as well, so decided to apply globally
 [filter "annex"]
     process = git-annex filter-process
+

(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ git annex version
git-annex version: 10.20220724-ge30d846
```
A rerun showed similar/better improvements (28:26.96 vs 1:25:11 elapsed):

```shell
(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ PYTHONPATH=/mnt/backup/dandi/dandisets/tools:$PYTHONPATH /usr/bin/time python sync_zarr-v1.py 000108 sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr test2-default-2 2>&1 | tee test2-default.out-2
Backing up 000108:sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr into test2-default-2
create(ok): /mnt/backup/dandi/tmp/bm-git-annex/test2-default-2/0176fece-87bd-4a63-acb6-c9f57e3c53e6 (dataset)
add(ok): .dandi/.gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
whereis: 106817 failed
add(ok): .dandi/s3sync.json (file)
add(ok): .dandi/zarr-checksum (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
1919.45user 862.80system 1:25:11elapsed 54%CPU (0avgtext+0avgdata 262208maxresident)k
228808inputs+31395600outputs (27210major+44558947minor)pagefaults 0swaps

(dandi-cli-latest) dandi@drogon:/mnt/backup/dandi/tmp/bm-git-annex$ HOME=$PWD/HOME/ PYTHONPATH=/mnt/backup/dandi/dandisets/tools:$PYTHONPATH /usr/bin/time python sync_zarr-v1.py 000108 sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr test2-alwayscompact=false-2 2>&1 | tee test2-alwayscompact=false.out-2
Backing up 000108:sub-MITU01/ses-20210915h15m45s27/micr/sub-MITU01_ses-20210915h15m45s27_sample-109_stain-LEC_run-1_chunk-6_SPIM.ome.zarr into test2-alwayscompact=false-2
create(ok): /mnt/backup/dandi/tmp/bm-git-annex/test2-alwayscompact=false-2/0176fece-87bd-4a63-acb6-c9f57e3c53e6 (dataset)
add(ok): .dandi/.gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
whereis: 106817 failed
add(ok): .dandi/s3sync.json (file)
add(ok): .dandi/zarr-checksum (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
1393.50user 244.74system 28:26.96elapsed 95%CPU (0avgtext+0avgdata 261208maxresident)k
43832inputs+11924872outputs (1077major+34324068minor)pagefaults 0swaps
```
yarikoptic commented 1 year ago

OK, I will consider this one resolved.