dandi / dandisets

730 Dandisets, 807.1 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

zarr backup might need "optimization" #363

Closed yarikoptic closed 6 months ago

yarikoptic commented 10 months ago

I found a 4-day-old process still running for a non-000108 dandiset. The process tree:

```
dandi      27781 50.7  1.5 3886708 1025188 ?     Rl   Oct27 3546:00                 python -m tools.backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000108$
dandi      90653  0.0  0.0  10820  2868 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=MD5E
dandi      90655  0.0  0.0 1074053100 11264 ?    Sl   Oct27   4:49                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex examinekey --batch --migrate-to-backend=MD5E
dandi      91009  0.0  0.0  10820  2856 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages
dandi      91011  6.1  0.8 1074060012 545916 ?   Sl   Oct27 426:27                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex whereis --batch-keys --json --json-error-messages
dandi      91021  0.0  0.0  14788  5308 ?        S    Oct27   4:55                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91432  0.0  0.0  10820  3008 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex fromkey --force --batch --json --json-error-messages
dandi      91434  0.1  0.0 1074053256 33024 ?    Sl   Oct27  13:16                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex fromkey --force --batch --json --json-error-messages
dandi      91443  0.0  0.0  14748  2236 ?        S    Oct27   0:00                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91499  0.0  0.0  11736  3136 ?        S    Oct27   2:54                       git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --stdin-paths --no-filters
dandi      91566  0.0  0.0  10820  2944 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91569  2.1  0.1 1074126968 71816 ?    Sl   Oct27 149:30                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91598  0.1  0.0  14852  4868 ?        S    Oct27   9:32                       git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.alwayscompact=false cat-file --batch
dandi      91599  0.0  0.0   6952  2472 ?        S    Oct27   3:13                       /bin/bash /usr/bin/git-annex-remote-rclone
dandi      27782  0.0  0.0   6384  2084 ?        S    Oct27   0:00                 grep -v nothing to save, working tree clean
```
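
The long-lived git-annex children above (`examinekey --batch`, `fromkey --batch`, etc.) speak a line-oriented batch protocol: the parent keeps each process alive, streams one request per line over stdin, and reads one reply per line from stdout, avoiding a fork/exec per key. A minimal sketch of that protocol (hypothetical wrapper for illustration, not the actual backups2datalad code, demoed with `cat` since git-annex may not be installed):

```python
import subprocess

class BatchCommand:
    """Line-oriented batch protocol as used by `git annex examinekey --batch`
    and friends: one request per stdin line, one reply per stdout line.
    Hypothetical wrapper for illustration only."""

    def __init__(self, *argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def query(self, line: str) -> str:
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()  # the child answers one line per request
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

# demo with `cat` standing in for git-annex: it simply echoes each request back
bc = BatchCommand("cat")
reply = bc.query("MD5E-s1024--0123456789abcdef0123456789abcdef")
print(reply)
bc.close()
```

The payoff is exactly what the process tree shows: a handful of persistent children instead of half a million short-lived ones.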

and looking at that zarr:

```
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls -l /proc/91011/cwd
lrwxrwxrwx 1 dandi dandi 0 Nov  1 12:32 /proc/91011/cwd -> /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
0  1  2  3  4
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* | nl | tail
494148  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/29
494149  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/3
494150  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/30
494151  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/31
494152  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2
494153  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2/.zarray
494154  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3
494155  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3/.zarray
494156  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4
494157  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray
```

so it is a "hefty" zarr -- half a million files. I wonder if we could somehow make that process faster; there was some splitindex feature etc.
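
The "splitindex" mentioned is presumably git's split-index mode, which keeps one large shared index file plus a small delta, so each index update rewrites far less data in a 500k-entry repo. A sketch of enabling it (whether git-annex's batch commands actually benefit here is an untested assumption):

```shell
# inside the zarr repo -- sketch only; benefit for git-annex batch mode unverified
git config core.splitIndex true
git config splitIndex.maxPercentChange 20  # rewrite the shared index only when the delta exceeds 20%
git update-index --split-index             # convert the current index to split form
```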

FWIW -- the above count includes folders. Without folders:

```
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* \! -type d | nl | tail -n 1
487185  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray
```

and that particular zarr is almost done, so I will keep it going for now:

```
❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .file_count
526320
```

edit: a py-spy top sample of the main process:

```
Total Samples 3277
GIL: 1.00%, Active: 17.00%, Threads: 15

  %Own   %Total  OwnTime  TotalTime  Function (filename)
 11.00%  11.00%   27.13s    27.13s   _worker (concurrent/futures/thread.py)
  5.00%   5.00%   15.57s    15.57s   _do_waitpid (asyncio/unix_events.py)
  0.00%   1.00%   0.080s     1.03s   _run_once (asyncio/base_events.py)
  0.00%   0.00%   0.070s    0.090s   _execute_child (subprocess.py)
  0.00%   0.00%   0.070s    0.070s   _add_callback (asyncio/base_events.py)
  0.00%   1.00%   0.050s    0.880s   _run (asyncio/events.py)
  0.00%   0.00%   0.040s    0.040s   register (selectors.py)
  0.00%   0.00%   0.030s    0.030s   raw_decode (json/decoder.py)
```

so is it just jumping between different async items or really doing some useful work???

edit: some stats from ncdu. A LOT of files during the backup, then just a few left. At some point there were over 900,000 files in `.git/annex/journal`!

```
--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git/annex ---
    3.5 GiB [##########] 904.4k /journal
    2.0 MiB [          ]        index
    1.5 MiB [          ]      1 /keysdb
   12.0 KiB [          ]      3 /fsck
    4.0 KiB [          ]        index.lck
```

and separate objects (no packing performed) for each tiny file:

```
--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ---
    3.5 GiB [##########] 904.4k /annex
    1.7 GiB [####      ] 451.7k /objects
```

which then all get handled eventually, and .git/objects packed too:

```
--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ---
  254.8 MiB [##########]     16 /objects
  241.3 MiB [######### ]     18 /annex
   38.9 MiB [#         ]        index
```
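
The pattern above (hundreds of thousands of loose objects that later collapse into a handful of packs) is ordinary `git gc` behavior at an extreme scale. A throwaway-repo sketch of the effect, plain git only -- the real repos additionally carry the git-annex journal:

```shell
# create a disposable repo full of tiny files, then watch gc pack the loose objects
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email backup@example.com   # hypothetical identity, just for the demo commit
git config user.name backup
for i in $(seq 1 50); do echo "tiny chunk $i" > "chunk$i"; done
git add .
git commit -qm 'add tiny files'
# loose objects live at .git/objects/<2-hex-chars>/<rest-of-hash>
before=$(find .git/objects -type f -path '*/??/*' | wc -l | tr -d ' ')
git gc --quiet
after=$(find .git/objects -type f -path '*/??/*' | wc -l | tr -d ' ')
echo "loose objects: $before -> $after"
```

Each tiny file costs one loose blob (plus trees and the commit) until gc runs, which is why the interim on-disk footprint is so much larger than the packed end state.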
satra commented 10 months ago

which asset is this? i want to check that the shape/compression characteristics did not change in the process and this is indeed a hefty zarr (i.e. could be one of the 4mm slices).

also i'm going to start rolling out not storing rawest data but stitched data.

yarikoptic commented 10 months ago

```
❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .
{
  "name": "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr",
  "dandiset": "000026",
  "zarr_id": "5c37c233-222f-4e60-96e7-a7536e08ef61",
  "status": "Complete",
  "checksum": "4cb549b2e2346bb1a30f493b50fb6a2e-526320--1023396474554",
  "file_count": 526320,
  "size": 1023396474554
}
```
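
Dividing the reported `size` by `file_count` shows why this zarr is so painful for per-file machinery: the average chunk file is only about 2 MB, so nearly all of the backup cost is per-file overhead rather than bandwidth. A quick check:

```python
# figures from the zarr API response above
file_count = 526_320
total_bytes = 1_023_396_474_554

avg_bytes = total_bytes / file_count
print(f"{avg_bytes / 1e6:.2f} MB average per chunk file")  # about 1.94 MB
```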
satra commented 10 months ago

this is dandiset 26, not 108. it's probably the TB one. it's an entire hemisphere and more at 15um resolution.

satra commented 10 months ago

i didn't read "non" 000108 dandiset - i thought it was in 108. but this one is beautiful. yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

yarikoptic commented 10 months ago

> yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

is there a link?

satra commented 10 months ago

https://github.com/bids-standard/bids-specification/issues/1646

yarikoptic commented 6 months ago

Let's consider migrated to