dandi / dandisets

717 Dandisets, 805.9 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets
git-annex died of signal 9 #33

Closed: djarecka closed this issue 3 years ago

djarecka commented 3 years ago

I'm downloading dandi data using datalad and this error was raised more than once:

```shell
[23:54][5.93][-83%]djarecka@openmind7:dandi$ datalad install -J5 -g -r https://github.com/dandi/dandi-api-datasets
[INFO   ] Installing Dataset(/om2/scratch/Sat/djarecka/dandi/dandi-api-datasets) to get /om2/scratch/Sat/djarecka/dandi/dandi-api-datasets recursively
Total:  23%|████████████                                         | 508G/2.22T [1:07:22<13641:18:14, 34.9kB/s]
[WARNING] Still have 7 active progress bars when stopping
  0%|████████████▉| 144G/144G [23:29<00:00, 5.61MB/s]
CommandError: 'git annex get -c annex.dotfiles=true -c annex.retry=3 --json --json-error-messages --json-progress -J5 -- .' failed with exitcode 137 under /rdma/vast-rdma/scratch/Sat/djarecka/dandi/dandi-api-datasets/dandisets/000003
error: git-annex died of signal 9
```
DataLad 0.13.7 WTF (configuration, datalad, dependencies, environment, extensions, git-annex, location, metadata_extractors, python, system)

# WTF
## configuration
## datalad
- full_version: 0.13.7
- version: 0.13.7
## dependencies
- annexremote: 1.5.0
- appdirs: 1.4.4
- boto: 2.49.0
- cmd:7z: 16.02
- cmd:annex: 8.20201129-geb388e6
- cmd:bundled-git: UNKNOWN
- cmd:git: 2.30.0
- cmd:system-git: 2.30.0
- cmd:system-ssh: 8.4p1
- humanize: 3.2.0
- iso8601: 0.1.13
- keyring: 22.0.1
- keyrings.alt: UNKNOWN
- msgpack: 1.0.2
- requests: 2.25.1
- wrapt: 1.12.1
## environment
- LANG: en_US.UTF-8
- PATH: /om/user/djarecka/elastix/bin:/home/djarecka/bin:/om2/user/djarecka/miniconda/envs/datalad/bin:/om2/user/djarecka/miniconda/condabin:/om/user/djarecka/elastix/bin:/home/djarecka/bin:/cm/shared/apps/gcc/4.8.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
## extensions
## git-annex
- build flags: Assistant, Webapp, Pairing, Inotify, DBus, DesktopNotify, TorrentParser, MagicMime, Feeds, Testsuite, S3, WebDAV
- dependency versions: aws-0.22, bloomfilter-2.0.1.0, cryptonite-0.26, DAV-1.3.4, feed-1.3.0.1, ghc-8.8.4, http-client-0.6.4.1, persistent-sqlite-2.10.6.2, torrent-10000.1.1, uuid-1.3.13, yesod-1.6.1.0
- key/value backends: SHA256E, SHA256, SHA512E, SHA512, SHA224E, SHA224, SHA384E, SHA384, SHA3_256E, SHA3_256, SHA3_512E, SHA3_512, SHA3_224E, SHA3_224, SHA3_384E, SHA3_384, SKEIN256E, SKEIN256, SKEIN512E, SKEIN512, BLAKE2B256E, BLAKE2B256, BLAKE2B512E, BLAKE2B512, BLAKE2B160E, BLAKE2B160, BLAKE2B224E, BLAKE2B224, BLAKE2B384E, BLAKE2B384, BLAKE2BP512E, BLAKE2BP512, BLAKE2S256E, BLAKE2S256, BLAKE2S160E, BLAKE2S160, BLAKE2S224E, BLAKE2S224, BLAKE2SP256E, BLAKE2SP256, BLAKE2SP224E, BLAKE2SP224, SHA1E, SHA1, MD5E, MD5, WORM, URL, X*
- operating system: linux x86_64
- remote types: git, gcrypt, p2p, S3, bup, directory, rsync, web, bittorrent, webdav, adb, tahoe, glacier, ddar, git-lfs, httpalso, borg, hook, external
- supported repository versions: 8
- upgrade supported from repository versions: 0, 1, 2, 3, 4, 5, 6, 7
- version: 8.20201129-geb388e6
## location
- path: /rdma/vast-rdma/scratch/Sat/djarecka/dandi
- type: directory
## metadata_extractors
- annex:
  - load_error: None
  - module: datalad.metadata.extractors.annex
  - version: None
- audio:
  - load_error: No module named 'mutagen' [audio.py::17]
  - module: datalad.metadata.extractors.audio
- datacite:
  - load_error: None
  - module: datalad.metadata.extractors.datacite
  - version: None
- datalad_core:
  - load_error: None
  - module: datalad.metadata.extractors.datalad_core
  - version: None
- datalad_rfc822:
  - load_error: None
  - module: datalad.metadata.extractors.datalad_rfc822
  - version: None
- exif:
  - load_error: No module named 'exifread' [exif.py::16]
  - module: datalad.metadata.extractors.exif
- frictionless_datapackage:
  - load_error: None
  - module: datalad.metadata.extractors.frictionless_datapackage
  - version: None
- image:
  - load_error: No module named 'PIL' [image.py::16]
  - module: datalad.metadata.extractors.image
- xmp:
  - load_error: No module named 'libxmp' [xmp.py::20]
  - module: datalad.metadata.extractors.xmp
## python
- implementation: CPython
- version: 3.8.6
## system
- distribution: centos/7/Core
- encoding:
  - default: utf-8
  - filesystem: utf-8
  - locale.prefered: UTF-8
- max_path_length: 287
- name: Linux
- release: 3.10.0-1062.el7.x86_64
- type: posix
- version: #1 SMP Wed Aug 7 18:08:02 UTC 2019

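A note on the two numbers in the error: exit code 137 and "signal 9" describe the same event. By shell convention, a process killed by a signal is reported with status 128 plus the signal number, and signal 9 is SIGKILL. A minimal sketch for decoding such a status follows; the function name `signal_from_exit` is hypothetical and not part of datalad or git-annex.

```shell
# Hypothetical helper: map a shell exit status to the signal that ended the process.
# By shell convention, a status above 128 means "killed by signal (status - 128)".
signal_from_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # `kill -l N` prints the name of signal number N
        echo "signal $sig ($(kill -l "$sig"))"
    else
        echo "exit status $status (not a signal)"
    fi
}
```

Calling `signal_from_exit 137` reports signal 9 (KILL). SIGKILL cannot be caught by the process itself, which is consistent with git-annex leaving no error message of its own.
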
yarikoptic commented 3 years ago

Thanks for the report! Could you please paste the output of `datalad wtf -D html_details` into the original issue description? It seems there were earlier reports of such an error (from git itself), and I want to make sure we are not talking about some ancient versions of things here.

djarecka commented 3 years ago

@yarikoptic - updated

yarikoptic commented 3 years ago

What is the output of your `limit` command? I wonder if the amount of resources (RSS etc.) is "restricted", which leads the system to kill some underlying process (git or git-annex) when it grabs "too much". git-annex, for example, at times requests quite a hefty chunk of address space (although it does not really use that memory, so it is not running out of RAM). So if you have some address-space limits imposed, that would explain it.
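In shells without a `limit` builtin, the same information is available through `ulimit`. The sketch below prints only the soft limits that seem most likely to get a long-running process killed; the function name `show_kill_prone_limits` is hypothetical, and which of these limits (if any) is responsible in this case is an assumption to be checked, not a conclusion.

```shell
# Hypothetical helper: print the soft limits most relevant to a process being killed.
# The flags are the ones `ulimit -a` shows in parentheses next to each limit.
show_kill_prone_limits() {
    echo "virtual memory (kbytes, -v): $(ulimit -S -v)"
    echo "max memory size (kbytes, -m): $(ulimit -S -m)"
    echo "data seg size (kbytes, -d): $(ulimit -S -d)"
    echo "cpu time (seconds, -t): $(ulimit -S -t)"
}
```

Running `show_kill_prone_limits` in the same shell (or batch job) that launches `datalad` matters, because limits are inherited by child processes and can differ between login nodes and compute nodes.
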

yarikoptic commented 3 years ago

There is no `limit` there, but here is what `ulimit -a` gives for me -- seems to be ok:

```shell
*$> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514789
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) 7200
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

yarikoptic commented 3 years ago

So it seems to boil down to system restrictions on the run time of processes... I filed the issue referenced above in datalad about better error reporting, and https://github.com/datalad/datalad/issues/5447 to possibly add an "auto restart" of such processes, but I am not sure when/if that would be implemented. For now I can only recommend gaining "super privileges" or checking whether the system could allow some processes (matched by name?) to run longer.
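Until something like the "auto restart" exists in datalad itself, a wrapper loop in the calling shell could approximate it. The following is only a sketch under assumptions: the function name `retry_on_sigkill` is hypothetical, it assumes the kill is caused by a per-process limit (so a fresh process gets a fresh allowance), and it assumes rerunning the command is safe because content that is already present gets skipped.

```shell
# Hypothetical wrapper: rerun a command while it keeps dying from signal 9.
# Usage: retry_on_sigkill MAX_ATTEMPTS command [args...]
retry_on_sigkill() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        "$@"
        status=$?
        # 137 = 128 + 9, i.e. the command was killed by SIGKILL;
        # any other status (success or a regular failure) is returned as-is
        if [ "$status" -ne 137 ] || [ "$attempt" -ge "$max_attempts" ]; then
            return "$status"
        fi
        echo "attempt $attempt killed by signal 9, restarting" >&2
        attempt=$((attempt + 1))
    done
}
```

For example, inside the already installed superdataset one might run `retry_on_sigkill 10 datalad get -J5 -r .`. The attempt cap keeps the loop from spinning forever if the process is killed for a reason that a restart cannot fix, such as running out of memory.
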