datalad / datalad

Keep code, data, containers under control with git and git-annex
http://datalad.org

main:Database.Handle error #7278

Open djarecka opened 1 year ago

djarecka commented 1 year ago

What is the problem?

I'm getting a weird error when running datalad run. I first saw this error some weeks ago on our cluster, but it didn't happen after re-running, so I tried to forget about it... but it got worse! This week I've been seeing it way too often.

This is what I get from the output file (the script is run with sbatch):

get(error): inputs/data/sub-MM273/ses-1year/anat/sub-MM273_ses-1year_run-1_T1w.nii.gz (file) [sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:79:40 in main:Database.Handle
sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:79:40 in main:Database.Handle
sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:79:40 in main:Database.Handle]

This happens when running datalad run -i code/fmriprep_run.sh -i inputs/data/sub-MM273 -i 'inputs/data/*json' -i containers/images/bids/bids-fmriprep--21.0.2.sing --explicit -o derivatives -m 'fmriprep:21.0.2 sub-MM273' code/remove-all-other-subjects-first.sh inputs/data sub-MM273 code/fmriprep_run.sh sub-MM273 21.0.2 1year

At first I thought it might happen only when two jobs have -i inputs/data/sub-MM273, but that's not true. I'm having trouble figuring out any pattern (except that it occurs more often now than some weeks ago, but I have no idea why).
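One way to look for a pattern is to tally the `get(...)` result lines in each sbatch output file and list the files that hit the sqlite crash. The following is only a minimal sketch: the function name `summarize_get_results` and the regular expression are hypothetical, and the line format is assumed to match the output pasted above (`get(status): path (file) [message`).

```python
import re
from collections import Counter

# Assumed line format, taken from the output pasted in this report:
#   get(ok): <path> (file) [from origin...]
#   get(error): <path> (file) [sqlite query crashed: ...
RESULT_LINE = re.compile(
    r"^get\((?P<status>\w+)\): (?P<path>.+?) \(file\)(?: \[(?P<msg>.*))?$"
)
SQLITE_CRASH = "sqlite query crashed"


def summarize_get_results(log_text):
    """Count get() results by status and collect paths that hit the sqlite crash."""
    counts = Counter()
    crashed = []
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match is None:
            continue  # call-stack continuation lines and unrelated output
        counts[match["status"]] += 1
        if match["status"] == "error" and SQLITE_CRASH in (match["msg"] or ""):
            crashed.append(match["path"])
    return counts, crashed


# Example using the error lines quoted in this report.
sample = """\
get(error): inputs/data/sub-MM273/ses-1year/anat/sub-MM273_ses-1year_run-1_T1w.nii.gz (file) [sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:79:40 in main:Database.Handle
"""
counts, crashed = summarize_get_results(sample)
print(counts["error"], crashed)
```

Running this over every job's output file and comparing the crashed paths (subject, session, position in the sequence) could show whether the failures cluster anywhere.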

What steps will reproduce the problem?

I wish I knew..

DataLad information

`datalad wtf` output

# WTF
## configuration
## credentials
  - keyring:
    - active_backends:
      - PlaintextKeyring with no encyption v.1.0 at /home/djarecka/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home/djarecka/.config/python_keyring/keyringrc.cfg
    - data_root: /home/djarecka/.local/share/python_keyring
## datalad
  - version: 0.18.1
## dataset
  - branches:
    - git-annex@ca34f6f
    - master@9167792
  - id: 36f32284-14ce-4db0-8af8-544be78bde61
  - path: /om2/user/djarecka/bootstrap/deb2_18_21_feb3a/analysis
  - repo: AnnexRepo
## dependencies
  - annexremote: 1.5.0
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20220927-geb4a544
  - cmd:bundled-git: UNKNOWN
  - cmd:git: 2.39.1
  - cmd:ssh: 9.2p1
  - cmd:system-git: 2.39.1
  - cmd:system-ssh: 9.2p1
  - humanize: 4.5.0
  - iso8601: 1.1.0
  - keyring: 23.13.1
  - keyrings.alt: 4.0.2
  - msgpack: 1.0.4
  - platformdirs: 2.6.2
  - requests: 2.28.2
## environment
  - LANG: en_US.UTF-8
  - PATH: /om2/user/djarecka/miniconda/envs/datalad_018/bin:/om2/user/djarecka/miniconda/condabin:/om/user/djarecka/elastix/bin:/home/djarecka/bin:/cm/shared/apps/gcc/4.8.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
## extensions
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions:
    - aws-0.22
    - bloomfilter-2.0.1.0
    - cryptonite-0.29
    - DAV-1.3.4
    - feed-1.3.2.0
    - ghc-8.10.7
    - http-client-0.7.9
    - persistent-sqlite-2.13.0.3
    - torrent-10000.1.1
    - uuid-1.3.15
    - yesod-1.6.1.2
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - local repository version: 10
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions:
    - 8
    - 9
    - 10
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20220927-geb4a544
## location
  - path: /om2/user/djarecka/bootstrap/deb2_18_21_feb3a/analysis
  - type: dataset
## python
  - implementation: CPython
  - version: 3.9.16
## system
  - distribution: centos/7/Core
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem:
    - CWD:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1536,noquota
      - path: /net/vast-storage.ib.cluster/scratch/vast/gablab/djarecka/bootstrap/deb2_18_21_feb3a/analysis
      - type: xfs
    - HOME:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1536,noquota
      - path: /home/djarecka
      - type: xfs
    - TMP:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1536,noquota
      - path: /tmp
      - type: xfs
  - max_path_length: 310
  - name: Linux
  - release: 3.10.0-1062.el7.x86_64
  - type: posix
  - version: #1 SMP Wed Aug 7 18:08:02 UTC 2019

Additional context

I can say that it also happened with datalad 0.16.

I showed this error once to @yarikoptic, and it seems he had seen the error at some point but believed it had been resolved. I can't find it in the issues.

Any idea where this is coming from? What should I check?

Have you had any success using DataLad before?

only successes ;-)

yarikoptic commented 1 year ago

Reminiscent of https://git-annex.branchable.com/bugs/get_is_busy_doing_nothing/. You are running 10.20220927-geb4a544, and that report last had action by @joeyh less than 4 months ago, so likely you are not yet using his "fix up" (unfortunately the sha in the version is unknown to annex history, uff, there was some oddity with building versions at some point). So I would first recommend trying a newer version of git-annex.

We might want to establish client testing of git-annex on your cluster to ensure that it stays kosher with it. Do you know whether cron jobs work fine there? (I will try)

djarecka commented 1 year ago

OK, I will try to update git-annex (Sunday or Monday; tomorrow I'm afk). The version comes from today's installation via conda.

djarecka commented 1 year ago

So this is the newest version I could get from conda, and since it is 10.2023*, I understand that it is newer than the one discussed in the bug report you linked, am I right?

[07:20][10.95][-23%]djarecka@openmind7:analysis$ git annex version
git-annex version: 10.20230126-g36f5557
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-8.10.7 http-client-0.7.9 persistent-sqlite-2.13.0.3 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.1.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10

Edit: I believe this version was uploaded to conda just a couple of days ago.
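For comparing the two version strings seen in this thread, a small sketch follows. It assumes the git-annex version has the form `<major>.<YYYYMMDD>-g<sha>`, which matches both strings quoted here; the helper names `parse_annex_version` and `is_newer` are hypothetical. Note that it only orders versions by their date stamp and cannot tell whether a particular fix is included in a build.

```python
import re
from datetime import datetime

# Assumed layout: <major>.<YYYYMMDD>[-g<short sha>], e.g. 10.20230126-g36f5557
VERSION_RE = re.compile(r"^(?P<major>\d+)\.(?P<stamp>\d{8})(?:-g(?P<sha>[0-9a-f]+))?$")


def parse_annex_version(version):
    """Return (major, release date) for a git-annex version string."""
    match = VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized git-annex version: {version!r}")
    released = datetime.strptime(match["stamp"], "%Y%m%d").date()
    return int(match["major"]), released


def is_newer(candidate, reference):
    """True if candidate has a later (major, date) than reference."""
    return parse_annex_version(candidate) > parse_annex_version(reference)


print(is_newer("10.20230126-g36f5557", "10.20220927-geb4a544"))  # True
```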

djarecka commented 1 year ago

I'm not sure if this is important, but today I've been testing the new version more, and I've noticed one new(?) thing: the error might occur after getting some of the files just fine (I believe previously I mostly/always saw the problem right away), e.g.:

Before running datalad run
I am in /om2/scratch/Sat/djarecka/deb2_18an_fake_feb9a/sub-MM274baseline/ds
get(ok): inputs/data/sub-MM274/ses-baseline/anat/sub-MM274_ses-baseline_run-1_T1w.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/anat/sub-MM274_ses-baseline_run-1_T2w.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/dwi/sub-MM274_ses-baseline_acq-DFC_dwi.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/dwi/sub-MM274_ses-baseline_acq-ORIG_dwi.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/fmap/sub-MM274_ses-baseline_acq-diff_dir-AP_run-1_epi.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/fmap/sub-MM274_ses-baseline_acq-task_dir-AP_run-1_epi.nii.gz (file) [from origin...]
get(ok): inputs/data/sub-MM274/ses-baseline/fmap/sub-MM274_ses-baseline_acq-task_dir-PA_run-1_epi.nii.gz (file)
get(error): inputs/data/sub-MM274/ses-baseline/func/sub-MM274_ses-baseline_task-mid_rec-moco_run-1_bold.nii.gz (file) [sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:82:40 in main:Database.Handle
sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:82:40 in main:Database.Handle
sqlite query crashed: thread blocked indefinitely in an MVar operation
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:82:40 in main:Database.Handle]

yarikoptic commented 1 year ago

I will later try to sense/reproduce on om and would need to alert @joeyh :-/ he might need access to the beast to troubleshoot... eh heh -- never a boring Thu ;)

djarecka commented 1 year ago

Let me know if I can help debug it. If you want to log in to om to see the files, the latest comes from /om2/user/djarecka/bootstrap/deb2_18an_fake_feb9a/analysis/logs/array_29361723_1.out, and it is the result of running sbatch code/sbatch_array_ses-baseline.sh (from /om2/user/djarecka/bootstrap/deb2_18an_fake_feb9a/analysis). However, it might not happen when you run it again. I think that today, with the newest version, it happens less often, but it could be just my lucky day ;-)

yarikoptic commented 1 year ago

FWIW, I had difficulties upgrading git-annex in conda on om. I added it as a test client for daily git-annex builds. Let's see how that works out. For now it would only run git-annex test.