Report on running #438 usecases with datalad 0.13.0rc1 and "local" resource

yarikoptic commented 4 years ago

This was an attempt to run with @kyleam 's version of the script with only minor tune ups, as available from the branch scratch/use-case-local here. Command I had was

bash /home/yoh/proj/repronim/reproman/docs/usecases/bids-fmriprep-workflow-NP.sh try1 2>&1 | tee try1.log

and full log try1.log is now at http://www.onerussian.com/tmp/reproman-try1.txt (renamed).

1. The overall failure was due to

+ reproman run --follow -r local --sub local --orc datalad-pair --jp container=containers/bids-fmriprep --input data/bids --output data/fmriprep --bp pl=02,13 '{inputs}' '{outputs}' participant --participant_label '{p[pl]}' --fs-license-file=containers/licenses/freesurfer -w work
2020-05-07 18:14:13,964 [INFO] No root directory supplied for local; using '/home/yoh/.reproman/run-root'
[INFO] Publishing Dataset(/home/yoh/temp-test-reproman/try1) to local 
...
2020-05-07 18:14:14,954 [ERROR] 'datalad publish' failed. Try running 'datalad update -s local --merge --recursive' first [orchestrators.py:prepare_remote:793] (OrchestratorError)

while it went just fine upon checking and retrying after crash

```shell lena:~/temp-test-reproman $> datalad status -d /home/yoh/temp-test-reproman/try1 nothing to save, working tree clean $> datalad siblings -d /home/yoh/temp-test-reproman/try1 .: here(+) [git] .: local(+) [/home/yoh/.reproman/run-root/ac410208-90ae-11ea-889c-c3382a41054b (git)] $> datalad status -d /home/yoh/.reproman/run-root/ac410208-90ae-11ea-889c-c3382a41054b nothing to save, working tree clean $> datalad publish -d /home/yoh/temp-test-reproman/try1 --to=local ```

which is strange. I wish ERROR message provided a more informative message (i.e. how it failed, some stderr or returned record from datalad publish invocation if done in Python).

2. In the log you could see above this failure

2020-05-07 18:14:10,194 [INFO] Getting outputs from 'local'
2020-05-07 18:14:10,430 [WARNING] 1 subjob failed. Check files in /home/yoh/temp-test-reproman/try1/.reproman/jobs/local/20200507-181356-aeda/
2020-05-07 18:14:10,430 [INFO] 20200507-181356-aeda stderr: /home/yoh/temp-test-reproman/try1/.reproman/jobs/local/20200507-181356-aeda/stderr.0
2020-05-07 18:14:10,603 [INFO] Finished with remote resource 'local'
2020-05-07 18:14:10,604 [INFO] Unregistered job 20200507-181356-aeda

for running mriqc. so there is a WARNING but overall it seems it didn't cause the process to fail. @kyleam shouldn't overall run --follow fail if subjob fails?

trusting noone including myself I have tested that it is not somehow my demo script which should fail if individual command fails due to `set -e` on the top... it fails as expected with the following patch

```shell (git)lena:~/proj/repronim/reproman[scratch/use-case-local] $> git diff diff --git a/docs/usecases/bids-fmriprep-workflow-NP.sh b/docs/usecases/bids-fmriprep-workflow-NP.sh index 3ac4eca..c54473c 100755 --- a/docs/usecases/bids-fmriprep-workflow-NP.sh +++ b/docs/usecases/bids-fmriprep-workflow-NP.sh @@ -97,7 +97,8 @@ unknown_runner () { # Common invocation of ReproMan # TODO: just make it configurable per project/env? reproman_run () { - reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" "$@" + /bin/ls bogus + # reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" "$@" } ```

FTR, the content of that stderr file was

$> cat /home/yoh/temp-test-reproman/try1/.reproman/jobs/local/20200507-181356-aeda/stderr.0
ERROR  : Failed to mount squashfs image in (read only): Invalid argument
ABORT  : Retval = 255

my singularity-container from neurodebian is 2.6.1-2~nd100+1

DataLad 0.13.0rc1 WTF (dependencies, datalad)

# WTF ## datalad - full_version: 0.13.0rc1 - version: 0.13.0rc1 ## dependencies - appdirs: 1.4.3 - boto: 2.49.0 - cmd:7z: 16.02 - cmd:annex: 7.20190708+git9-gfa3524b95-1~ndall+1 - cmd:bundled-git: 2.20.1 - cmd:git: 2.20.1 - cmd:system-git: 2.26.2 - cmd:system-ssh: 8.1p1 - exifread: 2.1.2 - git: 3.1.0 - gitdb: 4.0.2 - humanize: 2.3.0 - iso8601: 0.1.12 - keyring: 18.0.1 - keyrings.alt: 3.4.0 - msgpack: 0.6.2 - mutagen: 1.40.0 - requests: 2.23.0 - tqdm: 4.43.0 - wrapt: 1.11.2

kyleam commented 4 years ago

[publish failure]

I wish ERROR message provided a more informative message (i.e. how it failed, some stderr or returned record from datalad publish invocation if done in Python).

Indeed, that was pretty hard to do anything with. It should be more helpful after 45baab076 (ENH: orchestrators: Provide access to datalad-publish results, 2020-05-14), which is in gh-506.

Running with local submitter and datalad-pair-run (I still can't figure out why my tree is left dirty when using datalad-pair), I was able to trigger a publish error, or rather multiple. The first one a saw was due to another "git tries to fetch non-existent origin" issue (gh-499). A similar issue was fixed in update, but this one is in publish.

error

``` [ERROR ] CommandError: 'git fetch --verbose --progress local' failed with exitcode 1 under /home/kyle/scratch/reproman/gh-438/out5 [err: 'remote: Total 25 (delta 4), reused 0 (delta 0), pack-reused 0 | From /home/kyle/.reproman/run-root/1e5d7eba-9ace-11ea-a5bd-54e1ada290c0 | 30381ed..97b1944 git-annex -> local/git-annex | 7dc5d91..b9c9f0a master -> local/master | Fetching submodule data/mriqc | fatal: 'origin' does not appear to be a git repository | fatal: Could not read from remote repository. | | Please make sure you have the correct access rights | and the repository exists. | Errors during submodule fetch: | data/mriqc'] [cmd.py:run:477] (CommandError) ```

DataLad's 8343899d3 (BF: publish: Tell git-fetch to not recurse into submodules, 2020-05-20), which you reviewed/merged yesterday (thanks!), resolves this issue. Running things again gets farther, but hits another publish error. Based on the spot were the failure happened in your log, I'd guess this might be the one you hit.

[ERROR  ] 'datalad publish' failed: {'action': 'publish', 'path': '/home/kyle/scratch/reproman/gh-438/out6/data/fmriprep', 'type': 'dataset', 'refds': '/home/kyle/scratch/reproman/gh-438/out6', 'status': 'error', 'mode_src': 0, 'mode': 57344, 'revision_src': None, 'revision': '0cb506b90c06a7d7d7b88e30d2808fb50d7ebc27', 'parentds': '/home/kyle/scratch/reproman/gh-438/out6', 'state': 'added', 'type_src': None, 'registered_subds': True, 'process_content': True, 'message': ("Unknown target sibling '%s' for publication", 'local')} [orchestrators.py:prepare_remote:792] (OrchestratorError

So, it looks like run needs to do a better job of setting up the sibling in subdatasets that pop up after the initial run. You mentioned that running the suggested publish command manually worked, but I think that's just because the suggestion left out --recursive :/

yarikoptic commented 4 years ago

Whoohoo -- got the dirty one. It seems to point to the logs from the failed subjob run:

```shell $> BIDS_APPS=mriqc FS_LICENSE=bogus RM_ORC=datalad-pair CONTAINERS_REPO=~/proj/repronim/containers INPUT_DATASET_REPO=https://github.com/ReproNim/ds000003-demo ./bids-fmriprep-workflow-NP.sh bids-fmriprep-workflow-NP/out1 ... 2020-05-21 13:48:01,188 [INFO ] Fetching results for 20200521-133350-efb0 2020-05-21 13:48:01,696 [WARNING] 1 subjob failed. Check files in /home/yoh/proj/repronim/reproman/docs/usecases/bids-fmriprep-workflow-NP/out1/.reproman/jobs/local/20200521-133350-efb0/ 2020-05-21 13:48:01,696 [INFO ] 20200521-133350-efb0 stderr: /home/yoh/proj/repronim/reproman/docs/usecases/bids-fmriprep-workflow-NP/out1/.reproman/jobs/local/20200521-133350-efb0/stderr.1 2020-05-21 13:48:01,697 [INFO ] Unregistered job 20200521-133350-efb0 ex:0 > case "$do_group" in ex:0 > reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work ex:0 > reproman run --follow -r local --sub local --orc datalad-pair --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work 2020-05-21 13:48:06,403 [INFO ] No root directory supplied for local; using '/home/yoh/.reproman/run-root' 2020-05-21 13:48:06,912 [ERROR ] Remote repository /home/yoh/.reproman/run-root/2bc9a91e-9b89-11ea-be9b-ff519d1f6bc9 is dirty [orchestrators.py:_assert_clean_repo:698] (OrchestratorError) BIDS_APPS=mriqc FS_LICENSE=bogus RM_ORC=datalad-pair CONTAINERS_REPO= = 38.47s user 26.24s system 7% cpu 14:45.34 total $> datalad status -d /home/yoh/.reproman/run-root/2bc9a91e-9b89-11ea-be9b-ff519d1f6bc9 modified: .reproman/jobs/local/20200521-133350-efb0/status.0 (file) modified: .reproman/jobs/local/20200521-133350-efb0/stderr.0 (file) modified: .reproman/jobs/local/20200521-133350-efb0/stdout.0 (file) changes on filesystem: ```

I just pushed b70144e993660c271831e4ea8d2f4bb436bb7eeb which adds those few env vars to avoid needing to patch script altogether just to e.g. run mriqc and have no FS license. Was about to finish the script to ease reproducing etc, but probably after hangout now

yarikoptic commented 4 years ago

"cool" -- rerun completed with the same dirty failure but there is no failed job! status says succeeded for both. Will dig now into timing of things by adding PS4 with datetime and set -x

yarikoptic commented 4 years ago

FWIW: gkhm... dash just stalls at PS4 definition line if it has some $() to run the date inside... bizzare

details

```shell $> head ./runscript #!/bin/sh set -eu export PS4='$(date +%T.%N) [$$]> ' set -x jobid=20200521-154644-9ca0 subjob=$1 num_subjobs=2 $> dash -x ./runscript + set -eu + export PS4=$(date +%T.%N) [$$]> ^C ```

so I think I will first try to replicate it with bash as the shell in shebang, and then will add timing.

yarikoptic commented 4 years ago

I think I got an idea for a reason, please correct me if I am wrong: output from runscript is getting dumped into those stdout/err files. But it is the same runscript (for the last job) which invokes datalad add/save. So the stdout/err file is still open and might be not flushed while datalad add commits it, then it exits possibly flushing it (although possibly not introducing any additional output), and thus leaving it in a new "dirty" state. Got an idea from looking at the output with PS4 and set -x where the diff was the last debug lines which included running add and follow up commands, and timestamp for modification of that file was almost a minute away from when add was invoked:

details from the shell

```shell (git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200521-223725-6b5a $> grep -h '^> [0-9]' stderr.[01] | sort -n --key 1 > 22:37:54.832090692 [2323864] jobid=20200521-223725-6b5a > 22:37:54.832097363 [2323866] jobid=20200521-223725-6b5a > 22:37:54.833061784 [2323864] subjob=0 > 22:37:54.833061882 [2323866] subjob=1 > 22:37:54.833885573 [2323866] num_subjobs=2 > 22:37:54.833936101 [2323864] num_subjobs=2 > 22:37:54.834668463 [2323866] metadir=/home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200521-223725-6b5a > 22:37:54.834712116 [2323864] metadir=/home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200521-223725-6b5a > 22:37:54.835391003 [2323866] rootdir=/home/yoh/.reproman/run-root > 22:37:54.835534832 [2323864] rootdir=/home/yoh/.reproman/run-root > 22:37:54.836338922 [2323866] workdir=/home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9 > 22:37:54.836479231 [2323864] workdir=/home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9 > 22:37:54.837426089 [2323866] _reproman_cmd_idx=2 > 22:37:54.837427537 [2323864] _reproman_cmd_idx=1 > 22:37:54.838376525 [2323866] export _reproman_cmd_idx > 22:37:54.838383374 [2323864] export _reproman_cmd_idx > 22:37:54.839278646 [2323866] echo submitted > 22:37:54.839297218 [2323864] echo submitted > 22:37:54.840196864 [2323864] echo '[ReproMan] pre-command...' > 22:37:54.840247596 [2323866] echo '[ReproMan] pre-command...' > 22:37:54.841075559 [2323864] cd /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9 > 22:37:54.841104519 [2323866] cd /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9 > 22:37:54.845909974 [2323864] cmd='containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''02'\'' -w work' > 22:37:54.845934297 [2323866] cmd='containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work' > 22:37:54.846795009 [2323864] test -z 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''02'\'' -w work' > 22:37:54.846879778 [2323866] test -z 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work' > 22:37:54.847687052 [2323864] echo running > 22:37:54.847741224 [2323866] echo running > 22:37:54.848990095 [2323866] echo '[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work' > 22:37:54.848990163 [2323864] echo '[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''02'\'' -w work' > 22:37:54.850215510 [2323866] echo '[ReproMan] ... within /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9' > 22:37:54.850311726 [2323864] echo '[ReproMan] ... within /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9' > 22:37:54.851460292 [2323866] /bin/sh -c 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work' > 22:37:54.851460318 [2323864] /bin/sh -c 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''02'\'' -w work' > 22:37:54.903422378 [2323864] echo 'failed: 255' > 22:37:54.906863733 [2323864] mkdir -p /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200521-223725-6b5a/failed > 22:37:54.912906001 [2323864] touch /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200521-223725-6b5a/failed/0 > 22:37:54.916351237 [2323864] test 2 -eq 1 > 22:48:17.650109139 [2323866] echo succeeded > 22:48:17.650936587 [2323866] test 2 -eq 2 > 22:48:17.651602275 [2323866] sleep 1 > 22:48:18.666950527 [2323866] test 2 -lt 2 > 22:48:18.668432611 [2323866] echo '[ReproMan] post-command...' > 22:48:19.015261930 [2323866] echo 'Using DataLad version datalad 0.12.6.dev51' > 22:48:19.016171283 [2323866] msg='[ReproMan] save results for 20200521-223725-6b5a' > 22:48:19.017042538 [2323866] datalad add --recursive '-m[ReproMan] save results for 20200521-223725-6b5a' . .reproman > 22:48:21.569308645 [2323866] git update-ref refs/reproman/20200521-223725-6b5a HEAD > 22:48:21.571661950 [2323866] mkdir -p /home/yoh/.reproman/run-root/completed/ > 22:48:21.573298257 [2323866] touch /home/yoh/.reproman/run-root/completed/20200521-223725-6b5a (dev3) 1 30896 [1].....................................:Fri 22 May 2020 12:01:13 AM EDT:. (git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200521-223725-6b5a $> cd /home/yoh/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9 containers/ data/ work/ (dev3) 1 30897 [1].....................................:Fri 22 May 2020 12:04:18 AM EDT:. (git-annex)lena:~/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9[master]git $> git status On branch master Your branch is ahead of 'origin/master' by 1 commit. (use "git push" to publish your local commits) Changes not staged for commit: (use "git add ..." to update what will be committed) (use "git restore ..." to discard changes in working directory) modified: .reproman/jobs/local/20200521-223725-6b5a/stderr.1 no changes added to commit (use "git add" and/or "git commit -a") (dev3) 1 30898 [1].....................................:Fri 22 May 2020 12:04:25 AM EDT:. (git-annex)lena:~/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9[master]git $> ls -ld .reproman/jobs/local/20200521-223725-6b5a/stderr.1 -rw------- 1 yoh yoh 10491 May 21 22:48 .reproman/jobs/local/20200521-223725-6b5a/stderr.1 (dev3) 1 30899 [1].....................................:Fri 22 May 2020 12:04:31 AM EDT:. (git-annex)lena:~/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9[master]git $> stat .reproman/jobs/local/20200521-223725-6b5a/stderr.1 File: .reproman/jobs/local/20200521-223725-6b5a/stderr.1 Size: 10491 Blocks: 24 IO Block: 4096 regular file Device: 32h/50d Inode: 39258826 Links: 1 Access: (0600/-rw-------) Uid: (47521/ yoh) Gid: (47522/ yoh) Access: 2020-05-21 22:37:54.815754737 -0400 Modify: 2020-05-21 22:48:21.571557037 -0400 Change: 2020-05-21 22:48:21.571557037 -0400 Birth: - (dev3) 1 30900 [1].....................................:Fri 22 May 2020 12:04:33 AM EDT:. (git-annex)lena:~/.reproman/run-root/1fe9afda-9bd5-11ea-be9b-ff519d1f6bc9[master]git $> git diff .reproman/jobs/local/20200521-223725-6b5a/stderr.1 diff --git a/.reproman/jobs/local/20200521-223725-6b5a/stderr.1 b/.reproman/jobs/local/20200521-223725-6b5a/stderr.1 index 41fc5de..21ef3a3 100644 --- a/.reproman/jobs/local/20200521-223725-6b5a/stderr.1 +++ b/.reproman/jobs/local/20200521-223725-6b5a/stderr.1 @@ -91,3 +91,6 @@ > 22:48:19.015261930 [2323866] echo 'Using DataLad version datalad 0.12.6.dev51' > 22:48:19.016171283 [2323866] msg='[ReproMan] save results for 20200521-223725-6b5a' > 22:48:19.017042538 [2323866] datalad add --recursive '-m[ReproMan] save results for 20200521-223725-6b5a' . .reproman +> 22:48:21.569308645 [2323866] git update-ref refs/reproman/20200521-223725-6b5a HEAD +> 22:48:21.571661950 [2323866] mkdir -p /home/yoh/.reproman/run-root/completed/ +> 22:48:21.573298257 [2323866] touch /home/yoh/.reproman/run-root/completed/20200521-223725-6b5a ```

kyleam commented 4 years ago

Thanks for the idea. On the scratch/use-case-local local branch, I've tried to force unbuffered output (4f05f3aa9). Sadly I still see modified files:

% git status
On branch master
Changes not staged for commit:
    modified:   .reproman/jobs/local/20200522-110817-9d1d/status.0
    modified:   .reproman/jobs/local/20200522-110817-9d1d/stderr.0
    modified:   .reproman/jobs/local/20200522-110817-9d1d/stdout.0
    modified:   data/mriqc (untracked content)

no changes added to commit

% git -C data/mriqc status
HEAD detached from 4c44b18
Untracked files:
    sub-02/anat/
    sub-02_T1w.html

nothing added to commit but untracked files present

This was using the local submitter and datalad-pair orchestrator on reproman's a9c984230 and datalad's 53765be03, with the following command:

CONTAINERS_REPO=$PWD/containers ~/src/python/reproman/docs/usecases/bids-fmriprep-workflow-NP.sh out7

yarikoptic commented 4 years ago

indeed my finding should not explain the data/mriqc changes. I am yet to get to such a state (thanks for exact committish -- I will try them). But as for stdout/stderr I feel that may be the final datalad add/save must simply be ran outside of the runscript even if it would be the sacrifice of the logs for that portion. Even if unbuffered output for logging, I just worry that somehow having those files still open for writing might be effecting it somehow.

yarikoptic commented 4 years ago

woohoo -- with my mighty script which I now pushed to https://github.com/ReproNim/reproman/pull/438/files#diff-3ef8f3b8f632d6ec977a4c41f41474ba I had reproduced your failure:

$> CONTAINERS_REPO=~/proj/repronim/containers ./bids-fmriprep-workflow-NP-reproduce.sh kyle1
..... lots of output , also logged ...
[INFO] Fetching updates for Dataset(/home/yoh/.tmp/rm-gWDCtE6/output/data/mriqc) 
2020-05-22 18:02:53,862 [INFO] Getting outputs from 'local' 
2020-05-22 18:02:54,208 [WARNING] 1 subjob failed. Check files in /home/yoh/.tmp/rm-gWDCtE6/output/.reproman/jobs/local/20200522-174214-7eb6/ 
2020-05-22 18:02:54,208 [INFO] 20200522-174214-7eb6 stderr: /home/yoh/.tmp/rm-gWDCtE6/output/.reproman/jobs/local/20200522-174214-7eb6/stderr.1 
2020-05-22 18:02:54,948 [INFO] Finished with remote resource 'local' 
2020-05-22 18:02:54,949 [INFO] Unregistered job 20200522-174214-7eb6 
ex:0 > case "$do_group" in
ex:0 > reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work
ex:0 > reproman run --follow -r local --sub local --orc datalad-pair --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work
2020-05-22 18:02:57,011 [INFO] No root directory supplied for local; using '/home/yoh/.reproman/run-root' 
2020-05-22 18:02:57,264 [ERROR] Remote repository /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9 is dirty [orchestrators.py:_assert_clean_repo:673] (OrchestratorError) 
> echo Finished for setup=kyle1 under PWD=/home/yoh/.tmp/rm-gWDCtE6
Finished for setup=kyle1 under PWD=/home/yoh/.tmp/rm-gWDCtE6

$> datalad status -r -d /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9
 modified: .reproman/jobs/local/20200522-174214-7eb6/status.0 (file)
 modified: .reproman/jobs/local/20200522-174214-7eb6/stderr.0 (file)
 modified: .reproman/jobs/local/20200522-174214-7eb6/stdout.0 (file)
 modified: data/mriqc (dataset)
untracked: data/mriqc/sub-02 (directory)
untracked: data/mriqc/sub-02_T1w.html (file)
untracked: data/mriqc/sub-02_task-rhymejudgment_bold.html (file)

so it is for the succeeded job for which logs and/or results are not saved. The other one not there altogether (for sub 13,not 02) since failed due to locking issue:

(git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200522-174214-7eb6
$> cat stdout.1
[ReproMan] pre-command...
[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing 'data/bids' 'data/mriqc' participant --participant_label '13' -w work[ReproMan] ... within /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9[ReproMan] post-command...
Using DataLad version datalad 0.13.0rc1

$> cat stderr.1
error: could not lock config file /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9/containers/binds/HOME/.gitconfig: File exists

$> cat status.1
failed: 255

I do not have a clue yet about locking... I guess I should place my #513 on top of your state and get more detailed info logged - may be would give a hint.

yarikoptic commented 4 years ago

FWIW: I think I got it! (the "updated dataset")... yet to check... edit: pushed tentative fix to https://github.com/ReproNim/reproman/commit/23c5974c8a5b53dfe6fb0d4d9bbebc6bc6462369

kyleam commented 4 years ago

edit: pushed tentative fix to 23c5974

Hmm, true, that makes sense given the file will be created at the start of the command. Thanks.

yarikoptic commented 4 years ago

FWIW -- it worked! the run completed nicely. Could I leave it to you to make into a proper fix? Note that I rewrote it into ~~295d9d23548394d3c70bfc63fca6d75378099727~~ c325b01b5e8f4301dc2fdbf1b2aab8bb17f423cd (-e was incorrect to use there). But it might be not available on OSX etc, and probably find is not strictly necessary then.

I still think that datalad add/save should not be called within the same script to not possibly cause some output yet to be saved. Could be a separate post-runscript script which is not expected to produce any output to be saved and if does -- logged at reproman level.

yarikoptic commented 4 years ago

re locking -- aha -- thanks to PS4 output it is due to singularity_cmd shim in ReproNim/containers:

(git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200522-000958-46c0
$> grep 'not lock' -3 stderr.1
> 00:10:34.142854375 [2357873] echo '[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work'
> 00:10:34.146979943 [2357873] echo '[ReproMan] ... within /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9'
> 00:10:34.149859825 [2357873] /bin/sh -c 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work'
error: could not lock config file /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/containers/binds/HOME/.gitconfig: File exists
> 00:10:34.211776551 [2357873] echo 'failed: 255'
> 00:10:34.214277774 [2357873] mkdir -p /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200522-000958-46c0/failed
> 00:10:34.216103284 [2357873] touch /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200522-000958-46c0/failed/1

so most likely where I am trying to setup config in $HOME: https://github.com/ReproNim/containers/blob/master/scripts/singularity_cmd#L49 which is in a bloody a quick succession, quite amazing that it is already enough to cause race condition. Anyways -- now filed there https://github.com/ReproNim/containers/issues/43 and will fix it up later today

kyleam commented 4 years ago

I've added added a variation of your fix, as well as Christian's original macOS find fix, to gh-514.

I still think that datalad add/save should not be called within the same script to not possibly cause some output yet to be saved. Could be a separate post-runscript script which is not expected to produce any output to be saved and if does

I'm not opposed to the idea, but I don't see a way to do that that would work across the different submitters.

kyleam commented 4 years ago

I tried with the latest push to scratch/use-case-local (23d2ce696) and got much farther.

$ FS_LICENSE=$PWD/license.txt CONTAINERS_REPO=$PWD/containers /home/kyle/src/python/reproman/docs/usecases/bids-fmriprep-workflow-NP.sh outN

My first attempt failed early with the license-related error message shown in 23d2ce696. That might be me misunderstanding how I'm supposed to specify the license, but to get passed it I just added the license to git rather than git-annex (this is all local anyway). And that got all the way through the fmriprep run. There was however a license-related error during that run. Dunno about that, but overall it's good to see a complete run on my end.

error in data/fmriprep/freesurfer/sub-13/scripts/recon-all.log

``` GNU libc version: 2.23 ERROR: Systems running GNU glibc version greater than 2.15 require a newly formatted license file (it's free). Please download a new one from the following page: http://surfer.nmr.mgh.harvard.edu/registration.html ------------------------------------------------------- ```

yarikoptic commented 4 years ago

I have not yet dared to try full cycle, will do now! Is it the license you have on smaug under ~kyle/.freesurfer-license?

kyleam commented 4 years ago

Is it the license you have on smaug under ~kyle/.freesurfer-license

No, it's one I just obtained this week.

yarikoptic commented 4 years ago

re license and git-annex: I guess nothing really anywhere tells annex to transfer the license file content to the actual place where execution happens. It is one of those "extra_inputs" which I believe we do not expose at the API/CLI level in datalad (containers-|)run, so I guess there is no way to instruct reproman run to upload datalad publish/push that file, right?

yarikoptic commented 4 years ago

Is it the license you have on smaug under ~kyle/.freesurfer-license

No, it's one I just obtained this week.

oh, that is strange. ok - let me try the full loop (straight on smaug) with that license I/we have there.

kyleam commented 4 years ago

I guess nothing really anywhere tells annex to transfer the license file content to the actual place where execution happens

Right, but I thought a step that required the license was working for you at some point.

kyleam commented 4 years ago

It is one of those "extra_inputs" which I believe we do not expose at the API/CLI level in datalad (containers-|)run, so I guess there is no way to instruct reproman run to upload datalad publish/push that file, right?

Can't you adjust your script to specify it as an input for the reproman run call? As discussed before, reproman run doesn't use datalad run or datalad containers-run on the remote, and those commands do not expose an --extra-input option.

yarikoptic commented 4 years ago

Can't you adjust your script to specify it as an input for the reproman run call?

will do... I had "improved" and pushed my -reproduce script to create a new isolated HOME etc (pushed now into #438) and with current datalad 0.12.7 and reproman master (reproman-master in that script), I had failed with a new to me failure I believe:

(git)smaug:~/proj/repronim/reproman-master[doc-usecases]git
$> git describe; bash docs/usecases/bids-fmriprep-workflow-NP-reproduce.sh reproman-master
v0.2.1-100-g7920234
> setup=reproman-master
>> mktemp -d /home/yoh/.tmp/rm-XXXXXXX
> cd /home/yoh/.tmp/rm-tfbYoMv
> mkdir HOME
> cp /home/yoh/.gitconfig HOME/
> cp /home/yoh/.freesurfer-license HOME/
> export HOME=/home/yoh/.tmp/rm-tfbYoMv/HOME
...
2020-05-27 16:07:27,048 [INFO] Registered job 20200527-160657-0490
2020-05-27 16:07:27,076 [INFO] Waiting on job 2929230: running. Next heartbeat in 5 seconds
2020-05-27 16:07:27,208 [ERROR] CommandError: command 'git cat-file -p refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0' failed with exitcode 128
| Failed to run 'git cat-file -p refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0' under '/home/yoh/.tmp/rm-tfbYoMv/HOME/.reproman/run-root/940832c2-a055-11ea-a8dd-002590f97d84'. Exit code=128. out= err=fatal: Not a valid object name refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0
|  [orchestrators.py:_execute_in_wdir:641] (OrchestratorError)
> echo Finished for setup=reproman-master under PWD=/home/yoh/.tmp/rm-tfbYoMv
Finished for setup=reproman-master under PWD=/home/yoh/.tmp/rm-tfbYoMv
bash docs/usecases/bids-fmriprep-workflow-NP-reproduce.sh reproman-master  36.65s user 18.32s system 75% cpu 1:12.58 total

could that somehow be a side effect from "custom" HOME?

kyleam commented 4 years ago

could that somehow be a side effect from "custom" HOME?

I don't know.

datalad 0.12.7 and reproman master

If merging gh-506 will make you actually use it, then let's do that. We have enough different failures as is, so there's no sense chasing things were fixed in Datalad and that branch.

yarikoptic commented 4 years ago

If merging gh-506 will make you actually use it, then let's do that.

YES! you or me?

could that somehow be a side effect from "custom" HOME?

I don't know.

Seems to be, commenting out setting up HOME there avoids the puke. I will check it again after we merge and make sure that it works after we merge #506

yarikoptic commented 4 years ago

YES! you or me?

asking because we are yet to add datalad ~= 0.13.0 to setup.py so it gets installed upon e.g. [datalad] installation of reproman

yarikoptic commented 4 years ago

Seems to be, commenting out setting up HOME there avoids the puke. I will check it again after we merge and make sure that it works after we merge #506

scratch that -- it did fail similarly later. must be something smaug specific (haven't ran there reproman run for a while so have no baseline)

kyleam commented 4 years ago

If merging gh-506 will make you actually use it, then let's do that.

YES! you or me?

I'll update the setup.py and take it out of draft. Will leave the merge to you.

I will check it again after we merge and make sure that it works after we merge #506

To be clear, I don't think a more recent datalad or that branch fixes the issue.

yarikoptic commented 4 years ago

Can't you adjust your script to specify it as an input for the reproman run call?

will do...

FWIW, seems to work with some changes... will push later when sure that all other changes seems to be ok as well

yarikoptic commented 4 years ago

With the recent success this issue can RiP

ReproNim / reproman

Report on running #438 usecases with datalad 0.13.0rc1 and "local" resource #511

1. The overall failure was due to

2. In the log you could see above this failure