Closed yarikoptic closed 4 years ago
[publish failure]
I wish ERROR message provided a more informative message (i.e. how it failed, some stderr or returned record from `datalad publish` invocation if done in Python).
Indeed, that was pretty hard to do anything with. It should be more helpful after 45baab076 (ENH: orchestrators: Provide access to datalad-publish results, 2020-05-14), which is in gh-506.
Running with local submitter and `datalad-pair-run` (I still can't figure out why my tree is left dirty when using `datalad-pair`), I was able to trigger a publish error, or rather multiple. The first one I saw was due to another "git tries to fetch non-existent origin" issue (gh-499). A similar issue was fixed in `update`, but this one is in `publish`.
DataLad's 8343899d3 (BF: publish: Tell git-fetch to not recurse into submodules, 2020-05-20), which you reviewed/merged yesterday (thanks!), resolves this issue. Running things again gets farther, but hits another `publish` error. Based on the spot where the failure happened in your log, I'd guess this might be the one you hit.
[ERROR ] 'datalad publish' failed: {'action': 'publish', 'path': '/home/kyle/scratch/reproman/gh-438/out6/data/fmriprep', 'type': 'dataset', 'refds': '/home/kyle/scratch/reproman/gh-438/out6', 'status': 'error', 'mode_src': 0, 'mode': 57344, 'revision_src': None, 'revision': '0cb506b90c06a7d7d7b88e30d2808fb50d7ebc27', 'parentds': '/home/kyle/scratch/reproman/gh-438/out6', 'state': 'added', 'type_src': None, 'registered_subds': True, 'process_content': True, 'message': ("Unknown target sibling '%s' for publication", 'local')} [orchestrators.py:prepare_remote:792] (OrchestratorError
So, it looks like `run` needs to do a better job of setting up the sibling in subdatasets that pop up after the initial run. You mentioned that running the suggested `publish` command manually worked, but I think that's just because the suggestion left out `--recursive` :/
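As a rough illustration of what a more readable error could look like, here is a minimal sketch that renders a result record like the one logged above. The helper names `format_result_message` and `summarize_failure` are hypothetical and are not part of ReproMan or DataLad; the only assumption taken from the log is that `message` can be a tuple of a `%`-style format string followed by its arguments.

```python
def format_result_message(record):
    """Render the 'message' field of a result record as plain text."""
    message = record.get("message", "")
    # In the logged record the message is (format_string, arg1, ...).
    if isinstance(message, tuple):
        fmt, *args = message
        try:
            message = fmt % tuple(args)
        except (TypeError, ValueError):
            # Fall back to a plain join if the arguments do not fit the format.
            message = " ".join(str(part) for part in (fmt, *args))
    return str(message)


def summarize_failure(record):
    """Build a one-line summary: action, status, path and rendered message."""
    fields = [record.get(key, "?") for key in ("action", "status", "path")]
    return "{} [{}] {}: {}".format(*fields, format_result_message(record))


# A trimmed copy of the record from the ERROR line above.
record = {
    "action": "publish",
    "path": "/home/kyle/scratch/reproman/gh-438/out6/data/fmriprep",
    "type": "dataset",
    "status": "error",
    "message": ("Unknown target sibling '%s' for publication", "local"),
}

summary = summarize_failure(record)
print(summary)
```

With a summary like this, the missing `local` sibling in the `data/fmriprep` subdataset is visible without reading the whole record.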
I just pushed b70144e993660c271831e4ea8d2f4bb436bb7eeb which adds those few env vars to avoid needing to patch script altogether just to e.g. run mriqc and have no FS license. Was about to finish the script to ease reproducing etc, but probably after hangout now
"cool" -- rerun completed with the same dirty failure but there is no failed job! status says succeeded for both. Will dig now into timing of things by adding PS4 with datetime and set -x
FWIW: gkhm... `dash` just stalls at the PS4 definition line if it has some `$()` to run the date inside... bizarre. So I think I will first try to replicate it with `bash` as the shell in the shebang, and then will add timing.
I think I got an idea for a reason, please correct me if I am wrong: output from `runscript` is getting dumped into those stdout/err files. But it is the same runscript (for the last job) which invokes `datalad add/save`. So the stdout/err file is still open and might not be flushed while `datalad add` commits it; then it exits, possibly flushing it (although possibly not introducing any additional output), and thus leaves it in a new "dirty" state. I got the idea from looking at the output with PS4 and set -x, where the diff was the last debug lines, which included running `add` and follow-up commands, and the modification timestamp of that file was almost a minute away from when `add` was invoked:
Thanks for the idea. On the scratch/use-case-local local branch, I've tried to force unbuffered output (4f05f3aa9). Sadly I still see modified files:
% git status
On branch master
Changes not staged for commit:
modified: .reproman/jobs/local/20200522-110817-9d1d/status.0
modified: .reproman/jobs/local/20200522-110817-9d1d/stderr.0
modified: .reproman/jobs/local/20200522-110817-9d1d/stdout.0
modified: data/mriqc (untracked content)
no changes added to commit
% git -C data/mriqc status
HEAD detached from 4c44b18
Untracked files:
sub-02/anat/
sub-02_T1w.html
nothing added to commit but untracked files present
This was using the local submitter and datalad-pair orchestrator on reproman's a9c984230 and datalad's 53765be03, with the following command:
CONTAINERS_REPO=$PWD/containers ~/src/python/reproman/docs/usecases/bids-fmriprep-workflow-NP.sh out7
indeed my finding should not explain the `data/mriqc` changes. I am yet to get to such a state (thanks for the exact committish -- I will try them). But as for stdout/stderr, I feel that maybe the final `datalad add/save` must simply be run outside of the `runscript`, even if that means sacrificing the logs for that portion. Even with unbuffered output for logging, I just worry that having those files still open for writing might be affecting it somehow.
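The mechanism hypothesized here can be sketched in isolation. The snippet below is only an illustration under stated assumptions: the file name `stdout.0` is borrowed from the job directory, the SHA-1 digest is a stand-in for the moment `datalad save` records the file content, and nothing here calls DataLad or ReproMan. Forcing unbuffered output did not remove the dirty files in the run above, so this shows why an open log file could end up modified after a save, not that it is the cause.

```python
import hashlib
import os
import tempfile


def digest(path):
    """Hash the bytes currently on disk for `path`."""
    with open(path, "rb") as stream:
        return hashlib.sha1(stream.read()).hexdigest()


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "stdout.0")

# Block-buffered text file, standing in for a redirected stdout.
log = open(log_path, "w", buffering=8192)
log.write("[ReproMan] post-command...\n")
log.flush()

# Stand-in for the point at which the save step records the content.
at_save = digest(log_path)

# Output written after the save stays in the in-process buffer...
log.write("late output\n")
unchanged_while_buffered = digest(log_path) == at_save

# ...and only reaches the disk when the writer exits (here: close()).
log.close()
changed_after_exit = digest(log_path) != at_save

print("unchanged while buffered:", unchanged_while_buffered)
print("changed after exit:", changed_after_exit)
```

If the log file is committed between the last flush and the writer's exit, the working tree ends up with a modified file even though the save itself succeeded.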
woohoo -- with my mighty script which I now pushed to https://github.com/ReproNim/reproman/pull/438/files#diff-3ef8f3b8f632d6ec977a4c41f41474ba I had reproduced your failure:
$> CONTAINERS_REPO=~/proj/repronim/containers ./bids-fmriprep-workflow-NP-reproduce.sh kyle1
..... lots of output , also logged ...
[INFO] Fetching updates for Dataset(/home/yoh/.tmp/rm-gWDCtE6/output/data/mriqc)
2020-05-22 18:02:53,862 [INFO] Getting outputs from 'local'
2020-05-22 18:02:54,208 [WARNING] 1 subjob failed. Check files in /home/yoh/.tmp/rm-gWDCtE6/output/.reproman/jobs/local/20200522-174214-7eb6/
2020-05-22 18:02:54,208 [INFO] 20200522-174214-7eb6 stderr: /home/yoh/.tmp/rm-gWDCtE6/output/.reproman/jobs/local/20200522-174214-7eb6/stderr.1
2020-05-22 18:02:54,948 [INFO] Finished with remote resource 'local'
2020-05-22 18:02:54,949 [INFO] Unregistered job 20200522-174214-7eb6
ex:0 > case "$do_group" in
ex:0 > reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work
ex:0 > reproman run --follow -r local --sub local --orc datalad-pair --jp container=containers/bids-mriqc --input data/bids --output data/mriqc '{inputs}' '{outputs}' group -w work
2020-05-22 18:02:57,011 [INFO] No root directory supplied for local; using '/home/yoh/.reproman/run-root'
2020-05-22 18:02:57,264 [ERROR] Remote repository /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9 is dirty [orchestrators.py:_assert_clean_repo:673] (OrchestratorError)
> echo Finished for setup=kyle1 under PWD=/home/yoh/.tmp/rm-gWDCtE6
Finished for setup=kyle1 under PWD=/home/yoh/.tmp/rm-gWDCtE6
$> datalad status -r -d /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9
modified: .reproman/jobs/local/20200522-174214-7eb6/status.0 (file)
modified: .reproman/jobs/local/20200522-174214-7eb6/stderr.0 (file)
modified: .reproman/jobs/local/20200522-174214-7eb6/stdout.0 (file)
modified: data/mriqc (dataset)
untracked: data/mriqc/sub-02 (directory)
untracked: data/mriqc/sub-02_T1w.html (file)
untracked: data/mriqc/sub-02_task-rhymejudgment_bold.html (file)
so it is for the succeeded job for which logs and/or results are not saved. The other one not there altogether (for sub 13,not 02) since failed due to locking issue:
(git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200522-174214-7eb6
$> cat stdout.1
[ReproMan] pre-command...
[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing 'data/bids' 'data/mriqc' participant --participant_label '13' -w work[ReproMan] ... within /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9[ReproMan] post-command...
Using DataLad version datalad 0.13.0rc1
$> cat stderr.1
error: could not lock config file /home/yoh/.reproman/run-root/0dbcc14e-9c75-11ea-be9b-ff519d1f6bc9/containers/binds/HOME/.gitconfig: File exists
$> cat status.1
failed: 255
I do not have a clue yet about locking... I guess I should place my #513 on top of your state and get more detailed info logged - may be would give a hint.
FWIW: I think I got it! (the "updated dataset")... yet to check... edit: pushed tentative fix to https://github.com/ReproNim/reproman/commit/23c5974c8a5b53dfe6fb0d4d9bbebc6bc6462369
Hmm, true, that makes sense given the file will be created at the start of the command. Thanks.
FWIW -- it worked! the run completed nicely. Could I leave it to you to make into a proper fix? Note that I rewrote it into 295d9d23548394d3c70bfc63fca6d75378099727 c325b01b5e8f4301dc2fdbf1b2aab8bb17f423cd (`-e` was incorrect to use there). But it might be not available on OSX etc, and probably `find` is not strictly necessary then.
I still think that `datalad add/save` should not be called within the same script to not possibly cause some output yet to be saved. Could be a separate `post-runscript` script which is not expected to produce any output to be saved and if does -- logged at reproman level.
re locking -- aha -- thanks to PS4 output it is due to singularity_cmd shim in ReproNim/containers:
(git-annex)lena:…11ea-be9b-ff519d1f6bc9[master].reproman/jobs/local/20200522-000958-46c0
$> grep 'not lock' -3 stderr.1
> 00:10:34.142854375 [2357873] echo '[ReproMan] executing command containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work'
> 00:10:34.146979943 [2357873] echo '[ReproMan] ... within /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9'
> 00:10:34.149859825 [2357873] /bin/sh -c 'containers/scripts/singularity_cmd run containers/images/bids/bids-mriqc--0.15.0.sing '\''data/bids'\'' '\''data/mriqc'\'' participant --participant_label '\''13'\'' -w work'
error: could not lock config file /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/containers/binds/HOME/.gitconfig: File exists
> 00:10:34.211776551 [2357873] echo 'failed: 255'
> 00:10:34.214277774 [2357873] mkdir -p /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200522-000958-46c0/failed
> 00:10:34.216103284 [2357873] touch /home/yoh/.reproman/run-root/0a372e4e-9be2-11ea-be9b-ff519d1f6bc9/.reproman/jobs/local/20200522-000958-46c0/failed/1
so most likely it is where I am trying to set up config in $HOME: https://github.com/ReproNim/containers/blob/master/scripts/singularity_cmd#L49, which happens in bloody quick succession; quite amazing that it is already enough to cause a race condition. Anyways -- now filed there https://github.com/ReproNim/containers/issues/43 and will fix it up later today
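For readers unfamiliar with this error: git guards config writes by exclusively creating a `<file>.lock` next to the config file, and a second writer arriving while the lock exists reports "could not lock config file ...: File exists". The sketch below imitates that collision with plain files in a temporary directory. The helpers `acquire_lock`, `release_lock` and `acquire_with_retry` are hypothetical, and the retry loop is just one possible mitigation, not the fix applied in ReproNim/containers.

```python
import os
import tempfile
import time


def acquire_lock(path):
    """Atomically create `path + '.lock'`; raises FileExistsError if it is held."""
    fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)


def release_lock(path):
    """Drop the lock so the next writer can proceed."""
    os.remove(path + ".lock")


def acquire_with_retry(path, attempts=5, delay=0.01):
    """Retry briefly instead of failing on the first collision."""
    for attempt in range(attempts):
        try:
            acquire_lock(path)
            return attempt  # number of failed attempts before success
        except FileExistsError:
            time.sleep(delay)
    raise TimeoutError("could not lock {}".format(path))


home = tempfile.mkdtemp()
config = os.path.join(home, ".gitconfig")

# The first subjob starts writing the config and holds the lock.
acquire_lock(config)

# The second subjob arrives a moment later and collides.
try:
    acquire_lock(config)
    collided = False
except FileExistsError:
    collided = True

# Once the first subjob is done, a retrying writer gets through.
release_lock(config)
retries = acquire_with_retry(config)
release_lock(config)

print("second writer collided:", collided)
print("failed attempts before success:", retries)
```

Two subjobs sharing one `HOME` and each writing the same `.gitconfig` at startup is enough to hit the collision branch.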
I've added a variation of your fix, as well as Christian's original macOS find fix, to gh-514.
I still think that `datalad add/save` should not be called within the same script to not possibly cause some output yet to be saved. Could be a separate `post-runscript` script which is not expected to produce any output to be saved and if does
I'm not opposed to the idea, but I don't see a way to do that that would work across the different submitters.
I tried with the latest push to scratch/use-case-local (23d2ce696) and got much farther.
$ FS_LICENSE=$PWD/license.txt CONTAINERS_REPO=$PWD/containers /home/kyle/src/python/reproman/docs/usecases/bids-fmriprep-workflow-NP.sh outN
My first attempt failed early with the license-related error message shown in 23d2ce696. That might be me misunderstanding how I'm supposed to specify the license, but to get past it I just added the license to git rather than git-annex (this is all local anyway). And that got all the way through the fmriprep run. There was however a license-related error during that run. Dunno about that, but overall it's good to see a complete run on my end.
I have not yet dared to try full cycle, will do now! Is it the license you have on smaug under `~kyle/.freesurfer-license`?
Is it the license you have on smaug under `~kyle/.freesurfer-license`
No, it's one I just obtained this week.
re license and git-annex: I guess nothing really anywhere tells annex to transfer the license file content to the actual place where execution happens. It is one of those "extra_inputs" which I believe we do not expose at the API/CLI level in `datalad (containers-|)run`, so I guess there is no way to instruct `reproman run` to upload datalad publish/push that file, right?
Is it the license you have on smaug under `~kyle/.freesurfer-license`
No, it's one I just obtained this week.
oh, that is strange. ok - let me try the full loop (straight on smaug) with that license I/we have there.
I guess nothing really anywhere tells annex to transfer the license file content to the actual place where execution happens
Right, but I thought a step that required the license was working for you at some point.
It is one of those "extra_inputs" which I believe we do not expose at the API/CLI level in `datalad (containers-|)run`, so I guess there is no way to instruct `reproman run` to upload datalad publish/push that file, right?
Can't you adjust your script to specify it as an input for the `reproman run` call? As discussed before, `reproman run` doesn't use `datalad run` or `datalad containers-run` on the remote, and those commands do not expose an --extra-input option.
Can't you adjust your script to specify it as an input for the `reproman run` call?
will do... I had "improved" and pushed my -reproduce script to create a new isolated HOME etc (pushed now into #438) and with current datalad 0.12.7 and reproman master (reproman-master in that script), I had failed with a new to me failure I believe:
(git)smaug:~/proj/repronim/reproman-master[doc-usecases]git
$> git describe; bash docs/usecases/bids-fmriprep-workflow-NP-reproduce.sh reproman-master
v0.2.1-100-g7920234
> setup=reproman-master
>> mktemp -d /home/yoh/.tmp/rm-XXXXXXX
> cd /home/yoh/.tmp/rm-tfbYoMv
> mkdir HOME
> cp /home/yoh/.gitconfig HOME/
> cp /home/yoh/.freesurfer-license HOME/
> export HOME=/home/yoh/.tmp/rm-tfbYoMv/HOME
...
2020-05-27 16:07:27,048 [INFO] Registered job 20200527-160657-0490
2020-05-27 16:07:27,076 [INFO] Waiting on job 2929230: running. Next heartbeat in 5 seconds
2020-05-27 16:07:27,208 [ERROR] CommandError: command 'git cat-file -p refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0' failed with exitcode 128
| Failed to run 'git cat-file -p refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0' under '/home/yoh/.tmp/rm-tfbYoMv/HOME/.reproman/run-root/940832c2-a055-11ea-a8dd-002590f97d84'. Exit code=128. out= err=fatal: Not a valid object name refs/reproman/20200527-160657-0490:.reproman/jobs/local/20200527-160657-0490/status.0
| [orchestrators.py:_execute_in_wdir:641] (OrchestratorError)
> echo Finished for setup=reproman-master under PWD=/home/yoh/.tmp/rm-tfbYoMv
Finished for setup=reproman-master under PWD=/home/yoh/.tmp/rm-tfbYoMv
bash docs/usecases/bids-fmriprep-workflow-NP-reproduce.sh reproman-master 36.65s user 18.32s system 75% cpu 1:12.58 total
could that somehow be a side effect from "custom" HOME?
could that somehow be a side effect from "custom" HOME?
I don't know.
datalad 0.12.7 and reproman master
If merging gh-506 will make you actually use it, then let's do that. We have enough different failures as is, so there's no sense chasing things that were fixed in DataLad and that branch.
If merging gh-506 will make you actually use it, then let's do that.
YES! you or me?
could that somehow be a side effect from "custom" HOME?
I don't know.
Seems to be, commenting out setting up HOME there avoids the puke. I will check it again after we merge and make sure that it works after we merge #506
YES! you or me?
asking because we are yet to add `datalad ~= 0.13.0` to setup.py so it gets installed upon e.g. `[datalad]` installation of reproman
Seems to be, commenting out setting up HOME there avoids the puke. I will check it again after we merge and make sure that it works after we merge #506
scratch that -- it did fail similarly later. must be something smaug specific (haven't run `reproman run` there for a while so have no baseline)
If merging gh-506 will make you actually use it, then let's do that.
YES! you or me?
I'll update the setup.py and take it out of draft. Will leave the merge to you.
I will check it again after we merge and make sure that it works after we merge #506
To be clear, I don't think a more recent datalad or that branch fixes the issue.
Can't you adjust your script to specify it as an input for the `reproman run` call?
will do...
FWIW, seems to work with some changes... will push later when sure that all other changes seem to be ok as well
With the recent success this issue can RiP
This was an attempt to run with @kyleam 's version of the script with only minor tune ups, as available from the branch scratch/use-case-local here. Command I had was
and full log try1.log is now at http://www.onerussian.com/tmp/reproman-try1.txt (renamed).
1. The overall failure was due to
while it went just fine upon checking and retrying after crash
```shell
lena:~/temp-test-reproman
$> datalad status -d /home/yoh/temp-test-reproman/try1
nothing to save, working tree clean
$> datalad siblings -d /home/yoh/temp-test-reproman/try1
.: here(+) [git]
.: local(+) [/home/yoh/.reproman/run-root/ac410208-90ae-11ea-889c-c3382a41054b (git)]
$> datalad status -d /home/yoh/.reproman/run-root/ac410208-90ae-11ea-889c-c3382a41054b
nothing to save, working tree clean
$> datalad publish -d /home/yoh/temp-test-reproman/try1 --to=local
```
which is strange. I wish ERROR message provided a more informative message (i.e. how it failed, some stderr or returned record from `datalad publish` invocation if done in Python).

2. In the log you could see above this failure
for running mriqc. so there is a WARNING but overall it seems it didn't cause the process to fail. @kyleam shouldn't overall `run --follow` fail if subjob fails?

Trusting no one, including myself, I have tested that it is not somehow my demo script, which should fail if an individual command fails due to `set -e` at the top... it fails as expected with the following patch
```shell
(git)lena:~/proj/repronim/reproman[scratch/use-case-local]
$> git diff
diff --git a/docs/usecases/bids-fmriprep-workflow-NP.sh b/docs/usecases/bids-fmriprep-workflow-NP.sh
index 3ac4eca..c54473c 100755
--- a/docs/usecases/bids-fmriprep-workflow-NP.sh
+++ b/docs/usecases/bids-fmriprep-workflow-NP.sh
@@ -97,7 +97,8 @@ unknown_runner () {
 # Common invocation of ReproMan
 # TODO: just make it configurable per project/env?
 reproman_run () {
-    reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" "$@"
+    /bin/ls bogus
+    # reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" "$@"
 }
```
FTR, the content of that stderr file was
my singularity-container from neurodebian is 2.6.1-2~nd100+1
DataLad 0.13.0rc1 WTF (dependencies, datalad)
# WTF
## datalad
- full_version: 0.13.0rc1
- version: 0.13.0rc1
## dependencies
- appdirs: 1.4.3
- boto: 2.49.0
- cmd:7z: 16.02
- cmd:annex: 7.20190708+git9-gfa3524b95-1~ndall+1
- cmd:bundled-git: 2.20.1
- cmd:git: 2.20.1
- cmd:system-git: 2.26.2
- cmd:system-ssh: 8.1p1
- exifread: 2.1.2
- git: 3.1.0
- gitdb: 4.0.2
- humanize: 2.3.0
- iso8601: 0.1.12
- keyring: 18.0.1
- keyrings.alt: 3.4.0
- msgpack: 0.6.2
- mutagen: 1.40.0
- requests: 2.23.0
- tqdm: 4.43.0
- wrapt: 1.11.2