ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

Does reproman support globstar for --outputs? #581

Open jbwexler opened 2 years ago

jbwexler commented 2 years ago

I had a few failed subjects from an fmriprep run and tried to rerun from where they left off. However, I ran into a bunch of "permission denied" errors from the pre-existing outputs, despite the fact that I specified these as outputs in the reproman command using globstar to capture files recursively. Perhaps reproman doesn't support globstar yet? Or I'm making some error?

I'm using reproman v0.4.1, datalad v0.15.4 and git-annex v8.20210903-ga4d179c

Here are a few example errors:

> PermissionError: [Errno 13] Permission denied: '/scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep/sub-16/ses-mri/func/sub-16_ses-mri_task-facerecognition_run-9_desc-preproc_bold.nii.gz'

> Standard error:
> cp: cannot create regular file '/scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep/sourcedata/freesurfer/sub-16/scripts/lastcall.build-stamp.txt': Permission denied
> /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep/sourcedata/freesurfer/sub-16/scripts/recon-all.cmd: Permission denied.
> Return code: 1
Here's the spec.yaml:

```shell
(main) login1.frontera(1038)$ cat .reproman/jobs/local/20220106-161435-2d54/spec.yaml
_command_array:
- code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-fmriprep--21.0.0.sing
  sourcedata/raw /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
  participant --participant-label '02' -w '/scratch1/03201/jbwexler/work_dir/fmriprep//ds000117_sub-02'
  -vv --output-spaces MNI152NLin2009cAsym:res-2 anat func fsaverage5 --nthreads 14
  --omp-nthreads 7 --skip-bids-validation --notrack --fs-license-file /home1/03201/jbwexler/.freesurfer.txt
  --use-aroma --ignore slicetiming --output-layout bids --cifti-output --resource-monitor
  --skull-strip-t1w force --mem_mb 37500 --bids-database-dir /tmp --use-syn-sdc
- code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-fmriprep--21.0.0.sing
  sourcedata/raw /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
  participant --participant-label '16' -w '/scratch1/03201/jbwexler/work_dir/fmriprep//ds000117_sub-16'
  -vv --output-spaces MNI152NLin2009cAsym:res-2 anat func fsaverage5 --nthreads 14
  --omp-nthreads 7 --skip-bids-validation --notrack --fs-license-file /home1/03201/jbwexler/.freesurfer.txt
  --use-aroma --ignore slicetiming --output-layout bids --cifti-output --resource-monitor
  --skull-strip-t1w force --mem_mb 37500 --bids-database-dir /tmp --use-syn-sdc
_container_command_str: code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-fmriprep--21.0.0.sing
  sourcedata/raw /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
  participant --participant-label '{p[sub]}' -w '/scratch1/03201/jbwexler/work_dir/fmriprep//ds000117_sub-{p[sub]}'
  -vv --output-spaces MNI152NLin2009cAsym:res-2 anat func fsaverage5 --nthreads 14
  --omp-nthreads 7 --skip-bids-validation --notrack --fs-license-file /home1/03201/jbwexler/.freesurfer.txt
  --use-aroma --ignore slicetiming --output-layout bids --cifti-output --resource-monitor
  --skull-strip-t1w force --mem_mb 37500 --bids-database-dir /tmp --use-syn-sdc
_dataset_id: 6c997266-c091-49c7-845a-ebd84e38c046
_extra_inputs: &id001
- code/containers/images/bids/bids-fmriprep--21.0.0.sing
_extra_inputs_array:
- *id001
- *id001
_head: 7a04dfa852ea2a57d740db3a1d2f31160f3babfc
_inputs_array: []
_jobid: 20220106-161435-2d54
_meta_directory: /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep/.reproman/jobs/local/20220106-161435-2d54
_meta_directory_rel: .reproman/jobs/local/20220106-161435-2d54
_num_subjobs: 2
_outputs_array:
- - sub-01
  - sub-01.html
  - sub-02
  - sub-02.html
  - sub-03
  - sub-03.html
  - sub-04
  - sub-04.html
  - sub-05
  - sub-05.html
  - sub-06
  - sub-06.html
  - sub-07
  - sub-07.html
  - sub-08
  - sub-08.html
  - sub-09
  - sub-09.html
  - sub-10
  - sub-10.html
  - sub-11
  - sub-11.html
  - sub-12
  - sub-12.html
  - sub-13
  - sub-13.html
  - sub-14
  - sub-14.html
  - sub-15
  - sub-15.html
  - sub-16
  - sub-16.html
  - sourcedata/freesurfer/fsaverage
  - sourcedata/freesurfer/fsaverage5
  - sourcedata/freesurfer/sub-01
  - sourcedata/freesurfer/sub-02
  - sourcedata/freesurfer/sub-03
  - sourcedata/freesurfer/sub-04
  - sourcedata/freesurfer/sub-05
  - sourcedata/freesurfer/sub-06
  - sourcedata/freesurfer/sub-07
  - sourcedata/freesurfer/sub-08
  - sourcedata/freesurfer/sub-09
  - sourcedata/freesurfer/sub-10
  - sourcedata/freesurfer/sub-11
  - sourcedata/freesurfer/sub-12
  - sourcedata/freesurfer/sub-13
  - sourcedata/freesurfer/sub-14
  - sourcedata/freesurfer/sub-15
  - sourcedata/freesurfer/sub-16
- - sub-01
  - sub-01.html
  - sub-02
  - sub-02.html
  - sub-03
  - sub-03.html
  - sub-04
  - sub-04.html
  - sub-05
  - sub-05.html
  - sub-06
  - sub-06.html
  - sub-07
  - sub-07.html
  - sub-08
  - sub-08.html
  - sub-09
  - sub-09.html
  - sub-10
  - sub-10.html
  - sub-11
  - sub-11.html
  - sub-12
  - sub-12.html
  - sub-13
  - sub-13.html
  - sub-14
  - sub-14.html
  - sub-15
  - sub-15.html
  - sub-16
  - sub-16.html
  - sourcedata/freesurfer/fsaverage
  - sourcedata/freesurfer/fsaverage5
  - sourcedata/freesurfer/sub-01
  - sourcedata/freesurfer/sub-02
  - sourcedata/freesurfer/sub-03
  - sourcedata/freesurfer/sub-04
  - sourcedata/freesurfer/sub-05
  - sourcedata/freesurfer/sub-06
  - sourcedata/freesurfer/sub-07
  - sourcedata/freesurfer/sub-08
  - sourcedata/freesurfer/sub-09
  - sourcedata/freesurfer/sub-10
  - sourcedata/freesurfer/sub-11
  - sourcedata/freesurfer/sub-12
  - sourcedata/freesurfer/sub-13
  - sourcedata/freesurfer/sub-14
  - sourcedata/freesurfer/sub-15
  - sourcedata/freesurfer/sub-16
_reproman_version: 0.4.1
_resolved_batch_parameters:
- sub: '02'
- sub: '16'
_resolved_command_str: sourcedata/raw /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
  participant --participant-label '{p[sub]}' -w '/scratch1/03201/jbwexler/work_dir/fmriprep//ds000117_sub-{p[sub]}'
  -vv --output-spaces MNI152NLin2009cAsym:res-2 anat func fsaverage5 --nthreads 14
  --omp-nthreads 7 --skip-bids-validation --notrack --fs-license-file /home1/03201/jbwexler/.freesurfer.txt
  --use-aroma --ignore slicetiming --output-layout bids --cifti-output --resource-monitor
  --skull-strip-t1w force --mem_mb 37500 --bids-database-dir /tmp --use-syn-sdc
_spec_version: '1.0'
_submission_id: null
batch_parameters:
- sub=02,16
command:
- sourcedata/raw
- /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
- participant
- --participant-label
- '{p[sub]}'
- -w
- /scratch1/03201/jbwexler/work_dir/fmriprep//ds000117_sub-{p[sub]}
- -vv
- --output-spaces
- MNI152NLin2009cAsym:res-2
- anat
- func
- fsaverage5
- --nthreads
- '14'
- --omp-nthreads
- '7'
- --skip-bids-validation
- --notrack
- --fs-license-file
- /home1/03201/jbwexler/.freesurfer.txt
- --use-aroma
- --ignore
- slicetiming
- --output-layout
- bids
- --cifti-output
- --resource-monitor
- --skull-strip-t1w
- force
- --mem_mb
- '37500'
- --bids-database-dir
- /tmp
- --use-syn-sdc
container: code/containers/bids-fmriprep
job_name: ds000117-fmriprep
killjob_factors: .75,.15
launcher: 'true'
local_directory: /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
mail_type: END
mail_user: jbwexler@tutanota.com
num_nodes: '1'
num_processes: '2'
orchestrator: datalad-no-remote
outputs:
- sub**
- sourcedata/freesurfer/**
queue: small
resource_id: bc1235e8-b28c-11eb-bce1-e4434b618f52
resource_name: local
root_directory: /home1/03201/jbwexler/.reproman/run-root
submitter: slurm
walltime: '48:00:00'
working_directory: /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds000117-fmriprep
```

edit 1 by @yarikoptic: some formatting and added collapsing details around long paste

yarikoptic commented 2 years ago

Yes, globstar is not supported AFAIK. Only recently did we start matching ** in datalad: https://github.com/datalad/datalad/pull/6262 and I think we should do the same here in reproman. I believe GlobbedPaths in https://github.com/ReproNim/reproman/blob/HEAD/reproman/support/globbedpaths.py should get a recursive=True constructor option and pass it to glob.glob. Interested in submitting a PR?

But I am not 100% certain that is the issue here since unlock should just unlock all files under sub** (which would be just sub* without globstar -- all subject directories) folders. They aren't subdatasets, are they?
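To illustrate the point about `sub**` versus `sub*`, here is a minimal sketch using only Python's standard `glob` module. It does not use reproman's GlobbedPaths (whose internals are not shown in this thread); the helper names `expand` and `make_tree` and the tiny directory layout are hypothetical. Note that in the standard library `**` only recurses when it is a whole path component, so `sub**` behaves like `sub*` even with `recursive=True`, while `sub*/**` descends into the subject directories.

```python
import glob
import os
import tempfile


def expand(pattern, root, recursive):
    """Return sorted glob matches for `pattern`, relative to `root`."""
    matches = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(m, root) for m in matches)


def make_tree(root):
    """Create a tiny, made-up fmriprep-like layout under `root`."""
    for rel in ("sub-01/func/bold.nii.gz", "sub-01.html"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    # Without recursive globbing, "**" is just "*": top-level entries only.
    print(expand("sub**", root, recursive=False))
    # "sub**" is not a standalone "**" component, so it still does not recurse.
    print(expand("sub**", root, recursive=True))
    # A standalone "**" component with recursive=True walks the whole subtree.
    print(expand("sub*/**", root, recursive=True))
```

This matches what the `_outputs_array` in the spec.yaml above shows: only top-level `sub-*` entries were expanded.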

jbwexler commented 2 years ago

Thanks, sure I can work on a PR in a few weeks when I get some time.

No they aren't subdatasets. So unlock is already recursive? What is the benefit of globstar then?

jbwexler commented 2 years ago

Regardless, you are right that the issue isn't that the files weren't unlocked. I tried again with a sort of manual version of globstar like so:

--output "sub*" --output "sub*/*" --output "sub*/*/*" --output "sub*/*/*/*" --output "sub*/*/*/*/*" ...

And I still got permission denied errors, despite those specific files appearing in the _outputs_array in spec.yaml. Any ideas for what else could cause this? Or is it possible the files were recognized as outputs but didn't successfully get unlocked for some reason?

yarikoptic commented 2 years ago

Yeah, "something like that". I will try to reproduce with some minimal reproducer. Also, which TACC cluster is it on? I will see if I can access it and check it out in place.

jbwexler commented 2 years ago

Sounds good thanks, it's on Frontera.

jbwexler commented 2 years ago

Were you able to reproduce this? Or do you have any other ideas? It seems to happen any time I try to rerun fmriprep/mriqc on a subject that has been partially run. And as I mentioned, it happens even if the specific file mentioned in the "permission denied" error is explicitly listed as an output in spec.yaml.

yarikoptic commented 2 years ago

Sorry, I didn't have a chance yet. Will try to do it tomorrow or early next week. Thank you for the reminder!

jbwexler commented 2 years ago

So I tried git annex find --unlocked in the dataset while the job was running to see if the files were getting successfully unlocked. Interestingly, when I first checked for unlocked files (around 6m into the job), only a small number of files were printed. But each time I checked, the list grew. Meanwhile, looking at the logs, fmriprep was running this whole time. I'm guessing this is because files only get unlocked as needed? Perhaps for some reason certain calls that fmriprep is making aren't being recognized by reproman and thus certain files aren't being unlocked?

I should also note that while I only waited 30m before cancelling the job, the files that gave permission denied errors never showed up in the list of unlocked files. But after the job ended, I tried unlocking a few of these files manually using datalad unlock, which worked successfully without any issues.
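As a complement to `git annex find --unlocked`, the sketch below lists files that still look locked. It assumes a regular (non-adjusted) git-annex branch, where a locked annexed file is a symlink pointing into `.git/annex/objects`. The function names are hypothetical and the check is only a heuristic, not a replacement for git-annex's own view of the repository.

```python
import os


def is_locked_annex_file(path):
    """Heuristic: a locked annexed file is a symlink into .git/annex/objects."""
    if not os.path.islink(path):
        return False
    target = os.readlink(path).replace(os.sep, "/")
    return ".git/annex/objects/" in target


def find_locked(root):
    """Return sorted paths (relative to `root`) that look like locked annexed files."""
    locked = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the repository's own .git directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            full = os.path.join(dirpath, name)
            if is_locked_annex_file(full):
                locked.append(os.path.relpath(full, root))
    return sorted(locked)
```

Calling `find_locked(".")` from the dataset root while a job is running would show which outputs are still symlinks, and therefore which ones a plain write would fail on.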

Let me know if you have any other thoughts on this or ideas to test.

jbwexler commented 2 years ago

Here is the reproman command:

reproman run -r local --sub slurm --orc datalad-no-remote \
    --bp sub="$sub_list" --output . \
    --jp num_processes="$processes" --jp num_nodes="$nodes" \
    --jp walltime="$walltime" --jp queue="$queue" --jp launcher=true \
    --jp job_name="${raw_ds}-${software}" --jp mail_type=END --jp mail_user="$user_email" \
    --jp "container=code/containers/bids-${software}" --jp \
    killjob_factors="$killjob_factors" sourcedata/raw \
    "$derivatives_path" participant --participant-label '{p[sub]}' \
    -w "$work_dir/${raw_ds}_sub-{p[sub]}" -vv "${command[@]}"

command=("--output-spaces" "MNI152NLin2009cAsym:res-2" "anat" "func" "fsaverage5" "--nthreads" "14" \
    "--omp-nthreads" "7" "--skip-bids-validation" "--notrack" "--fs-license-file" "$fs_license" \
    "--use-aroma" "--ignore" "slicetiming" "--output-layout" "bids" "--cifti-output" "--resource-monitor" \
    "--skull-strip-t1w" "$skull_strip" "--mem_mb" "$mem_mb" "--bids-database-dir" "/tmp")

yarikoptic commented 2 years ago

the actual underlying issue is #546 in that reproman doesn't even try to unlock outputs!

The "dirty/easy" workaround -- remove the outputs you expect to be recomputed and save the result before running reproman run. Cons: freesurfer outputs are "expensive" and nipype/fmriprep can just load prior results from them, so we do not want to remove them; not acceptable.

More scalable workaround -- make "tricky" files either go directly to git (scripts etc.) or be committed unlocked in annex (all the .nii.gz which gave us trouble), so they are present not as symlinks but as regular files (copies). This is the winner so far.

More datalad-aware nipype and fmriprep -- in more places, replace a regular open('w') with an unlink first and then open('w'). Maybe later.
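The "unlink first, then open('w')" idea can be sketched as follows. This is only an illustration of the pattern, not code from nipype or fmriprep, and the helper name `open_for_overwrite` is hypothetical. The reasoning: a locked annexed file is a symlink to read-only content, so `open(path, 'w')` follows the symlink and fails with a permission error, whereas unlinking only needs write access to the containing directory and leaves the annexed content untouched.

```python
import os


def open_for_overwrite(path, mode="w", **kwargs):
    """Open `path` for writing after removing whatever currently sits there.

    Removing the directory entry first means a symlink to read-only content
    (such as a locked annexed file) is replaced by a fresh regular file
    instead of being written through.
    """
    if "w" not in mode:
        raise ValueError("open_for_overwrite is only meant for 'w' modes")
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # nothing to remove; a plain open() is enough
    return open(path, mode, **kwargs)
```

A caller would use it like a regular `open`, for example `with open_for_overwrite("recon-all.cmd") as f: f.write(...)`.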

related:

discovered oddity: annex.addunlocked probably can't be specified in .gitattributes and can only be in .git/config (thus not persist across clones) https://git-annex.branchable.com/todo/annex.addunlocked_in_gitattributes/? edit: git annex config could be used to set it "persistently" across clones (config which is stored in git-annex branch)

jbwexler commented 2 years ago

I was able to solve this using the "More scalable workaround". Essentially the following:

1) Delete all fmriprep results from the dataset pertaining to subjects that will be rerun.
2) Make sure any files shared between subjects (e.g. dataset_description.json) are in git or unlocked.
3) Make sure not to delete the fmriprep work_dir located outside of the dataset.
4) datalad unlock any "tricky" files, just the freesurfer scripts in my case.
5) Git commit these unlocked files (datalad save will not work, as it will lock any unlocked files before commit).
6) Rerun fmriprep on the relevant subjects.
7) git-annex lock to relock any unlocked files.
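Steps 4, 5 and 7 above could be scripted roughly as below. This is a sketch under assumptions, not a tested recipe: the function names are hypothetical, the `git add` before `git commit` is my assumption about how the unlocked files get staged (the thread only says "git commit"), and the example path is taken from the error messages earlier in the thread. By default nothing is executed; the commands are only printed.

```python
import shlex
import subprocess


def prep_commands(tricky_paths, message="Commit unlocked files before rerun"):
    """Commands for steps 4-5: unlock the tricky files and commit them unlocked."""
    paths = list(tricky_paths)
    return [
        ["datalad", "unlock", *paths],
        ["git", "add", "--", *paths],  # assumption: stage explicitly before commit
        ["git", "commit", "-m", message],
    ]


def relock_commands(tricky_paths):
    """Command for step 7: relock the files once the rerun has finished."""
    return [["git", "annex", "lock", *tricky_paths]]


def run_all(commands, cwd=".", dry_run=True):
    """Print each command; execute it in `cwd` only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


# Dry run with a path seen in the errors above; nothing is executed here.
run_all(prep_commands(["sourcedata/freesurfer/sub-16/scripts"]))
```

Keeping the plan as plain argument lists makes it easy to review exactly what would run before switching `dry_run` off.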