Open jdkent opened 3 years ago
Thanks for the feedback.
I cannot run this code twice because I get a permissions error when copying the data over.
Presumably you see the same error if you use fabric directly to copy into the run-root directory shown in the output above:
python -c 'from fabric import Connection; Connection("slurm").put("foo", "/path/to/run-root/")'
Do you see the same if you use sftp or scp to copy the file into the run root?
If it's a general permissions issue, I'm not sure there's much to do aside from telling reproman run to use a different location for root_directory.
I need to make the output directory manually in my script since it is not copied over.
Hmm, the current state of leaving it to scripts to ensure that output directories exist seems okay to me, though I think it'd probably be fine for the prepare_remote method of orchestrator classes to create them.
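A hypothetical sketch of what such a prepare_remote step could do (the function and parameter names here are illustrative, not reproman's actual API): create each declared output directory ahead of the run, with `mkdir -p` semantics so reruns don't fail.

```python
import os


def ensure_output_dirs(run_root, output_dirs):
    """Create each declared output directory under run_root if it is
    missing (equivalent to `mkdir -p` on the remote). Safe to rerun."""
    for rel in output_dirs:
        os.makedirs(os.path.join(run_root, rel), exist_ok=True)
```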
the output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294)
Thanks for noticing that. That looks to be an interaction with the recently added launcher support. I don't know that we can get per-subjob output files in that case, but an accurate file name should at least be reported.
The permissions error does indeed persist with fabric and scp.
It does look possible to change the mode of the file when copying, so it could be overwritten later.
This may be a large todo, but I'm curious whether existing remote files and local files could be hashed and only copied over if they changed (when using singularity containers, it would be nice to only have to copy them over once).
It looks like it's possible to change permissions on a remote file: https://github.com/fabric/fabric/blob/35d7662ee020e8de236577a17571f1428c102479/fabric/transfer.py#L318
And hashing a file can be done in chunks so that it doesn't take too much memory: https://stackoverflow.com/questions/22058048/hashing-a-file-in-python
Or you could run a shell command like sha1sum on the remote machine and compare its output with the local file's hash (and if the remote machine does not have that command, just assume the files are different and copy them over).
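A rough sketch of that comparison (the helper names are hypothetical; the remote digest string would come from running e.g. sha1sum over the connection):

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files (e.g. singularity
    images) don't have to be read into memory all at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_copy(local_path, remote_sha1sum_output):
    """remote_sha1sum_output is the stdout of `sha1sum <file>` on the
    remote, or None if the command (or the file) isn't available there.
    When in doubt, assume the files differ and copy."""
    if not remote_sha1sum_output:
        return True
    remote_digest = remote_sha1sum_output.split()[0]
    return sha1_of_file(local_path) != remote_digest
```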
smaller ask: chmod the remote file (if it exists) so it can be overwritten.
larger ask: hash local and remote file (if it exists) and overwrite if local is different.
Both of your suggestions sound like good ideas to me.
smaller ask: chmod the remote file (if it exists) so it can be overwritten.
I think it'd be fine for the plain and datalad-local-run orchestrators to ensure that files have write permissions right after being copied, though I'd prefer not to touch files that are already on the remote. That'd solve the problem going forward, but existing locations would of course need to be adjusted manually.
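A local sketch of that "loosen permissions right after copying" idea (over fabric the same mode bits would be applied to the remote path; this is an illustration with made-up names, not the orchestrators' actual code):

```python
import os
import shutil
import stat


def copy_and_make_writable(src, dest):
    """Copy src to dest, then add the owner-write bit so a later run can
    overwrite dest even if src was read-only."""
    shutil.copy2(src, dest)  # copy2 preserves src's (possibly read-only) mode
    os.chmod(dest, os.stat(dest).st_mode | stat.S_IWUSR)
    return dest
```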
larger ask: hash local and remote file (if it exists) and overwrite if local is different.
For the plain and datalad-local-run orchestrators, this sounds good too. And the local and remote sizes can be compared to avoid hashing in a subset of cases.
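That size short-circuit could look something like this (hypothetical names; the remote size and digest would come from e.g. `stat` and `sha1sum` run over the connection):

```python
import os


def files_match(local_path, remote_size, remote_digest, digest_fn):
    """Compare cheaply by size first: differing sizes guarantee differing
    content, so the (more expensive) hash is only computed when the
    sizes agree."""
    if os.path.getsize(local_path) != remote_size:
        return False
    return digest_fn(local_path) == remote_digest
```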
For the other orchestrators, the target location is a Git repository, and git-annex/DataLad handles these details. I'm guessing you're using datalad-local-run because you don't have git-annex available on the remote, but if that's not the case, I'd recommend you use datalad-pair or datalad-pair-run.
reproman is looking awesome, it's so cool to be able to submit a job from my local machine to a HPC.
I ran into a couple snags when running the following code (using the reproman master branch) on TACC (lonestar5):
Snags
[ ] I cannot run this code twice because I get a permissions error when copying the data over.
[ ] I need to make the output directory manually in my script since it is not copied over.
[ ] (small thing) the output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294)