ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

Permission denied on TACC when copying data over #578

Open jdkent opened 3 years ago

jdkent commented 3 years ago

reproman is looking awesome; it's so cool to be able to submit a job from my local machine to an HPC.

I ran into a couple of snags when running the following code (using the reproman master branch) on TACC (lonestar5):

import os
from textwrap import dedent

import datalad.api as dapi
import reproman.api as rapi

##########################
# Create lonestar resource
##########################
USERNAME = 'jdk3232'
KEY_FILENAME = os.path.join(os.path.expanduser("~"), ".ssh", "id_rsa")
ls_params = {
    'user': USERNAME,
    'key_filename': KEY_FILENAME,
    'host': 'ls5.tacc.utexas.edu'
}

if not any('lonestar5' in resource[0] for resource in rapi.ls().values()):
    rapi.create(name="lonestar5", resource_type="ssh", backend_parameters=ls_params)

################
# Create dataset
################
if os.path.isdir('./example'):
    dataset = dapi.Dataset('./example')
else:
    dataset = dapi.create("./example")
    sub_dataset = dapi.create("./output", dataset=dataset)
    dataset.add_readme()
    # create script
    script = "mkdir -p output && pwd > output/pwd.txt"
    with open("./example/script.sh", "w+") as sc:
        sc.write(script)
    os.chmod("./example/script.sh", 0o777)
    dataset.save()

##############
# Run reproman
##############

jps = {
    "num_nodes": 1,
    "launcher": "true",
    "queue": "normal",
    "num_processes": 1,
    "walltime": 1,
}

os.chdir('./example')
rapi.run(
    command=['./script.sh'],
    resref="lonestar5",
    submitter="slurm",
    orchestrator="datalad-local-run",
    job_parameters=jps,
    inputs=["script.sh"],
    outputs=["output/pwd.txt"],
    follow=True,
    )

# remove example directory
# datalad remove -d example --nocheck -r ./example

Snags

1. I cannot run this code twice because I get a permissions error when copying the data over.
2. I need to make the output directory manually in my script since it is not copied over.
3. The output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294).

kyleam commented 3 years ago

Thanks for the feedback.

I cannot run this code twice because I get a permissions error when copying the data over.

Presumably you see the same error if you use fabric directly to copy into the run-root directory shown in the output above.

python -c 'from fabric import Connection; Connection("slurm").put("foo", "/path/to/run-root/")'

Do you see the same if you use sftp or scp to copy the file into the run root?

If it's a general permissions issue, I'm not sure there's much to do aside from telling reproman run to use a different location for root_directory.
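For example, assuming root_directory is accepted alongside the other job parameters (the scratch path here is a made-up placeholder):

jps = {
    "num_nodes": 1,
    "launcher": "true",
    "queue": "normal",
    "num_processes": 1,
    "walltime": 1,
    # Hypothetical: point the run root at a writable location.
    "root_directory": "/scratch/someuser/reproman-run-root",
}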

I need to make the output directory manually in my script since it is not copied over.

Hmm, the current state of leaving it to scripts to ensure that output directories exist seems okay to me, though I think it'd probably be fine for the prepare_remote method of orchestrator classes to create them.
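A rough sketch of that idea (not reproman's actual code; the session and working-directory arguments stand in for what an orchestrator would have on hand):

import os.path

def prepare_remote_outputs(session, working_directory, outputs):
    # Create the parent directory of every declared output on the
    # remote so scripts don't have to "mkdir -p" it themselves.
    out_dirs = {os.path.dirname(path) for path in outputs}
    for d in sorted(d for d in out_dirs if d):
        session.execute_command(["mkdir", "-p", os.path.join(working_directory, d)])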

the output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294)

Thanks for noticing that. That looks to be an interaction with the recently added launcher support (4294967294 is 2^32 - 2, presumably the value Slurm falls back to for the array task ID when the job isn't actually running as an array). I don't know that we can get per-subjob output files in that case, but an accurate file name should at least be reported.

jdkent commented 3 years ago

The permissions error does indeed persist with fabric and scp. It does look possible to change the mode of the file when copying, so it could be made overwritable later.

This may be a large todo, but I'm curious whether existing remote files and local files could be hashed and only copied over if they changed (when using Singularity containers, it would be nice to only have to copy them over once).

It looks like it's possible to change the permissions on a remote file: https://github.com/fabric/fabric/blob/35d7662ee020e8de236577a17571f1428c102479/fabric/transfer.py#L318
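For instance, something along these lines before an overwrite (put_overwritable is a made-up helper; Connection.sftp() and put come from fabric/paramiko):

import stat

from fabric import Connection

def put_overwritable(conn: Connection, local_path: str, remote_path: str):
    # If a copy already exists on the remote, make it user-writable
    # first so the upload can replace it.
    sftp = conn.sftp()
    try:
        mode = sftp.stat(remote_path).st_mode
        sftp.chmod(remote_path, mode | stat.S_IWUSR)
    except FileNotFoundError:
        pass  # nothing to overwrite yet
    conn.put(local_path, remote_path)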

And hashing a file can be done in chunks so as not to use too much memory: https://stackoverflow.com/questions/22058048/hashing-a-file-in-python. Or it looks like you could run a shell command like sha1sum on the remote machine and compare that with the local file's hash (and if the remote machine does not have that command, just assume the files are different and copy them over); see the sketch after the two asks below.

smaller ask: chmod the remote file (if it exists) so it can be overwritten.

larger ask: hash local and remote file (if it exists) and overwrite if local is different.
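A minimal sketch of the larger ask, assuming fabric (needs_copy is a made-up helper; warn=True keeps a failing remote command from raising):

import hashlib

from fabric import Connection

def needs_copy(conn: Connection, local_path: str, remote_path: str) -> bool:
    # Hash the local file in chunks to keep memory use bounded.
    sha1 = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    # Ask the remote for its hash; if sha1sum is unavailable or the
    # file doesn't exist, treat the files as different and copy.
    result = conn.run(f"sha1sum {remote_path}", warn=True, hide=True)
    if result.failed:
        return True
    return result.stdout.split()[0] != sha1.hexdigest()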

kyleam commented 3 years ago

Both of your suggestions sound like good ideas to me.

smaller ask: chmod the remote file (if it exists) so it can be overwritten.

I think it'd be fine for the plain and datalad-local-run orchestrators to ensure that files have write permissions right after being copied, though I'd prefer not to touch files that are already on the remote. That'd solve the problem going forward, but existing locations would of course need to be adjusted manually.

larger ask: hash local and remote file (if it exists) and overwrite if local is different.

For the plain and datalad-local-run orchestrators, this sounds good too. And the local and remote sizes can be compared to avoid hashing in a subset of cases.
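For instance, a cheap pre-check like this (sizes_differ is a made-up helper): a size mismatch already proves the files differ, and only equal sizes need the hash comparison above.

import os

from fabric import Connection

def sizes_differ(conn: Connection, local_path: str, remote_path: str) -> bool:
    # Different sizes guarantee different content; equal sizes still
    # require a content (hash) comparison.
    try:
        remote_size = conn.sftp().stat(remote_path).st_size
    except FileNotFoundError:
        return True  # no remote copy yet
    return os.path.getsize(local_path) != remote_size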

For the other orchestrators, the target location is a Git repository, and git-annex/DataLad handles these details. I'm guessing you're using datalad-local-run because you don't have git-annex available on the remote, but if that's not the case, I'd recommend you use datalad-pair or datalad-pair-run.