DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Toil misses the Name tag on the AWS leader that toil launch-cluster deployed #2574

Closed: ngocnguyen closed this issue 1 year ago

ngocnguyen commented 5 years ago

Hello, I have a CWL workflow that works fine with toil-cwl-runner on my local machine, but it won't run in AWS using quay.io/ucsc_cgl/toil:3.18.0. This CWL workflow needs a directory; it mounts fine on my local machine, but the directory shows up empty when running in AWS.

adamnovak commented 5 years ago

Hello, @ngocnguyen,

Thanks for reporting this.

Can you elaborate on what you mean by needing a directory? Can you possibly provide the CWL file that is causing the problem?

If you have a directory full of files on your local machine, and you want to run a CWL workflow in Toil on AWS that makes use of those files, you probably have to copy those files up to the cloud. If you created your AWS cluster with toil launch-cluster, connected with toil ssh-cluster, and ran your workflow there without copying any files, the workflow would be looking for the directory of files on that cluster head node you just created; it has no way to connect back to your local workstation to get at the directory there.

The Right Way to solve that would be to upload your directory of files to S3 with the aws s3 cp command, and then point your workflow at the data in S3; you may be able to replace file:// IRIs in the input parameters to the CWL workflow with s3:// IRIs to accomplish that.

You could also copy the directory to the Toil cluster head node, with toil rsync-cluster.
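For example, roughly (a sketch, not commands from your setup; the bucket name, cluster name, zone, and paths are placeholders):

# upload the directory to S3 and point the workflow at s3:// IRIs
aws s3 cp ./sequence s3://my-bucket/sequence/ --recursive

# or copy it onto the cluster head node instead
toil rsync-cluster -z us-west-2a my-cluster -av ./sequence :/tmp/sequence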

ngocnguyen commented 5 years ago

Hello, @adamnovak,

Thanks for your reply. I have a simple test CWL just for this case; you only need one file, "example.fasta", in the current directory and some files in the "sequence" directory to run it. The output file is the output of the command "ls sequence". When I run it on my local machine, the output lists the files in the "sequence" directory; when I run it on the AWS cluster instance, the output file is empty.

ls-dir.yml.txt

ls-dir.cwl.txt

I have another problem with S3, since you mentioned it. I have a CWL workflow that works fine with a local file, but it doesn't work when I use an S3 URL as the input file. The error says the input file is a directory:

WARNING:toil.leader:w/T/jobkgjSNB INFO:cwltool:[job psmfile_filter.cwl] /tmp/tmpnEpHf7/3/9/out_tmpdir1ITVj5$ docker \
WARNING:toil.leader:w/T/jobkgjSNB     run \
WARNING:toil.leader:w/T/jobkgjSNB     -i \
WARNING:toil.leader:w/T/jobkgjSNB     --volume=/tmp/tmpnEpHf7/3/9/out_tmpdir1ITVj5:/var/spool/cwl:rw \
WARNING:toil.leader:w/T/jobkgjSNB     --volume=/home/ngoc/workspace/Toil/Study-125/tmp:/tmp:rw \
WARNING:toil.leader:w/T/jobkgjSNB     --volume=//s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv:/var/spool/cwl/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv:ro \
WARNING:toil.leader:w/T/jobkgjSNB     --workdir=/var/spool/cwl \
WARNING:toil.leader:w/T/jobkgjSNB     --read-only=true \
WARNING:toil.leader:w/T/jobkgjSNB     --user=1000:1000 \
WARNING:toil.leader:w/T/jobkgjSNB     --rm \
WARNING:toil.leader:w/T/jobkgjSNB     --env=TMPDIR=/tmp \
WARNING:toil.leader:w/T/jobkgjSNB     --env=HOME=/var/spool/cwl \
WARNING:toil.leader:w/T/jobkgjSNB     cptacdcc/psmfile_filter \
WARNING:toil.leader:w/T/jobkgjSNB     /var/spool/cwl/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv \
WARNING:toil.leader:w/T/jobkgjSNB     '' \
WARNING:toil.leader:w/T/jobkgjSNB     01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.psm \
WARNING:toil.leader:w/T/jobkgjSNB     1 \
WARNING:toil.leader:w/T/jobkgjSNB     tmt10
WARNING:toil.leader:w/T/jobkgjSNB Traceback (most recent call last):
WARNING:toil.leader:w/T/jobkgjSNB   File "/home/biodocker/bin/cptactools/psmfile/psmfile_filter.py", line 37, in <module>
WARNING:toil.leader:w/T/jobkgjSNB     inrows = csv.DictReader(open(infile),dialect='excel-tab')
WARNING:toil.leader:w/T/jobkgjSNB IOError: [Errno 21] Is a directory: '/var/spool/cwl/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv'

adamnovak commented 5 years ago

On the AWS instance (under toil ssh-cluster), when you run ls sequence, you get the same result of nothing being there, right? If that's the case, it's expected behavior for the CWL workflow to not be able to see anything there either. You probably need to make the directory on the AWS instance, and fill it with the files that you need to be there. The toil rsync-cluster command can help you copy files over, or you can just re-download them.


Your second problem looks to be caused by Toil trying to mount the nonexistent local file //s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv into a Docker container. When you tell Docker to mount something that doesn't (yet) exist, it mounts a new empty directory.
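You can see that Docker behavior for yourself with something like this (just an illustration; /does/not/exist is a made-up path):

docker run --rm -v /does/not/exist:/data busybox ls -la /data
# Docker creates /does/not/exist on the host as a new empty directory,
# so the container sees an empty /data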

According to the CWL spec's description of File objects, a File has one of a location, which is a URL to read the file from, a path, which is a local machine path to read the file from, or a contents, which is just the actual file data. Can you post the CWL and/or command you are using to try and point at the file in S3 here? It sounds like you may be sending the URL as path when it needs to be location.

ngocnguyen commented 5 years ago

No, I copied all the files from my local machine to the AWS instance with the toil rsync-cluster command. Then on the AWS instance (under toil ssh-cluster), when I run "ls sequence" all the files are there. But the CWL workflow lists the directory as empty.

I did use "location" for the input file:

infile_tsv:
  class: File
  location: "https://s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv"

ngocnguyen commented 5 years ago

@adamnovak I just launched a new cluster with quay.io/ucsc_cgl/toil:3.19.0 (the new version that was released 3 days ago) and it fixed my S3 URL problem. But it didn't fix my empty-directory problem. Have you had a chance to run my ls-dir.cwl yet? Thanks, Ngoc

adamnovak commented 5 years ago

@DailyDreaming Does the Toil CWL runner have the necessary machinery to bring directories from the Toil master to cluster nodes when CWL workflows take the directories as inputs? If not, then when the jobs run on the workers, they would be looking on the worker local filesystem instead.

ngocnguyen commented 5 years ago

@adamnovak @DailyDreaming Do you have an answer for this problem? Like I said, I can run this CWL workflow fine on a local machine, but it won't run on an AWS Toil master instance alone (not using any cluster nodes yet). Thanks, Ngoc

adamnovak commented 5 years ago

Maybe this has to do with the machine provided by toil launch-cluster really being a Docker container, with the home directory presumably not itself mounted from the host. We might be getting local filesystem paths mounted from the host, and thus empty, when we wanted them mounted across from the Toil appliance container, which isn't possible.

Can you give me the exact commands to run your ls-dir workflow and see it fail with the error you describe? I tried running this:

TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.19.0 toil launch-cluster amntest2 --keyPairName anovak@kolossus --leaderNodeType t2.micro --zone us-west-2a
toil ssh-cluster -z us-west-2a amntest2

wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt

mkdir sequence
touch sequence/lol

mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl

toil-cwl-runner ls-dir.cwl ls-dir.yml

And I got this other unrelated error:

Traceback (most recent call last):
  File "/usr/local/bin/toil-cwl-runner", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 1188, in main
    loading_context.overrides_list, tool_file_uri)
  File "/usr/local/lib/python2.7/dist-packages/cwltool/main.py", line 251, in load_job_order
    job_order_object, _ = loader.resolve_ref(job_order_file, checklinks=False)
  File "/usr/local/lib/python2.7/dist-packages/schema_salad/ref_resolver.py", line 590, in resolve_ref
    doc = self.fetch(doc_url, inject_ids=(not mixin))
  File "/usr/local/lib/python2.7/dist-packages/schema_salad/sourceline.py", line 168, in __exit__
    raise self.makeError(six.text_type(exc_value))
RuntimeError: [Errno 2] No such file or directory: ''
ngocnguyen commented 5 years ago

@adamnovak Yes, I got the same error with your exact steps. However, I can run my ls-dir workflow with a few modifications to your steps.

  1. Use t2.medium instead of t2.micro (t2.micro doesn't have enough memory).
  2. Create /data, then do your steps in /data instead of in /.
  3. The workflow also needs an example.fasta file; do 'touch example.fasta' in the /data directory.

Thanks, Ngoc

ngocnguyen commented 5 years ago

@adamnovak Were you able to run my ls-dir workflow? And did you try running it on your local machine?

adamnovak commented 5 years ago

OK, I got it to work.

The Toil master you ssh into is really a Docker container inside a host system. When Toil calls into Docker, it runs those containers as siblings of the container Toil is in, and not as children.

This means that any directories you try to mount into the Docker containers you run come from the host, and not the system where the Toil leader is running. So when this workflow uses a Docker container to list the contents of a directory, it does it on the host system where the directory doesn't exist. Docker makes nonexistent directories you mount as new empty directories, so the container sees an empty directory.

If you docker inspect toil_leader, you can see under Mounts that /tmp on the host is mounted to /tmp in the Toil container. So any Docker mounts of files and directories in /tmp will work as expected.
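For example, something like this (a sketch, assuming the leader container really is named toil_leader) prints each host path and where it lands in the container:

docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' toil_leader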

So you should work in /tmp/data instead of /data, and then it should work. Toil itself makes all the directories it wants to pass along to Docker containers in /tmp, but if you try to ship a directory to a container via CWL, you can expose this sibling-container relationship.

I ran these commands on a Toil leader node and got the expected output of "lol". Note that I had to shrink the default memory for the jobs because my leader node also somehow had only 1 GB of memory.

cd /tmp

mkdir data
cd data
echo ">seq" > example.fasta
echo "GATTACA" >> example.fasta

wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt

mkdir sequence
touch sequence/lol

mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl

toil-cwl-runner --defaultMemory 100M ls-dir.cwl ls-dir.yml

cat ls-dir-output.txt 
ngocnguyen commented 5 years ago

Great, I'll give it a try. Thanks Adam, Ngoc

mr-c commented 5 years ago

Thank you @adamnovak for tracking this down. Can the docs be clarified about this issue?

adamnovak commented 5 years ago

I guess this could become another tip in the "Tips" box in https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#running-a-cwl-workflow-on-aws

Something along the lines of: if you use the single-machine batch system on a cluster head node with a CWL workflow that in turn uses Docker, input/output from/to the local filesystem may not behave as expected, and you should try using paths under /tmp to work around it.

ngocnguyen commented 5 years ago

@adamnovak Please note that /tmp is only 2 GB, so we cannot run with a large dataset. Thanks, Ngoc

adamnovak commented 5 years ago

What's the use case for the cluster head node deployed by toil launch-cluster being usable to run a large data set in single machine mode?

If you are running with the Mesos batch system and AWS job store, like the toil launch-cluster cluster is intended to be used, then all the jobs will run on the worker nodes, and no jobs will run on the cluster head node. So the head node only needs minimal resources.

If you have a lot of data and you want to process it on a single machine on AWS with the single-machine batch system, you shouldn't use toil launch-cluster. Instead, you should just launch a normal, non-Toil-managed EC2 instance, install Toil from pip, and install Docker, if you need it, from your distribution's package manager (making sure to set it up to be usable by the user you want to run workflows as).
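Very roughly, on such an instance (a sketch assuming an Ubuntu-ish distribution; the package names and pip extras may differ for your setup):

sudo apt-get update && sudo apt-get install -y docker.io python-pip
pip install --user 'toil[cwl,aws]'
# let the workflow-running user talk to the Docker daemon
sudo usermod -aG docker $USER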

Is there a particular reason why you want to be running on the Toil-deployed cluster leader in single-machine mode, rather than on your own machine? Is it just that setup is simpler?


ngocnguyen commented 5 years ago

@adamnovak I have a lot of files that I want to run through my CWL workflow. This is my first step in trying to run it on AWS, and we are running into this problem. The documentation doesn't show me how to run a CWL workflow on the worker nodes; can you point me in the right direction? Can you show me how to run the ls-dir CWL workflow on a worker node? Thanks, Ngoc

adamnovak commented 5 years ago

I wanted to point you at https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#awscwl but I think that that part of the docs is actually wrong. It shows you doing something like this:

toil-cwl-runner --provisioner aws --jobStore aws:us-west-2a:any-name /tmp/example.cwl /tmp/example-job.yaml

But with just --provisioner aws I think it is going to try and use the single machine batch system with the AWS provisioner, which will never be called upon to provision anything, and the workflow will just run on the leader node.

You probably want the instructions here instead: https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#running-a-workflow-with-autoscaling-cactus

It gives a command like:

cactus --provisioner <aws, gce, azure> --nodeType <type> --maxNodes 2 --minNodes 0 --retry 10 --batchSystem mesos --disableCaching --logDebug --logFile /logFile_pestis3 --configFile /root/cact_ex/blockTrim3.xml <aws, google, azure>:<zone>:cactus-pestis /root/cact_ex/pestis-short-aws-seqFile.txt /root/cact_ex/pestis_output3.hal

You need --batchSystem mesos to point Toil at the Mesos installation that actually distributes jobs over the cluster, and you need --provisioner and --nodeTypes (which the example has as --nodeType for some reason) to actually have any nodes created. I've also always passed --mesosMaster=$(hostname -i):5050 to point explicitly at Mesos. And it's always good to set min and max nodes.

I think you want to blend all this together for something like, for your workflow:

toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes i3.xlarge --minNodes 0 --maxNodes 10  --jobStore aws:us-west-2a:lsdirstore1 ls-dir.cwl ls-dir.yml

That will make the cluster scale up from 0 to a max of 10 nodes as needed, using i3.xlarge nodes, and keeping job state in an AWS job store in us-west-2a named "lsdirstore1".

I'm not sure whether the CWL interpreter is smart enough to pack up the directory you are trying to list and ship it to the node that is going to try to list it. You might end up looking at that node's local filesystem instead. You might need to move your input directory from file:// to s3://.

This really needs to be added to the docs. We don't show CWL with autoscaling and Mesos right now as far as I can tell.

ngocnguyen commented 5 years ago

Hi Adam, This is the error I got when running your suggested command, can you help? Thanks, Ngoc

(venv) root@ip-172-31-19-65:/data# toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes t2.medium --minNodes 0 --maxNodes 10 --jobStore aws:us-east-1:lsdirstore1 ls-dir.cwl ls-dir.yml
INFO:cwltool:Resolved 'ls-dir.cwl' to 'file:///data/ls-dir.cwl'
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (4137283584).
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (35568545792).
Traceback (most recent call last):
  File "/usr/local/bin/toil-cwl-runner", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 1274, in main
    outobj = toil.start(wf1)
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 770, in start
    self._setProvisioner()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 815, in _setProvisioner
    sseKey=self.config.sseKey)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/__init__.py", line 33, in clusterFactory
    return AWSProvisioner(clusterName, zone, nodeStorage, sseKey)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 105, in __init__
    self._readClusterSettings()
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 116, in _readClusterSettings
    self.clusterName = str(instance.tags["Name"])
KeyError: 'Name'
(venv) root@ip-172-31-19-65:/data#


adamnovak commented 5 years ago

Hm. That looks like your AWS instance is missing a Name tag. In the AWS console, does it in fact have a Name tag?

The Name tag should have been set by toil launch-cluster to the cluster name you gave it (see https://github.com/DataBiosphere/toil/blob/d62429fa29f71df755d3310c802717c5d89b203e/src/toil/provisioners/aws/awsProvisioner.py#L162). If it has gotten lost somehow, you could probably recreate it through the console, or just tear down and redeploy the cluster.
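If you want to re-create the tag by hand instead of redeploying, something like this should do it (a sketch; the instance ID and cluster name are placeholders for your leader instance and the name you gave toil launch-cluster):

aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=Name,Value=my-cluster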


ngocnguyen commented 5 years ago

Yes, my AWS instance has a Name tag.


adamnovak commented 5 years ago

I'm not able to reproduce this issue. Here's what I did:

pip uninstall toil
pip install --user toil[aws]

TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.19.0 toil launch-cluster amntest2 --keyPairName anovak@kolossus --leaderNodeType t2.micro --zone us-west-2a

toil ssh-cluster -z us-west-2a amntest2

cd /tmp
mkdir data
cd data

wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt

mkdir sequence
touch sequence/lol

echo ">seq" > example.fasta
echo "GATTACA" >> example.fasta

mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl

toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes t2.medium --minNodes 0 --maxNodes 10 --jobStore aws:us-west-2:lsdirstore2 ls-dir.cwl ls-dir.yml

And here's the Toil output:

INFO:cwltool:Resolved 'ls-dir.cwl' to 'file:///tmp/data/ls-dir.cwl'
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (1033838592).
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (45382463488).
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:9jTEr8mxV54i7ChScoFipRmczJrnifx8tTQ6DB8Udi4 root@ip-172-31-23-143.us-west-2.compute.internal
The key's randomart image is:
+---[RSA 2048]----+
| + .             |
|  * .o . .       |
| o =..+   o      |
|o * .E.. . .     |
|.= o. ..S + . .  |
|..o...o= = * o . |
|  o ++= o O o o  |
|   o.*.. o o .   |
|    ..... .      |
+----[SHA256]-----+
Agent pid 86
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
INFO:toil:Running Toil version 3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6.
INFO:toil.leader:Issued job 'file:///tmp/data/ls-dir.cwl' 63f92b9b-c184-4fc1-8572-7d212e1203eb with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
INFO:toil.provisioners.clusterScaler:Adding 1 non-preemptable nodes to get to desired cluster size of 1.
INFO:toil:Using default user-defined custom docker init command of  as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
INFO:toil:Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
INFO:toil:Using default docker name of toil as TOIL_DOCKER_NAME is not set.
INFO:toil:Overriding docker appliance of quay.io/ucsc_cgl/toil:3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6 with quay.io/ucsc_cgl/toil:3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6 from TOIL_APPLIANCE_SELF.
INFO:toil.lib.ec2:Creating t2.medium instance(s) ... 
INFO:toil.leader:Job ended successfully: 'file:///tmp/data/ls-dir.cwl' 63f92b9b-c184-4fc1-8572-7d212e1203eb
INFO:toil.provisioners.clusterScaler:Removing 1 non-preemptable nodes to get to desired cluster size of 0.
INFO:toil.provisioners.aws.awsProvisioner:Terminating instance(s): [u'i-0aab84ddf907fa0f9']
INFO:toil.provisioners.aws.awsProvisioner:Instance(s) terminated.
INFO:toil.leader:Finished toil run successfully.
{
    "example_out": {
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709", 
        "basename": "ls-dir-output.txt", 
        "nameext": ".txt", 
        "nameroot": "ls-dir-output", 
        "location": "file:///tmp/data/ls-dir-output.txt", 
        "class": "File", 
        "size": 0
    }
INFO:toil.common:Successfully deleted the job store: <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb98d582dd0>
}

And the contents of the output file:

root@ip-172-31-23-143:/tmp/data# cat ls-dir-output.txt 
root@ip-172-31-23-143:/tmp/data# 

So I got the workflow to run through successfully when running on the actual cluster nodes. The actual directory was still empty from the perspective of the workflow, but I suspect that's because I'm trying to use a filesystem directory on a cluster that doesn't have a shared filesystem across the nodes; if I want the nodes to be able to see things consistently, I need to point the workflow at S3.

When you set up the cluster, did you put it in a zone in us-east-1 where you put your job store? I tried running with the job store in a different region than my cluster, and I got a bunch of 409 Conflict errors, but maybe that's not the only failure mode.

ngocnguyen commented 5 years ago

@adamnovak I can point a file input at an S3 object, but how do I point a directory at S3? S3 doesn't have a directory concept. Also, how do I manage the job store? I have been testing with different parameters and Toil says the job store already exists, but I can't delete it using toil clean.

adamnovak commented 5 years ago

You should be able to delete the job store with toil clean; it's something like toil clean aws:us-west-2:lsdirstore2. If that doesn't work, can you show what it does?

S3 treats common prefixes of objects with slashes in the name as pseudo-directories. I'm not sure if we implemented any support for it in Toil's CWL runner (@DailyDreaming might know), but if you store your files as s3://bucket/path/to/directory/file.whatever, then if you have s3://bucket/path/to/directory/ as a "directory" you can list all the files in the "directory" and add and remove files through the S3 API. There's just no concept of creating or destroying directories, or having empty directories, and you can also have data at the name corresponding to the "directory" itself.
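For example (a sketch with a placeholder bucket; the trailing slash is just part of the key prefix, not a real directory object):

aws s3 cp ./sequence s3://my-bucket/path/to/directory/ --recursive
aws s3 ls s3://my-bucket/path/to/directory/
# lists the objects that share the prefix; nothing is ever created or deleted for the "directory" itself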

ngocnguyen commented 5 years ago

Yeah, Toil's CWL runner doesn't recognize s3://bucket/path/to/directory/ as a directory; it fails with the error "No such file or directory".

Traceback (most recent call last):
  File "/home/ngoc/venv-toil-all/bin/toil-cwl-runner", line 10, in <module>
    sys.exit(main())
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1201, in main
    import_files(initialized_job_order)
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1176, in import_files
    get_listing, fs_access, recursive=True))
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 51, in adjustDirObjs
    visit_class(rec, ("Directory",), op)
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/cwltool/utils.py", line 225, in visit_class
    visit_class(rec[d], cls, op)
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/cwltool/utils.py", line 223, in visit_class
    op(rec)
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 112, in get_listing
    for ld in fs_access.listdir(loc):
  File "/home/ngoc/venv-toil-all/local/lib/python2.7/site-packages/cwltool/stdfsaccess.py", line 55, in listdir
    return [abspath(urllib.parse.quote(str(l)), fn) for l in os.listdir(self._abs(fn))]
OSError: [Errno 2] No such file or directory: 'https://s3.amazonaws.com/ngoclocal/sequence'

adamnovak commented 5 years ago

That should probably be another issue, to track developing that functionality. This one I think can keep representing the Name problem you described in https://github.com/DataBiosphere/toil/issues/2574#issuecomment-484548630.

@arostamianfar, you run Toil CWL on AWS all the time; have you ever had a problem with the Directory type not being implemented? Or do you just not use it?

arostamianfar commented 5 years ago

@adamnovak I haven't used the Directory type yet, but it looks like native S3 directory support is not implemented (native S3 file support was fixed in #2234). As a workaround, I currently use the mesosphere/aws-cli docker image and have a cp_to_s3.cwl task that copies files to/from S3, so I can control where and how to copy them.
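The underlying operation that kind of task performs is roughly this (a sketch; the paths and bucket are placeholders, and the real cp_to_s3.cwl wraps it as a CWL CommandLineTool):

docker run --rm -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -v /tmp/data:/data mesosphere/aws-cli s3 cp /data/myfile.tsv s3://my-bucket/path/myfile.tsv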

adamnovak commented 1 year ago

I'm going to close this; I tried to reproduce the missing Name tag and couldn't, and then we moved on to talking about CWL functionality that I think is also now done.