Closed: ngocnguyen closed this issue 1 year ago.
Hello, @ngocnguyen,
Thanks for reporting this.
Can you elaborate on what you mean by needing a directory? Can you possibly provide the CWL file that is causing the problem?
If you have a directory full of files on your local machine, and you want to run a CWL workflow in Toil on AWS that makes use of those files, you probably have to copy those files up to the cloud. If you created your AWS cluster with `toil launch-cluster`, connected with `toil ssh-cluster`, and ran your workflow there without copying any files, the workflow would be looking for the directory of files on the cluster head node you just created; it has no way to connect back to your local workstation to get at the directory there.
The Right Way to solve that would be to upload your directory of files to S3 with the `aws s3 cp` command, and then point your workflow at the data in S3; you may be able to replace `file://` IRIs in the input parameters to the CWL workflow with `s3://` IRIs to accomplish that.
You could also copy the directory to the Toil cluster head node with `toil rsync-cluster`.
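For example (a hedged sketch; the input id, bucket name, and paths here are made up), swapping a `file://` IRI for an `s3://` IRI in a CWL job input might look like:

```yaml
# Before: reads from the local filesystem of whichever machine runs the job.
reads:
  class: File
  location: "file:///home/me/data/reads.fasta"

# After: reads from S3, reachable from any node ("my-bucket" is a placeholder).
# reads:
#   class: File
#   location: "s3://my-bucket/data/reads.fasta"
```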
Hello, @adamnovak,
Thanks for your reply. I have a simple test CWL just for this case; you only need one file, "example.fasta", in the current directory, and some files in the "sequence" directory, to run my test CWL. The output file will be the output of the command "ls sequence". Running on my local machine, the output lists the files in the "sequence" directory; running on the AWS cluster instance, the output file is empty.
Attachment: ls-dir.yml.txt
I have another problem, about S3, since you mentioned it. I have a CWL that works fine with a local file, but it doesn't when I use an S3 URL as the input file. The error says the input file is a directory:
WARNING:toil.leader:w/T/jobkgjSNB INFO:cwltool:[job psmfile_filter.cwl] /tmp/tmpnEpHf7/3/9/out_tmpdir1ITVj5$ docker \
WARNING:toil.leader:w/T/jobkgjSNB run \
WARNING:toil.leader:w/T/jobkgjSNB -i \
WARNING:toil.leader:w/T/jobkgjSNB --volume=/tmp/tmpnEpHf7/3/9/out_tmpdir1ITVj5:/var/spool/cwl:rw \
WARNING:toil.leader:w/T/jobkgjSNB --volume=/home/ngoc/workspace/Toil/Study-125/tmp:/tmp:rw \
WARNING:toil.leader:w/T/jobkgjSNB --volume=//s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv:/var/spool/cwl/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv:ro \
WARNING:toil.leader:w/T/jobkgjSNB --workdir=/var/spool/cwl \
WARNING:toil.leader:w/T/jobkgjSNB --read-only=true \
WARNING:toil.leader:w/T/jobkgjSNB --user=1000:1000 \
WARNING:toil.leader:w/T/jobkgjSNB --rm \
WARNING:toil.leader:w/T/jobkgjSNB --env=TMPDIR=/tmp \
WARNING:toil.leader:w/T/jobkgjSNB --env=HOME=/var/spool/cwl \
WARNING:toil.leader:w/T/jobkgjSNB cptacdcc/psmfile_filter \
WARNING:toil.leader:w/T/jobkgjSNB /var/spool/cwl/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv \
WARNING:toil.leader:w/T/jobkgjSNB '' \
WARNING:toil.leader:w/T/jobkgjSNB 01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.psm \
WARNING:toil.leader:w/T/jobkgjSNB 1 \
WARNING:toil.leader:w/T/jobkgjSNB tmt10
WARNING:toil.leader:w/T/jobkgjSNB Traceback (most recent call last):
WARNING:toil.leader:w/T/jobkgjSNB File "/home/biodocker/bin/cptactools/psmfile/psmfile_filter.py", line 37, in
On the AWS instance (under `toil ssh-cluster`), when you run `ls sequence`, you get the same result of nothing being there, right? If that's the case, it's expected behavior for the CWL workflow not to be able to see anything there either. You probably need to make the directory on the AWS instance and fill it with the files that you need to be there. The `toil rsync-cluster` command can help you copy files over, or you can just re-download them.
Your second problem looks to be caused by Toil trying to mount the nonexistent local file `//s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv` into a Docker container. When you tell Docker to mount something that doesn't (yet) exist, it mounts a new empty directory.
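You can see that Docker behavior directly (a minimal demo, assuming Docker and the `alpine` image are available; the host path is made up):

```shell
# Bind-mount a host path that does not exist. Instead of failing, the
# Docker daemon creates it on the host as a new directory, so the
# container sees an empty directory at the mount point.
docker run --rm -v /definitely/not/a/real/host/path:/mnt alpine ls -A /mnt
# Prints nothing: /mnt exists inside the container but is empty.
```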
According to the CWL spec's description of `File` objects, a `File` has one of a `location`, which is a URL to read the file from; a `path`, which is a local machine path to read the file from; or a `contents`, which is just the actual file data. Can you post the CWL and/or command you are using to try and point at the file in S3 here? It sounds like you may be sending the URL as `path` when it needs to be `location`.
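For illustration (a hedged sketch; the input id and values are made up), the three mutually exclusive ways a CWL `File` can be specified in a job input look roughly like:

```yaml
# 1. location: a URL/IRI that the runner resolves and stages itself.
infile:
  class: File
  location: "https://example.com/data/input.tsv"

# 2. path: a path that must already exist on the local filesystem.
# infile:
#   class: File
#   path: /home/me/data/input.tsv

# 3. contents: the (small) file body given literally, with no file on disk.
# infile:
#   class: File
#   contents: "col1\tcol2\n"
```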
No, I copied all files from my local machine to the AWS instance with the `toil rsync-cluster` command. Then on the AWS instance (under `toil ssh-cluster`), when I run `ls sequence`, all the files are there. But the CWL workflow lists it as empty.
I did use the "location" for the input-file.
infile_tsv:
  class: File
  location: "https://s3.amazonaws.com/ngoclocal/01CPTAC_CompRef_UCEC_W_PNNL_20170922_B1S5_f01.tsv"
@adamnovak I just launched a new cluster with quay.io/ucsc_cgl/toil:3.19.0 (the new version that was just released 3 days ago) and it fixed my S3 URL problem. But it didn't fix my empty directory. Did you have a chance to run my ls-dir.cwl yet? Thanks, Ngoc
@DailyDreaming Does the Toil CWL runner have the necessary machinery to bring directories from the Toil master to cluster nodes when CWL workflows take the directories as inputs? If not, then when the jobs run on the workers, they would be looking on the worker local filesystem instead.
@adamnovak @DailyDreaming Do you have an answer for this problem? Like I said, I can run this CWL workflow fine on a local machine, but it wouldn't run on an AWS toil master instance alone (not using any cluster nodes yet). Thanks, Ngoc
Maybe this has to do with the `toil launch-cluster`-provided machine really being a Docker container, and the home directory presumably not itself being mounted from the host. We might be getting local filesystem paths mounted from the host, and thus empty, when we wanted them mounted across from the Toil appliance container, which isn't possible.
Can you give me the exact commands to run your ls-dir workflow and see it fail with the error you describe? I tried running this:
TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.19.0 toil launch-cluster amntest2 --keyPairName anovak@kolossus --leaderNodeType t2.micro --zone us-west-2a
toil ssh-cluster -z us-west-2a amntest2
wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt
mkdir sequence
touch sequence/lol
mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl
toil-cwl-runner ls-dir.cwl ls-dir.yml
And I got this other unrelated error:
Traceback (most recent call last):
File "/usr/local/bin/toil-cwl-runner", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 1188, in main
loading_context.overrides_list, tool_file_uri)
File "/usr/local/lib/python2.7/dist-packages/cwltool/main.py", line 251, in load_job_order
job_order_object, _ = loader.resolve_ref(job_order_file, checklinks=False)
File "/usr/local/lib/python2.7/dist-packages/schema_salad/ref_resolver.py", line 590, in resolve_ref
doc = self.fetch(doc_url, inject_ids=(not mixin))
File "/usr/local/lib/python2.7/dist-packages/schema_salad/sourceline.py", line 168, in __exit__
raise self.makeError(six.text_type(exc_value))
RuntimeError: [Errno 2] No such file or directory: ''
@adamnovak Yes, I got the same error with your exact steps. However, I can run my ls-dir workflow with a few modifications to your steps.
Thanks, Ngoc
@adamnovak Are you able to run my ls-dir workflow? And did you try to run it on your local machine?
OK, I got it to work.
The Toil master you ssh into is really a Docker container inside a host system. When Toil calls into Docker, it runs those containers as siblings of the container Toil is in, and not as children.
This means that any directories you try to mount into the Docker containers you run come from the host, and not the system where the Toil leader is running. So when this workflow uses a Docker container to list the contents of a directory, it does it on the host system where the directory doesn't exist. Docker makes nonexistent directories you mount as new empty directories, so the container sees an empty directory.
If you `docker inspect toil_leader`, you can see under `Mounts` that `/tmp` on the host is mounted to `/tmp` in the Toil container. So any Docker mounts of files and directories in `/tmp` will work as expected.
So, you should work in `/tmp/data` instead of `/data`, and it should work. Toil itself makes all the directories it wants to pass along to Docker containers in `/tmp`, but if you try to ship a directory along to a container via CWL, you can expose this sibling-container relationship.
I ran these commands on a Toil leader node and got the expected output of "lol". Note that I had to shrink the default memory for the jobs because my leader node also somehow had only 1 GB of memory.
cd /tmp
mkdir data
cd data
echo ">seq" > example.fasta
echo "GATTACA" >> example.fasta
wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt
mkdir sequence
touch sequence/lol
mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl
toil-cwl-runner --defaultMemory 100M ls-dir.cwl ls-dir.yml
cat ls-dir-output.txt
Great, I'll give it a try. Thanks Adam, Ngoc
Thank you @adamnovak for tracking this down. Can the docs be clarified about this issue?
I guess this could become another tip in the "Tips" box in https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#running-a-cwl-workflow-on-aws
Something along the lines of, if you use the single machine batch system on a cluster head node with a CWL workflow that in turn uses Docker, input/output from/to the local filesystem may not behave as expected, and that you should try using paths under /tmp to work around it.
@adamnovak Please note the size of /tmp is 2G, so we cannot run with a large dataset. Thanks, Ngoc
What's the use case for the cluster head node deployed by `toil launch-cluster` being usable to run a large data set in single-machine mode?
If you are running with the Mesos batch system and AWS job store, as the `toil launch-cluster` cluster is intended to be used, then all the jobs will run on the worker nodes, and no jobs will run on the cluster head node. So the head node only needs minimal resources.
If you have a lot of data and you want to process it on a single machine on AWS with the single machine batch system, you shouldn't use `toil launch-cluster`. Instead, you should just launch a normal non-Toil-managed EC2 instance, install Toil from pip, and install Docker, if you need it, from your distribution's package manager (making sure to set it up to be usable by the user you want to run workflows as).
Is there a particular reason why you want to be running on the Toil-deployed cluster leader in single-machine mode, rather than on your own machine? Is it just that setup is simpler?
@adamnovak I have a lot of files that I want to run through my CWL workflow. This is my first step in trying to run it on AWS, and we are running into this problem. The documentation doesn't show me how to run a CWL workflow on the worker nodes; can you point me in the right direction? Can you show me how to run the ls-dir CWL workflow on a worker node? Thanks, Ngoc
I wanted to point you at https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#awscwl but I think that that part of the docs is actually wrong. It shows you doing something like this:
toil-cwl-runner --provisioner aws --jobStore aws:us-west-2a:any-name /tmp/example.cwl /tmp/example-job.yaml
But with just `--provisioner aws`, I think it is going to try to use the single machine batch system with the AWS provisioner, which will never be called upon to provision anything, and the workflow will just run on the leader node.
You probably want the instructions here instead: https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#running-a-workflow-with-autoscaling-cactus
It gives a command like:
cactus --provisioner <aws, gce, azure> --nodeType <type> --maxNodes 2 --minNodes 0 --retry 10 --batchSystem mesos --disableCaching --logDebug --logFile /logFile_pestis3 --configFile /root/cact_ex/blockTrim3.xml <aws, google, azure>:<zone>:cactus-pestis /root/cact_ex/pestis-short-aws-seqFile.txt /root/cact_ex/pestis_output3.hal
You need `--batchSystem mesos` to point Toil at the Mesos installation that actually distributes jobs over the cluster, and you need `--provisioner` and `--nodeTypes` (which the example has as `--nodeType` for some reason) to actually have any nodes created. I've also always passed `--mesosMaster=$(hostname -i):5050` to point explicitly at Mesos. And it's always good to set min and max nodes.
I think you want to blend all this together for something like, for your workflow:
toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes i3.xlarge --minNodes 0 --maxNodes 10 --jobStore aws:us-west-2a:lsdirstore1 ls-dir.cwl ls-dir.yml
That will make the cluster scale up from 0 to a max of 10 nodes as needed, using i3.xlarge nodes, and keeping job state in an AWS job store in us-west-2a named "lsdirstore1".
I'm not sure whether the CWL interpreter is smart enough to pack up your directory you are trying to list and ship it to the node that is going to try to list it. You might end up looking at that node's local filesystem instead. You might need to move your input directory from file:// to s3://.
This really needs to be added to the docs. We don't show CWL with autoscaling and Mesos right now as far as I can tell.
Hi Adam, This is the error I got when running your suggested command, can you help? Thanks, Ngoc
(venv) root@ip-172-31-19-65:/data# toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes t2.medium --minNodes 0 --maxNodes 10 --jobStore aws:us-east-1:lsdirstore1 ls-dir.cwl ls-dir.yml
INFO:cwltool:Resolved 'ls-dir.cwl' to 'file:///data/ls-dir.cwl'
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (4137283584).
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (35568545792).
Traceback (most recent call last):
  File "/usr/local/bin/toil-cwl-runner", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 1274, in main
    outobj = toil.start(wf1)
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 770, in start
    self._setProvisioner()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 815, in _setProvisioner
    sseKey=self.config.sseKey)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/__init__.py", line 33, in clusterFactory
    return AWSProvisioner(clusterName, zone, nodeStorage, sseKey)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 105, in __init__
    self._readClusterSettings()
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 116, in _readClusterSettings
    self.clusterName = str(instance.tags["Name"])
KeyError: 'Name'
(venv) root@ip-172-31-19-65:/data#
Hm. That looks like your AWS instance is missing a Name tag. In the AWS console, does it in fact have a Name tag?
The Name tag should have been set by toil launch-cluster to the cluster name you gave it (see https://github.com/DataBiosphere/toil/blob/d62429fa29f71df755d3310c802717c5d89b203e/src/toil/provisioners/aws/awsProvisioner.py#L162). If it has gotten lost somehow, you could probably recreate it through the console, or just tear down and redeploy the cluster.
Yes, my AWS instance has a Name tag.
I'm not able to reproduce this issue. Here's what I did:
pip uninstall toil
pip install --user toil[aws]
TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.19.0 toil launch-cluster amntest2 --keyPairName anovak@kolossus --leaderNodeType t2.micro --zone us-west-2a
toil ssh-cluster -z us-west-2a amntest2
cd /tmp
mkdir data
cd data
wget https://github.com/DataBiosphere/toil/files/3019619/ls-dir.yml.txt
wget https://github.com/DataBiosphere/toil/files/3019617/ls-dir.cwl.txt
mkdir sequence
touch sequence/lol
echo ">seq" > example.fasta
echo "GATTACA" >> example.fasta
mv ls-dir.yml.txt ls-dir.yml
mv ls-dir.cwl.txt ls-dir.cwl
toil-cwl-runner --provisioner aws --batchSystem mesos --mesosMaster=$(hostname -i):5050 --nodeTypes t2.medium --minNodes 0 --maxNodes 10 --jobStore aws:us-west-2:lsdirstore2 ls-dir.cwl ls-dir.yml
And here's the Toil output:
INFO:cwltool:Resolved 'ls-dir.cwl' to 'file:///tmp/data/ls-dir.cwl'
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (1033838592).
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (45382463488).
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:9jTEr8mxV54i7ChScoFipRmczJrnifx8tTQ6DB8Udi4 root@ip-172-31-23-143.us-west-2.compute.internal
The key's randomart image is:
+---[RSA 2048]----+
| + . |
| * .o . . |
| o =..+ o |
|o * .E.. . . |
|.= o. ..S + . . |
|..o...o= = * o . |
| o ++= o O o o |
| o.*.. o o . |
| ..... . |
+----[SHA256]-----+
Agent pid 86
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
INFO:toil:Running Toil version 3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6.
INFO:toil.leader:Issued job 'file:///tmp/data/ls-dir.cwl' 63f92b9b-c184-4fc1-8572-7d212e1203eb with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
INFO:toil.provisioners.clusterScaler:Adding 1 non-preemptable nodes to get to desired cluster size of 1.
INFO:toil:Using default user-defined custom docker init command of as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
INFO:toil:Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
INFO:toil:Using default docker name of toil as TOIL_DOCKER_NAME is not set.
INFO:toil:Overriding docker appliance of quay.io/ucsc_cgl/toil:3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6 with quay.io/ucsc_cgl/toil:3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6 from TOIL_APPLIANCE_SELF.
INFO:toil.lib.ec2:Creating t2.medium instance(s) ...
INFO:toil.leader:Job ended successfully: 'file:///tmp/data/ls-dir.cwl' 63f92b9b-c184-4fc1-8572-7d212e1203eb
INFO:toil.provisioners.clusterScaler:Removing 1 non-preemptable nodes to get to desired cluster size of 0.
INFO:toil.provisioners.aws.awsProvisioner:Terminating instance(s): [u'i-0aab84ddf907fa0f9']
INFO:toil.provisioners.aws.awsProvisioner:Instance(s) terminated.
INFO:toil.leader:Finished toil run successfully.
{
"example_out": {
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"basename": "ls-dir-output.txt",
"nameext": ".txt",
"nameroot": "ls-dir-output",
"location": "file:///tmp/data/ls-dir-output.txt",
"class": "File",
"size": 0
}
INFO:toil.common:Successfully deleted the job store: <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb98d582dd0>
}
And the contents of the output file:
root@ip-172-31-23-143:/tmp/data# cat ls-dir-output.txt
root@ip-172-31-23-143:/tmp/data#
So I got the workflow to run through successfully when running on the actual cluster nodes. The actual directory was still empty from the perspective of the workflow, but I suspect that's because I'm trying to use a filesystem directory on a cluster that doesn't have a shared filesystem across the nodes; if I want the nodes to be able to see things consistently, I need to point the workflow at S3.
When you set up the cluster, did you put it in a zone in us-east-1, where you put your job store? I tried running with the job store in a different region than my cluster, and I got a bunch of 409 Conflict errors, but maybe that's not the only failure mode.
@adamnovak I can point a file at an S3 object, but how do I point a directory at S3? S3 doesn't have a directory concept.
How do I manage the jobStore? I have been testing with different parameters, and Toil said the jobStore already exists, but I can't delete it using toil clean.
You should be able to delete the job store with toil clean; it's something like `toil clean aws:us-west-2:lsdirstore2`. If that doesn't work, can you show what it does?
S3 treats common prefixes of objects with slashes in the name as pseudo-directories. I'm not sure if we implemented any support for it in Toil's CWL runner (@DailyDreaming might know), but if you store your files as `s3://bucket/path/to/directory/file.whatever`, then, treating `s3://bucket/path/to/directory/` as a "directory", you can list all the files in the "directory" and add and remove files through the S3 API. There's just no concept of creating or destroying directories, or of having empty directories, and you can also have data at the name corresponding to the "directory" itself.
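To illustrate the pseudo-directory idea (a self-contained sketch; the object keys are made up, and this just mimics locally what a delimiter-based S3 listing computes, without calling S3):

```python
def list_pseudo_dir(keys, prefix, delimiter="/"):
    """Mimic an S3 prefix+delimiter listing: return the immediate
    children of `prefix` among the given object keys."""
    children = set()
    for key in keys:
        if not key.startswith(prefix) or key == prefix:
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Re-append the delimiter so sub-"directories" are marked as such.
        children.add(head + sep)
    return sorted(children)

# Objects in an imaginary bucket: there are no real directories here,
# only keys that happen to share slash-separated prefixes.
keys = [
    "path/to/directory/a.tsv",
    "path/to/directory/b.tsv",
    "path/to/directory/sub/c.tsv",
    "path/to/other.txt",
]
print(list_pseudo_dir(keys, "path/to/directory/"))
# -> ['a.tsv', 'b.tsv', 'sub/']
```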
Yeah, Toil's CWL runner doesn't recognize s3://bucket/path/to/directory/ as a directory; it fails with the error "No such file or directory".
Traceback (most recent call last):
File "/home/ngoc/venv-toil-all/bin/toil-cwl-runner", line 10, in
That should probably be another issue, to track developing that functionality. This one I think can keep representing the Name problem you described in https://github.com/DataBiosphere/toil/issues/2574#issuecomment-484548630.
@arostamianfar, you run Toil CWL on AWS all the time; have you ever had a problem with the Directory type not being implemented? Or do you just not use it?
@adamnovak I haven't used the `Directory` type yet, but it looks like native S3 directory support is not implemented (native S3 file support was fixed in #2234). As a workaround, I currently use the `mesosphere/aws-cli` docker image and have a `cp_to_s3.cwl` task that copies files to/from S3, so I can control where/how to copy them.
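A rough sketch of what such a copy task might look like (hypothetical; the actual cp_to_s3.cwl isn't shown in this thread, the input ids here are invented, and it assumes AWS credentials are available inside the container):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: mesosphere/aws-cli
# The aws CLI is on PATH in this image.
baseCommand: [aws, s3, cp]
inputs:
  src:
    type: File
    inputBinding: {position: 1}
  dest_url:              # e.g. "s3://my-bucket/path/", a placeholder
    type: string
    inputBinding: {position: 2}
outputs: []
```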
I'm going to close this; I tried to reproduce the lack of a Name tag and couldn't, and then we moved on to talking about CWL stuff that I think is also now done.
Hello, I have a CWL that works fine with toil-cwl-runner on my local machine, but it wouldn't run in AWS using quay.io/ucsc_cgl/toil:3.18.0. This CWL needs a directory; it mounts fine on my local machine, but that directory is empty when running in AWS.