DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Allow workflows to work with data from S3 without actually copying it into the AWS JobStore #4147

Closed boyangzhao closed 9 months ago

boyangzhao commented 2 years ago

Hi, I keep running into space issues when trying to run jobs on AWS clusters using Toil. I don't know how much of this is attributable to the instance selection, and how much I can change with Toil arguments.

I always see that /tmp is full (2GB, and it's always at 100% usage), but I cannot change its size with umount/mount; I'm using the default AMI image. I've tried using the --workDir argument, but that folder was always empty during the runs while /tmp stayed full. And once /tmp is full, I can no longer ssh to the node (toil ssh-cluster), because the error there says there is no space left on /tmp.

In one example, the job finishes but ends with the error below (unclear if this is because /tmp is full). I use --destBucket, so I do see the files being written to the S3 bucket, but why the "no space left" error on what is presumably the leader node?

[2022-06-15T16:45:57+0000] [scaler ] [I] [toil.provisioners.aws.awsProvisioner] Instance(s) terminated.
[2022-06-15T16:45:59+0000] [MainThread] [I] [toil.leader] Finished toil run successfully.
[2022-06-15T16:46:04+0000] [MainThread] [I] [toil.common] Successfully deleted the job store: <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f13e22f9880>
write /dev/stdout: no space left on device

Other times, I'd get the following space error,

Traceback (most recent call last):
File "/usr/local/bin/toil-cwl-runner", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/toil/cwl/cwltoil.py", line 3657, in main
outobj = toil.start(wf1)
File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 945, in start
self._batchSystem = self.createBatchSystem(self.config)
File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1093, in createBatchSystem
return batch_system(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/toil/batchSystems/mesos/batchSystem.py", line 81, in init
super().init(config, maxCores, maxMemory, maxDisk)
File "/usr/local/lib/python3.8/dist-packages/toil/batchSystems/local_support.py", line 34, in init
self.localBatch: SingleMachineBatchSystem = SingleMachineBatchSystem(
File "/usr/local/lib/python3.8/dist-packages/toil/batchSystems/singleMachine.py", line 96, in init
workdir = Toil.getLocalWorkflowDir(config.workflowID, config.workDir) # config.workDir may be None; this sets a real directory
File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1331, in getLocalWorkflowDir
os.mkdir(workflowDir)
OSError: [Errno 28] No space left on device: '/var/lib/toil/101edeab8ad35ecd8ebae60c759710d7'

Issue is synchronized with this Jira Story. Issue Number: TOIL-1189

adamnovak commented 2 years ago

It definitely does look like your leader node does not have enough disk.

What do you mean by "I always see the /tmp is full"? Surely it's empty when you freshly launch the cluster, right? And then it fills up when you run your workflow?

The first error looks like what might happen if your disk fills up right at the end of a workflow, when it is uploading output files to your destination bucket, from the leader.

The second error looks like what might happen if your root filesystem is full and you try to launch a workflow: it tries to create a working directory for jobs that run on the leader, and it can't, because the disk is already full. When you got that, was the disk not full before you launched the workflow (and thus something earlier in the workflow startup process is filling it up)? Because I think input file import happens before that step, so it could be that importing the input files for the workflow is filling up your disk.

You can try using the TMPDIR environment variable to move where Toil keeps its temporary files, but that might not help if the leader node doesn't actually have a big enough disk for them to go on.
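
For example, something along these lines on the leader before starting the workflow (a sketch; the directory is just an illustration, and it needs to live on a volume that actually has room):

mkdir -p /var/lib/toil/tmp       # somewhere on the root/EBS volume rather than the small /tmp
export TMPDIR=/var/lib/toil/tmp  # Toil and cwltool put their temporary files under TMPDIR
toil-cwl-runner ... workflow.cwl job.yaml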

To get more space on the leader node, you can pass toil launch-cluster the --leaderStorage option, to tell it to assign the leader a larger-than-default EBS disk. You could also pick a leader node type that comes with its own local storage. For example, if you use an m5d.xlarge, the cluster will be set up with Toil's working directories on the 150G SSD that those instances come with. I don't think /tmp will be there too, if /tmp is specifically what's filling up, but you could use TMPDIR to move it over to one of the directories (like /var/lib/toil) that is.
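
A sketch of that launch command (cluster name, zone, key pair, and sizes are placeholders; either the bigger EBS volume or the d leader type on its own may be enough):

# Give the leader a 200 GB EBS volume, and/or pick a leader type that has its own SSD
toil launch-cluster my-cluster --provisioner aws --zone us-west-2a --keyPairName my-aws-key --leaderNodeType m5d.xlarge --leaderStorage 200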

The real question is, why is the disk filling up on the leader? I think that what is happening is that our CWLTool-based file manipulations are causing local copies of the inputs and outputs to be created, when one might not expect to need the space for them. That's the good possible problem, because it should be easier to fix, and the workaround is to just make sure that your leader has room on disk for all the workflow inputs and outputs.

The bad possible problem is that the CWL intermediate jobs that execute on the leader might be getting file and directory objects staged to disk on the leader. @boyangzhao do you only see the errors you posted, or do workflows also sometimes run out of space in the middle?

To import the inputs, especially if there are Directory objects, we make a bunch of calls into CWLTool's file management functions, and they make a bunch of calls back into ToilFsAccess, and ToilFsAccess still falls back to making copies of things on disk to fulfill some requests. It also doesn't clean up after itself, I think partly because we initially designed it to work only in jobs, where cleanup would be handled automatically at the end of the job, and partly because the CWLTool StdFsAccess API doesn't have a hook for being notified when file paths you handed out are no longer needed. (This could also be why /tmp is filling up and not --workDir, because the logic to put temporary files in --workDir belongs to jobs and won't work on a ToilFsAccess in the leader itself.)

On the export side, I don't think we intend for files to be downloaded to disk before being uploaded to the destination bucket, but it might be happening anyway?

@mr-c I think we can address some of this by making ToilFsAccess a little smarter about not using _abs, and maybe getting all the ToilPathMappers and all the ToilFsAccesses to share a global cache with some kind of cleanup. But I think we might also need some help from the CWLTool side, because the APIs that StdFsAccess and PathMapper implement, and the guarantees they require, are not really fully documented as far as I can tell.

When can the path returned by StdFsAccess.realpath() safely be deleted by the StdFsAccess implementation? Does it ever need to outlive the StdFsAccess object that it was retrieved from?

boyangzhao commented 2 years ago

What do you mean by "I always see the /tmp is full"? Surely it's empty when you freshly launch the cluster, right? And then it fills up when you run your workflow?

Yes, /tmp is reset when I freshly launch a cluster or when the job successfully completes. It fills up quickly. For example, with a t2.medium (which of course is pretty small; I've also used larger instance types, with similar results), /tmp is capped at a small size even though the disk has plenty more space (the example below has a 2GB /tmp but a 50GB disk).

Filesystem      Size  Used Avail Use% Mounted on
overlay          47G  2.3G   42G   6% /
tmpfs            64M     0   64M   0% /dev
shm              64M     0   64M   0% /dev/shm
/dev/xvda9       47G  2.3G   42G   6% /opt
tmpfs           2.0G  4.0K  2.0G   1% /tmp
overlay          47G  2.3G   42G   6% /var/lib/docker/overlay2/2ea590ed769248f2f90e8db29b69114d0abc220b040d12b024b176d6ddba2fca/merged
tmpfs           786M  8.7M  777M   2% /run/docker.sock
tmpfs           2.0G     0  2.0G   0% /proc/acpi
tmpfs           2.0G     0  2.0G   0% /proc/scsi
tmpfs           2.0G     0  2.0G   0% /sys/firmware

do you only see the errors you posted, or do workflows also sometimes run out of space in the middle?

I have seen other errors, like the one below. It does mention "Failed job accessed files", but I haven't gotten around to troubleshooting this one yet, so I don't know whether it is related to storage or to something else.

...
Traceback (most recent call last):
File "/usr/local/bin/OptiType/OptiTypePipeline.py", line 310, in <module>
pos, read_details = ht.pysam_to_hdf(bam_paths[0])
File "/usr/local/bin/OptiType/hlatyper.py", line 239, in pysam_to_hdf
pos_df = pd.DataFrame.from_items(iter(hits.items())).T
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1261, in from_items
keys, values = lzip(*items)
ValueError: need more than 0 values to unpack
[2022-06-13T20:27:44+0000] [MainThread] [I] [cwltool] [job optitype] Max memory used: 317MiB
[2022-06-13T20:27:44+0000] [MainThread] [W] [cwltool] [job optitype] exited with status: 1
[2022-06-13T20:27:44+0000] [MainThread] [W] [cwltool] [job optitype] completed permanentFail
[2022-06-13T20:27:44+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2022-06-13T20:27:44+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file '2b02c105-2bd0-478e-845c-ed9c808c931d' to path '/var/lib/toil/a40bca046fe858d88b3df36c0c855e26/4ebd/3443/tmpqjduxejh.tmp'
[2022-06-13T20:27:44+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'abd9bbe6-8f1c-4305-bef7-c399785c5e6e' to path '/var/lib/toil/a40bca046fe858d88b3df36c0c855e26/4ebd/3443/tmpv8q7x3_j.tmp'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/toil/worker.py", line 407, in workerScript
job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/usr/local/lib/python3.8/dist-packages/toil/job.py", line 2406, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/usr/local/lib/python3.8/dist-packages/toil/job.py", line 2324, in _run
return self.run(fileStore)
File "/usr/local/lib/python3.8/dist-packages/toil/cwl/cwltoil.py", line 2220, in run
raise cwltool.errors.WorkflowException(status)
cwltool.errors.WorkflowException: permanentFail
[2022-06-13T20:27:44+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host ip-172-31-37-180.ec2.internal
<=========

I will try to use TMPDIR. In the end I want to minimize the storage usage of the leader node (the workers have plenty of capacity, of course), which hopefully means the leader's working directory is cleared after the run (or after each worker finishes?), with a minimal footprint for handling/staging inputs and outputs. Say I have a multi-component workflow whose input files total 300GB (on AWS S3), but subworkflow A only needs a 40GB subset of the inputs, other subworkflows need 30GB, and the final outputs of the workflow are 50GB. Does that mean the leader node needs 350GB of storage (with TMPDIR defined properly), or 50GB?

adamnovak commented 2 years ago

That error you gave looks like an actual error in your workflow, not in Toil itself:

File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1261, in from_items
keys, values = lzip(*items)
ValueError: need more than 0 values to unpack

It looks like your /tmp is actually in memory, and not on disk, which is why so little space is available for it. It's being provided by a tmpfs in-memory filesystem.

I don't think we want Toil clusters to be set up like that, so that's probably its own bug. We want an on-disk /tmp.

The other bug here is that the leader is requiring nontrivial local space to import the inputs and export the outputs. In both cases the data is going from S3 to S3, so there's no good reason why we shouldn't be able to do an S3-side copy, or if that can't work, stream to the leader and back to S3 without hitting the disk at all.

We shouldn't need to have even enough storage for the largest file, but I'm unsure of the current behavior. I would suspect that enough storage for all of the inputs and outputs would be enough, but if we're managing to e.g. download directories alongside their subdirectories, it could actually need more than that to work as is.

boyangzhao commented 2 years ago

With TMPDIR, I can now redirect the temporary files to disk. But indeed, the temp space fills up very quickly (one job used 31GB, apparently during the initial upload phase, before the analysis started). At least with TMPDIR this resolves the "no space left on device" issue (for the smaller subworkflows I've tested), but it looks inefficient (since everything is on S3), and the temp files also don't appear to be deleted after job completion.

The other space constraint was on disk. I also observe that the program stalls because it is unable to find the right node type, even though my node list has t2.medium, t2.large, t2.xlarge, c4.large, c4.xlarge, c4.2xlarge, r5.large, r5.xlarge, r5.2xlarge, r5.4xlarge, which should have enough memory and cores; it may be getting stuck on the disk requirements.

Couldn't fit job with requirements Shape(wallTime=55.73192800794329, memory=33554432000, cores=1, disk=105931341824, preemptable=True) into any nodes in the nodeTypes list.

I see there are options like --nodeStorage and/or --defaultDisk we can define, but is the issue here one of nodeStorage or of disk? What is "disk" here (a limit on the Docker container for the Toil application? the default is 2GB)? Also, this seems to apply to all worker nodes (rather than to a specific CWL tool or subworkflow). CWL allows definitions of tmpdirMin or outdirMin; do these directly affect this disk requirement?

adamnovak commented 2 years ago

"Storage" and "disk" mean the same thing in Toil and are used interchangeably. --nodeStorage sets how big the EBS volumes for the worker nodes should be, in gigabytes, with a default of 50. --defaultDisk sets how much disk Toil should reserve for a job that doesn't specify how much disk space it is going to need; it defaults to 2 GB. If you are using CWL's disk requirements consistently in your workflow to specify how much space each step needs, this shouldn't matter much. I think we combine all the CWL disk requirement values from each job to determine how much disk space that job is going to use, if any are set; Toil doesn't distinguish between CWL tmpdir and outdir space.

So it looks like your cluster has all those node types, but since none of those nodes have their own local storage, they are all being given 50 GB EBS volumes for storage, and they can only actually offer a bit less than that, because of disk space used by the OS.

Then you are trying to run a job that uses 100 GB of disk, and it doesn't fit.

To solve that, you could do something like --nodeStorage 120 to make all the nodes' EBS volumes 120 GB, so that at least one such job could fit on each node. You could go bigger than that, if you want more space per core or per GB of memory.

As you note, we only have one setting per node type, which means that setting it high enough for the big instances might mean you are giving a lot more disk than you need for the small instances.

So another option is to use the d versions of the AWS instance types, which include their own node-local storage as part of the instance, which Toil knows how to use. For example, r5d.4xlarge has 2 SSDs of 300 gigs each, for a total of ~600GB of space, with a marginal cost over r5.4xlarge that is probably less than you would pay for a 600GB EBS volume. There aren't d versions of the c4 instances, but there are for c5: you can use c5d.2xlarge, which has 8 cores, 16 GB of memory, and ~200 GB of local disk included.

You can use both the d and non-d versions of instances in your instance type list, and the autoscaling logic will figure out when it needs instances with extra storage and when it doesn't, and use them appropriately.
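
Concretely, that could look something like this on the toil-cwl-runner command line (region, job store, bucket, and instance types are placeholders):

# Mix EBS-only and "d" (local-SSD) instance types; the autoscaler picks whichever fits each job.
# --nodeStorage sets the EBS size in GB for workers without local SSDs; --defaultDisk covers steps with no CWL disk requirement.
toil-cwl-runner --provisioner aws --batchSystem mesos --jobStore aws:us-west-2:myjobstore --destBucket s3://my-output-bucket/ --nodeTypes r5.4xlarge,r5d.4xlarge,c5d.2xlarge --nodeStorage 120 --defaultDisk 10G workflow.cwl job.yaml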

As for the inefficiency when uploading from S3 to S3: we know about it, and we're planning to address it at some point.

boyangzhao commented 2 years ago

Thanks. Using a bigger EBS volume and/or the d instance types resolved those issues. As I'm unit-testing my more involved workflows, the runs seem to stall. For example, I see:

...
[2022-06-28T17:16:10+0000] [scaler ] [I] [toil.lib.ec2] Creating t2.large instance(s) ...
[2022-06-28T17:17:07+0000] [scaler ] [I] [toil.provisioners.clusterScaler] Removing 1 non-preemptable nodes to get to desired cluster size of 0.
[2022-06-28T17:17:07+0000] [scaler ] [I] [toil.provisioners.aws.awsProvisioner] Terminating instance(s): ['i-0136b2585bf5fbe1c']
[2022-06-28T17:17:07+0000] [scaler ] [I] [toil.provisioners.aws.awsProvisioner] Instance(s) terminated.
[2022-06-28T17:18:08+0000] [scaler ] [I] [toil.provisioners.aws.awsProvisioner] Terminating instance(s): ['i-0d90f72889c69e56c']
[2022-06-28T17:18:08+0000] [scaler ] [I] [toil.provisioners.aws.awsProvisioner] Instance(s) terminated.
[2022-06-28T18:03:08+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2022-06-28T19:03:09+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2022-06-28T20:03:10+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run

or

...
[2022-06-30T06:04:58+0000] [scaler ] [I] [toil.lib.ec2] Creating r5.xlarge instance(s) ...
[2022-06-30T06:56:59+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 30 jobs are issued and waiting to run
[2022-06-30T07:57:00+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 30 jobs are issued and waiting to run
...

It stays like this forever, with no activity happening, and the job has to be manually terminated.

boyangzhao commented 2 years ago

@adamnovak Is there a way to see what Toil is currently doing? I've managed to switch over to spot instances (something else I was working on), but it still keeps getting stuck for hours with jobs issued and waiting to run. I've tried to pull up the Grafana dashboard, but it wasn't quite clear how to view it in a browser (separate question, posted on #4083). At least from the monitoring view of the node in the AWS EC2 console, there isn't much activity going on. I initially thought it might be trying to load all the files from a Directory onto the node and taking a long time, but I observe the same issue with workflows that have no large input files/directories. Are there any other logs available to see what Toil is currently doing? Thanks!

...
[2022-07-07T17:07:07+0000] [scaler ] [I] [toil.lib.ec2] 1 spot requests(s) are open (1 of which are pending evaluation and 0 are pending fulfillment), 0 are active and 0 are in another state.
[2022-07-07T17:07:07+0000] [scaler ] [I] [toil.lib.ec2] Sleeping for 10s
[2022-07-07T17:07:18+0000] [scaler ] [I] [toil.lib.ec2] 0 spot requests(s) are open (0 of which are pending evaluation and 0 are pending fulfillment), 1 are active and 0 are in another state.
[2022-07-07T17:49:07+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2022-07-07T18:49:07+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2022-07-07T19:49:08+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2022-07-07T20:49:09+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
adamnovak commented 2 years ago

Toil doesn't have its own dashboard for looking at jobs that are pending vs. issued vs. running. When jobs are "issued" that means that they are created in the backing scheduler (Mesos, Kubernetes, Slurm, etc.), and Toil is waiting for the backing scheduler to start running them. If they never go from issued to running the place to look is whatever you are using as your batch system.

It looks like you are using Mesos, so you would want to look at Mesos's dashboard to see what is going on. Mesos is a little different because it does things backward; you don't hand jobs to Mesos, and instead Mesos hands resource offers to you. So the Mesos dashboard won't show that a job is waiting, but it should show the offers that are available to run jobs in, and if there aren't any that are big enough to run your job, you could see that.

(To find out how big the waiting job is, you'd have to read the Toil log and look for whatever job it said was issued but didn't say actually ran, and then look for where the size of that job was logged.)

To actually get at the Mesos dashboard, you have to browse to port 5050 on the leader instance. Since the Mesos web interface now has opinions about what its hostname should be, and since some of its functionality involves fetching stuff from workers in the browser, it's easiest to do this with a proxy. So you could run toil ssh-cluster your-cluster --sshOption="-D8080" to set up a SOCKS proxy on your local port 8080, then point your browser's proxy settings at localhost port 8080 (and probably check the box for proxying DNS), and then go to http://localhost:5050 to view the dashboard.

You can also try toil ssh-cluster your-cluster --sshOption="-L5050:localhost:5050" to forward just port 5050, so you don't need to change proxy settings. That used to work OK with older versions of the Mesos dashboard, but I noticed it didn't work very well anymore after our most recent Mesos upgrade; the dashboard would load but would then complain that it couldn't connect back to Mesos.
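
Putting the SOCKS approach together, it looks roughly like this (cluster name and zone are placeholders):

toil ssh-cluster --zone us-east-1a my-cluster --sshOption="-D8080"   # opens a SOCKS proxy on local port 8080 while the session is open
# Then set the browser's SOCKS proxy to localhost:8080 (with remote DNS enabled) and open http://localhost:5050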

Anyway, in the Mesos dashboard you can see what worker nodes are up and what resources they are offering. If you have a job that isn't scheduling, probably not enough nodes are up and offering enough resources to actually run it. It looks like what might be happening is that we expect a spot node to be coming up with a certain amount of resources, and it either isn't coming up, or it isn't offering all the resources we thought it would when it does come up.

boyangzhao commented 2 years ago

Thanks for the info - I can see the status on the Mesos dashboard now (I just ran toil ssh-cluster --zone us-east-1a <cluster-name>, and localhost:5050 works, except that a "Failed to connect to localhost:5050!" pop-up appears every once in a while; the content of the dashboard still seems to get updated).

The issue seems to be that, within the workflows, other jobs are issued and finish fine, but the job requesting 62.5G of memory gets issued and then stuck, and never goes into the running state. From the log, this is the job in question:

[2022-07-09T16:15:46+0000] [MainThread] [I] [toil.leader] Issued job 'CWLJob' wf_append.generate_fasta 13d0f8e5-d5f9-4a95-b289-d1a80d7e34c3 v1 with job batch system ID: 10 and cores: 1, disk: 2.0 Gi, and memory: 62.5 Gi

But there is one available agent with 61.0 GB of memory (this was an r5.2xlarge node). [Screenshot: Mesos agents view]

The offers are empty. [Screenshot: Mesos offers view]

With this, the job is forever stuck at "0 jobs are running, 1 jobs are issued and waiting to run". Why would it get stuck like this? Is this the batch system's fault, or does Toil need to tell it to create a new node with more resources (in the node type list I also have r5.4xlarge, so it should at least be possible to create a higher-memory node, but it somehow didn't)?

I see this for other jobs too. For example, when [2022-07-09T19:40:27+0000] [MainThread] [I] [toil.leader] Issued job 'CWLJob' mutect.mutect2 164d96d5-2d32-47c8-8f0c-3b3fc52794ad v1 with job batch system ID: 12 and cores: 1, disk: 98.7 Gi, and memory: 31.2 Gi was issued, the node/agent available was an r5.xlarge, with Mesos reporting 0 B / 29.9 GB (Mem (Allocated / Total)), and this job also then got stuck forever.

boyangzhao commented 2 years ago

@adamnovak Do you happen to know the priority of the epic/issue for the S3-to-S3 upload inefficiency (weeks, months, or years)? Right now it seems impossible to run any workflow where a Directory is an input (with deep and large child directories - but these are all expected inputs in a standard workflow).

So far I have tried to run some of these workflows, but it seems they will never finish because of the upload issue. With deep Directory inputs (like the offline VEP), Toil has been stuck doing ParseResult listings for over a day and a half and is still going; it will never finish.

...
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/17000001-18000000.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/17000001-18000000_reg.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/18000001-19000000.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/18000001-19000000_reg.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/19000001-20000000.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/19000001-20000000_reg.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/20000001-21000000.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/20000001-21000000_reg.gz', params='', query='', fragment='') items: []
[2022-07-20T22:47:34+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='test', path='/ref/ensembl/homo_sapiens/95_GRCh38/21/2000001-3000000.gz', params='', query='', fragment='') items: []
...

I have also tried using AWS Storage Gateway to mount the S3 bucket as an NFS share on the node. The mounting works, but from the log, Toil again tries to upload all the S3 data, which will take days.

...
Uploading part 33 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 34 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 35 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 36 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 37 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 38 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 39 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 40 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 41 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
Uploading part 42 of 52428800 bytes to mDpJxbosJlDtpps0zTavHc7L.zJmSukKpQxqjkIUXCp8DbDN_RHSOJYgY6ua7L5DFZoFrCZWwH8ZpeFzWszjBvb_.wJ_2TNqs_jSGIoNJ68.2sfDnNPyhfw85IXiI.hig8sI20LLe1bjAyRccYx8vg--
...

Thanks!

adamnovak commented 2 years ago

I think your stuck autoscaling problem is you hitting https://github.com/DataBiosphere/toil/issues/2103, similar to the disk space issue.

The autoscaler thinks the node has enough memory to run the job (maybe it's a 64 Gi machine and a 62.5 Gi job), but Mesos only actually gets 61 gigs to offer, with the rest being reserved or being used by the OS or something. So the scaler won't scale up, but the job also won't run.
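
To put rough numbers on it (assuming the Mesos UI reports decimal gigabytes): the job asks for 62.5 GiB ≈ 67.1 GB, while the agent only offers 61.0 GB ≈ 56.8 GiB, so the job can never be placed on that agent even though the instance nominally has 64 GiB.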

If you swap over to a Kubernetes-based cluster (-t kubernetes to toil launch-cluster), then we use Kubernetes' scaler, which is a bit more robust I think. Though that can result in new inefficiencies when the workflow goes to run containers; I think we're defaulting to Singularity containerization for CWL workflows then, which works on all Kubernetes clusters, even though we could use sibling Docker containers like we do on Mesos on the clusters we ourselves deploy.
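
The launch command would look roughly like this (names, zone, instance types, and the worker range are placeholders; check toil launch-cluster --help for your Toil version, since the worker/autoscaling flags for Kubernetes clusters have changed over time):

toil launch-cluster -t kubernetes my-cluster --provisioner aws --zone us-west-2a --keyPairName my-aws-key --leaderNodeType t2.medium --nodeTypes r5d.2xlarge --workers 1-10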

For the S3 to S3 copies:

Looking at the code for importing files from S3 to an S3 job store, we do do server-side S3 copies when we can (I can tell, because we had to handle the case where AWS refuses to do them). I also looked again at the CWL file access code, and it wasn't obviously doing the unnecessary _abs() calls and ensuing downloads that I thought it would be. So S3-to-S3 copies really ought to be done efficiently, and the listing is all supposed to work on the S3 URLs directly, so I'm not sure why it would be slow or why it would fill up /tmp.

All that ParseResult logging is coming from our listing function, so you are seeing it recursively listing your directory. Those messages are all timestamped the same; is it really spending a day and a half only listing the directory, and not actually copying anything? Maybe we have managed to repeatedly re-list the same directories over and over, re-starting from each internal node. It would be great if you could share the whole log so I could see if it looks like that is what it is doing.

If you mount the inputs via NFS and then still use an S3 job store, Toil is first going to need to upload the whole directory from the NFS share to S3, which is not going to work if it is prohibitively large. And an NFS job store won't work reliably unless your NFS share is tuned for full POSIX consistency.

I think fixing needless inefficiency in the copy operations (like maybe dumping a bunch of files to disk) is relatively high priority for us, but not as high as it could be (or as easy to do), because we at UCSC don't have any workflows that hit this. Would you be able to share a workflow we can run to reproduce the problem, with a data set that's big enough?

If the data set is actually so big that S3 server-side copying it from one bucket to another is going to take a prohibitively long time, then I think somehow dealing with that is probably lower-priority for us. We've been kicking around https://github.com/DataBiosphere/toil/issues/4122 and https://github.com/DataBiosphere/toil/issues/1917 as potential ways to let CWL workflows be more efficient by maybe just passing around URLs to the input data in place, all the way down to the final tools, but that's a lot more speculative.

boyangzhao commented 1 year ago

@adamnovak Following up on the S3-bucket-to-S3-job-store copies: is there an option yet to optimize this? I have been able to optimize a lot of the rest, but one thing that keeps it inefficient is that at the start of each job, the input files stored in another S3 bucket are copied to the S3 bucket of the created job store. This means I sometimes have jobs that would take only 1 hour to finish but first need 3 hours of copy operations - ending up at quadruple the time really needed for the job. I was reading about the --linkImports option, but it didn't seem to matter; with or without it, Toil still does the initial copy.

adamnovak commented 1 year ago

@boyangzhao We haven't yet come up with or implemented a way to optimize this, unfortunately. --linkImports only does anything for the local file job store; S3 doesn't have a concept of a symbolic link as far as I know.
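
In other words, it only applies in a setup like this one (paths are illustrative), where the job store is on a local or shared filesystem:

toil-cwl-runner --jobStore file:/scratch/jobstore --linkImports workflow.cwl job.yaml   # symlinks inputs into the file job store instead of copying them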

What we probably want to do is just pass around the initial S3 URL as if it were/as part of a Toil file ID, and handle accessing the bucket it is coming from instead of Toil's scratch bucket/SimpleDB storage stack when reading the file. This would require some engineering in the AWS job store, which @DailyDreaming is currently working on ripping SimpleDB out of, but it might be able to be done in parallel.

boyangzhao commented 1 year ago

@adamnovak Thanks. Taking a detailed look at the logging, I wanted to confirm whether the bottleneck is indeed the S3-to-S3 transfer. All my jobs finish, but they first take about 3 hours doing I/O and transfers before the batch system kicks in and the job runs. I also observe that on my leader node, disk usage goes up by several hundred gigabytes in those first 3 hours, filling up the tmp location (I have a 1TB EBS volume, so running out of disk is no longer a problem).

The first two hours are spent doing a bunch of GetObjectTask and IOWriteTask operations, and another hour doing a bunch of ToilFsAccess, DownloadSubmissionTask, ImmediatelyWriteIOGetObjectTask, CopySubmissionTask, and CreateMultipartUploadTask operations. Is Toil downloading all the input files from S3 to the local tmp first and then uploading them to the S3 job store, or is it doing an S3-to-S3 transfer directly?

Examples of the first two hours of GetObjectTask and IOWriteTask

[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_5] [D] [s3transfer.tasks] Executing task GetObjectTask(transfer_id=0, {'bucket': 'bucket1', 'key': 'test.fq.gz', 'extra_args': {'Range': 'bytes=37681627136-37690015743'}}) with kwargs {'client': <botocore.client.S3 object at 0x7f17d127adc0>, 'bucket': 'bucket1', 'key': 'test.fq.gz', 'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'extra_args': {'Range': 'bytes=37681627136-37690015743'}, 'callbacks': [], 'max_attempts': 5, 'start_index': 37681627136, 'download_output_manager': <s3transfer.download.DownloadSeekableOutputManager object at 0x7f17c05996d0>, 'io_chunksize': 262144, 'bandwidth_limiter': None}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37614256128}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37614256128}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37614256128}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37614256128}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_6] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37642043392}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_6] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37642043392}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37642043392}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37642043392}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37642043392}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_8] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37674024960}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_8] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37674024960}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37674024960}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37674024960}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37674024960}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_0] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37630246912}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_0] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37630246912}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37630246912}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37630246912}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37630246912}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_4] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37660393472}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_4] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37660393472}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37660393472}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37660393472}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37660393472}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_8] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37674287104}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_8] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37674287104}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37674287104}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37674287104}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37674287104}
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.utils] Releasing acquire 0/None
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_3] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37667995648}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_3] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_9] [D] [s3transfer.futures] Submitting task IOWriteTask(transfer_id=0, {'offset': 37650694144}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0521ca0> for transfer request: 0.
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-3_9] [D] [s3transfer.utils] Acquiring 0
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37667995648}) about to wait for the following futures []
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] IOWriteTask(transfer_id=0, {'offset': 37667995648}) done waiting for dependent futures
[2023-08-31T23:05:18+0000] [ThreadPoolExecutor-5_0] [D] [s3transfer.tasks] Executing task IOWriteTask(transfer_id=0, {'offset': 37667995648}) with kwargs {'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d11cb130>, 'offset': 37667995648}

Examples from the next hour, with ToilFsAccess, DownloadSubmissionTask, and ImmediatelyWriteIOGetObjectTask operations

[2023-09-01T01:02:51+0000] [MainThread] [D] [toil.cwl.cwltoil] ToilFsAccess has JobStore-supported URL s3://bucket1/GRCh38/ensembl/GRCh38.ensembl97.fasta.sa at /app/tmp/tmpvs9vhkq_
[2023-09-01T01:02:51+0000] [MainThread] [D] [toil.lib.aws.utils] Found in ParseResult(scheme='s3', netloc='bucket1', path='/GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', params='', query='', fragment='') items: []
[2023-09-01T01:02:51+0000] [MainThread] [D] [toil.cwl.cwltoil] ToilFsAccess fetching file s3://bucket1/GRCh38/ensembl/GRCh38.ensembl97.fasta.fai from a JobStore
[2023-09-01T01:02:52+0000] [MainThread] [D] [s3transfer.utils] Acquiring 0
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8272_0] [D] [s3transfer.tasks] DownloadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c072a100>}) about to wait for the following futures []
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8272_0] [D] [s3transfer.tasks] DownloadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c072a100>}) done waiting for dependent futures
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8272_0] [D] [s3transfer.tasks] Executing task DownloadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c072a100>}) with kwargs {'client': <botocore.client.S3 object at 0x7f17d127adc0>, 'config': <boto3.s3.transfer.TransferConfig object at 0x7f17c0a25370>, 'osutil': <s3transfer.utils.OSUtils object at 0x7f17d614ed90>, 'request_executor': <s3transfer.futures.BoundedExecutor object at 0x7f17c07957f0>, 'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c072a100>, 'io_executor': <s3transfer.futures.BoundedExecutor object at 0x7f17c0b2a340>}
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8272_0] [D] [s3transfer.futures] Submitting task ImmediatelyWriteIOGetObjectTask(transfer_id=0, {'bucket': 'bucket1', 'key': 'GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c07957f0> for transfer request: 0.
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8272_0] [D] [s3transfer.utils] Acquiring 0
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8271_0] [D] [s3transfer.tasks] ImmediatelyWriteIOGetObjectTask(transfer_id=0, {'bucket': 'bucket1', 'key': 'GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', 'extra_args': {}}) about to wait for the following futures []
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8271_0] [D] [s3transfer.tasks] ImmediatelyWriteIOGetObjectTask(transfer_id=0, {'bucket': 'bucket1', 'key': 'GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', 'extra_args': {}}) done waiting for dependent futures
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8271_0] [D] [s3transfer.tasks] Executing task ImmediatelyWriteIOGetObjectTask(transfer_id=0, {'bucket': 'bucket1', 'key': 'GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', 'extra_args': {}}) with kwargs {'client': <botocore.client.S3 object at 0x7f17d127adc0>, 'bucket': 'bucket1', 'key': 'GRCh38/ensembl/GRCh38.ensembl97.fasta.fai', 'fileobj': <tempfile._TemporaryFileWrapper object at 0x7f17d37dc880>, 'extra_args': {}, 'callbacks': [], 'max_attempts': 5, 'download_output_manager': <s3transfer.download.DownloadSeekableOutputManager object at 0x7f17c0d2a820>, 'io_chunksize': 262144, 'bandwidth_limiter': None}

Or some uploading to the S3 job store, with ToilFsAccess, CopySubmissionTask, and CreateMultipartUploadTask operations.

[2023-09-01T01:02:52+0000] [MainThread] [D] [toil.cwl.cwltoil] ToilFsAccess has JobStore-supported URL s3://bucket1/GRCh38/ensembl/GRCh38.ensembl97.dict at /app/tmp/tmpeuikpc7z
[2023-09-01T01:02:52+0000] [MainThread] [D] [s3transfer.utils] Acquiring 0
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.tasks] CopySubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c06554f0>}) about to wait for the following futures []
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.tasks] CopySubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c06554f0>}) done waiting for dependent futures
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.tasks] Executing task CopySubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c06554f0>}) with kwargs {'client': <botocore.client.S3 object at 0x7f17c0361b20>, 'config': <boto3.s3.transfer.TransferConfig object at 0x7f17d014d4f0>, 'osutil': <s3transfer.utils.OSUtils object at 0x7f17c00420a0>, 'request_executor': <s3transfer.futures.BoundedExecutor object at 0x7f17c0655430>, 'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f17c06554f0>}
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.futures] Submitting task CreateMultipartUploadTask(transfer_id=0, {'bucket': 'jobstore1--files', 'key': '513ff6db-e9d6-48ff-af45-b644a7bd0fad', 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0655430> for transfer request: 0.
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.utils] Acquiring 0
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8277_0] [D] [s3transfer.tasks] CreateMultipartUploadTask(transfer_id=0, {'bucket': 'jobstore1--files', 'key': '513ff6db-e9d6-48ff-af45-b644a7bd0fad', 'extra_args': {}}) about to wait for the following futures []
[2023-09-01T01:02:52+0000] [ThreadPoolExecutor-8278_0] [D] [s3transfer.futures] Submitting task CopyPartTask(transfer_id=0, {'bucket': 'jobstore1--files', 'key': '513ff6db-e9d6-48ff-af45-b644a7bd0fad', 'part_number': 1, 'extra_args': {'CopySourceRange': 'bytes=0-8388607'}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f17c0655430> for transfer request: 0.

The command line looks something like,

toil-cwl-runner --provisioner aws --batchSystem mesos --destBucket s3://bucket1/output/ --jobStore aws:us-west-1:jobstore1 --clean onSuccess --nodeStorage 500 --nodeTypes t2.medium:2,t2.large:2,t2.xlarge:2,c4.xlarge:2,c5.2xlarge:2,c5.4xlarge:2,r5.large:2,r5.xlarge:2,r5.2xlarge:2,r5.4xlarge:2,r5.8xlarge:3,r5d.xlarge:2,r5d.2xlarge:2,r5d.4xlarge:2,r5d.8xlarge:3,c5d.large:2,c5d.2xlarge:2,c5d.4xlarge:3 --maxNodes 15 --retryCount 5 --defaultPreemptable True --preemptableCompensation 0.2 --logFile /tmp/logfile.txt --maxLogFileSize 100000000 --writeLogs /app/logs/ --no-read-only --linkImports --rotatingLogging --logDebug /app/workflow.cwl /tmp/job.yaml
adamnovak commented 11 months ago

When we try and fail to do an S3 to S3 copy, we log a message that says that Amazon is refusing to perform a server-side copy of the file in question: https://github.com/DataBiosphere/toil/blob/ef2b923cd400a2644f3bfd0953a76c3208e9ab81/src/toil/jobStores/aws/utils.py#L386C1-L386C1

So if that message isn't there for a file, we should be doing a server-side copy. I think that the part where there are a bunch of messages about CopySubmissionTask and CreateMultipartUploadTask is the part where the S3 to S3 transfers happen.
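
If you want to confirm from your own log, something like this should surface it (the pattern is a guess; check the exact wording in the linked source):

grep -iE 'refus|server.side' /tmp/logfile.txt   # look for the warning that AWS refused a server-side copy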

But I think we're also downloading all the files to the local machine, and also I think leaving them there. For example, the second thing we do when importing the input files is to run cwltool's add_sizes function over all the inputs: https://github.com/DataBiosphere/toil/blob/ef2b923cd400a2644f3bfd0953a76c3208e9ab81/src/toil/cwl/cwltoil.py#L1619

But this calls back into the ToilFsAccess size() method, which we haven't set up to handle s3: URIs except by calling the base cwltool size():

https://github.com/DataBiosphere/toil/blob/ef2b923cd400a2644f3bfd0953a76c3208e9ab81/src/toil/cwl/cwltoil.py#L1324-L1326

And in turn that will call into ToilFsAccess's _abs() to download the file, and then look at the downloaded file just to get the size.

I think we need to:

  • Fix ToilFsAccess to support all operations on jobstore-implemented URI schemes without downloading. Remove the super() fallbacks.

  • Change import_files to notice when a file or directory structure would be copied from one S3 bucket to another, and (optionally?) don't do that, so we pass around the original S3 URI.

And on the WDL side:

unito-bot commented 11 months ago

➤ User 8dbbc commented:

Do we want --linkImports to control whether we copy or fetch from the remote URL every time? Or do we want a new option?

Do we need to beef up the caching system with data gravity to attract jobs? Or have a way to cache on a node after all the jobs leave?

Do we want a new CWL hint for co-scheduling and some way to support it in Toil?

mr-c commented 11 months ago

@mr-c I think we can address some of this by making ToilFsAccess a little smarter about not using _abs, and maybe getting all the ToilPathMappers and all the ToilFsAccesses to share a global cache with some kind of cleanup. But I think we might also need some help from the CWLTool side, because the APIs that StdFsAccess and PathMapper implement, and the guarantees they require, are not really fully documented as far as I can tell.

I'm happy to extend the cwltool API as needed, yes! Please let me know what needs clarifying/changing.

I think we need to:

  • Fix ToilFsAccess to support all operations on jobstore-implemented URI schemes without downloading. Remove the super() fallbacks.

Agreed. I found your branch and left some comments.

  • Change import_files to notice when a file or directory structure would be copied from one S3 bucket to another, and (optionally?) don't do that, so we pass around the original S3 URI.

Hmm.. That function doesn't have a concept of a destination.