Open larryns opened 1 year ago
We've only ever used the AWS Batch batch system as part of running Toil under the AWS Genomics CLI (AGC), which can use it to set up a WES server to run CWL workflows.
In that environment, the Toil WES server is run under this script, and that script sets up workflows to always run with --disableCaching to turn off the caching system.
We set it up that way because AWS Batch doesn't actually support scheduling jobs onto nodes while accounting for disk space as a resource; AGC works around that with a daemon on each node that tries to grow the node's local disk faster than it can fill. The Toil caching system assumes that the amount of space on the filesystem when it starts up is the amount it is going to have available, and doesn't expect the disks to grow like that. We also haven't wanted to think about how to decide if and when to evict things from the cache if the disk just grows indefinitely.
If you add --disableCaching to the workflow command line, it should work around this particular problem. As long as you also have automatically-growing disks on your nodes, or your nodes' local disks are already big enough to handle running all the jobs in your workflow that would otherwise fit to schedule on them, it ought to work.
If your disks aren't resizing themselves, I'm not sure why the caching system wouldn't work just because you are on AWS Batch. It looks like, when the caching system goes to upload a particular file that this job wrote, at the end of the job, the file has vanished and is no longer there to be copied to S3. But nothing in the log indicates where it has gone or who might have deleted it; for that we might need the debug-level logs for the whole workflow.
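For example, a run with caching disabled and debug logging turned on might look something like this (a sketch, not a drop-in command: the job store name, workflow file, and inputs file here are all hypothetical placeholders):

```shell
toil-cwl-runner \
    --batchSystem aws_batch \
    --jobStore aws:us-west-2:my-jobstore \
    --disableCaching \
    --logLevel DEBUG \
    workflow.cwl inputs.yaml
```

The `--logLevel DEBUG` option is what would produce the debug-level workflow logs mentioned above.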
@larryns How exactly are you setting up AWS Batch to use it with Toil, if you aren't going through AGC? Is there a particular AMI you are using as a host machine? Do you have some kind of disk-growing daemon? Are you using some kind of AWS-provided fancy elastic filesystem to back /var/lib/toil on the host?
Thanks for the info, @adamnovak. To give you a little more background: I'd gotten the aws_batch batch system to work with the examples, but then upgraded from 5.8.0 to 5.9.2 and something broke. I downgraded back to 5.8.0 and couldn't get it to work again. For the life of me, I can't figure out what's different.
We're not using AGC; the batch environments and queues were set up pretty much through the console. We're using ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220912 for the AMI (ami-08c40ec9ead489470) on the host node. I don't believe there's anything out of the ordinary for the disk other than basic EBS. But I'll do some more digging.
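As a quick sanity check on the disk question, something like the following on the host node would show whether the filesystem backing Toil's work directory is plain EBS at a fixed size or is being resized mid-run (the /var/lib/toil mount point is an assumption; substitute wherever your jobs actually write):

```shell
# Report the size and free space of the filesystem backing Toil's
# work directory; fall back to the root filesystem if the assumed
# path doesn't exist on this host. Running this before and during a
# workflow would reveal whether the disk is growing.
df -h /var/lib/toil 2>/dev/null || df -h /
```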
Thanks again.
Hi,
I'm running the Cactus examples on Toil 5.8.0 and having issues with all my jobs. The error log seems to indicate that the workers can't find their cache file:
My command line is:
Can anyone point me to what the problem might be?
Thanks, Larry.
Issue is synchronized with this Jira Story. Issue Number: TOIL-1288