DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

help with aws_batch #4377

Open larryns opened 1 year ago

larryns commented 1 year ago

Hi,

I'm running the cactus examples on Toil 5.8.0 and all of my jobs are failing. The error log seems to indicate that the workers can't find their cache file:

[2023-02-07T15:04:27+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
Exit reason: BatchJobExitReason.FAILED
[2023-02-07T15:04:27+0000] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v2
[2023-02-07T15:04:27+0000] [MainThread] [W] [toil.leader] Log from job "ae54b352-bd8b-4bd8-a031-d58cc1b82488" follows:
=========>
        [2023-02-07T15:04:22+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2023-02-07T15:04:22+0000] [MainThread] [I] [toil] Running Toil version 5.8.0-79792b70098c4c18d1d2c2832b72085893f878d1 on host ip-172-20-21-26.ec2.internal.
        [2023-02-07T15:04:22+0000] [MainThread] [I] [toil.worker] Working on job 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
        [2023-02-07T15:04:24+0000] [MainThread] [I] [toil.worker] Loaded body Job('PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1) from description 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
        [2023-02-07T15:04:24+0000] [MainThread] [I] [toil.statsAndLogging] Preparing sequence for preprocessing
        [2023-02-07T15:04:24+0000] [MainThread] [I] [toil.statsAndLogging] Chunks = ['/var/lib/toil/42da40e50de8550a9044b9881a6cd594/5c45/e819/tmpdex7dpmu.tmp']
        [2023-02-07T15:04:24+0000] [MainThread] [I] [toil.job] Saving graph of 2 jobs, 1 new
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.job] Processing job 'CutHeadersJob' dc6f478e-7961-4552-8c3a-b010ad4008b0 v0
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.job] Processing job 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.worker] Completed body for 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.worker] Not chaining from job 'PreprocessSequence' ae54b352-bd8b-4bd8-a031-d58cc1b82488 v1
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.worker] Worker log can be found at /var/lib/toil/42da40e50de8550a9044b9881a6cd594/5c45. Set --cleanWorkDir to retain this log
        [2023-02-07T15:04:25+0000] [MainThread] [I] [toil.worker] Finished running the chain of jobs on this node, we ran for a total of 3.252878 seconds
        Exception in thread Thread-11:
        Traceback (most recent call last):
          File "/usr/lib/python3.9/threading.py", line 980, in _bootstrap_inner
            self.run()
          File "/usr/lib/python3.9/threading.py", line 917, in run
            self._target(*self._args, **self._kwargs)
          File "/usr/local/lib/python3.9/dist-packages/toil/fileStores/cachingFileStore.py", line 1822, in startCommitThread
            self._executePendingUploads(con, cur)
          File "/usr/local/lib/python3.9/dist-packages/toil/fileStores/cachingFileStore.py", line 763, in _executePendingUploads
            self.jobStore.update_file(fileID, filePath)
          File "/usr/local/lib/python3.9/dist-packages/toil/jobStores/aws/jobStore.py", line 536, in update_file
            info.upload(local_path, not self.config.disableJobStoreChecksumVerification)
          File "/usr/local/lib/python3.9/dist-packages/toil/jobStores/aws/jobStore.py", line 1091, in upload
            file_size, file_time = fileSizeAndTime(localFilePath)
          File "/usr/local/lib/python3.9/dist-packages/toil/jobStores/aws/utils.py", line 191, in fileSizeAndTime
            file_stat = os.stat(localFilePath)
        FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/toil/42da40e50de8550a9044b9881a6cd594/cache-67b1bb50-2355-4a3c-9571-490cefa4ecf8/tmplek36_7hb848d4de445426cdfd147f751d2ac8f17770db81'
<=========

My command line is:

        --consCores 8 \
        --awsBatchQueue ****** \
        --awsBatchRegion us-east-1 \
        --provisioner aws \
        --batchSystem aws_batch \
        --nodeStorage 150 \
        --cleanWorkDir onSuccess \
        --writeLogs /cactus/examples/logs \
        --defaultDisk 6G \
        --realTimeLogging \
        aws:us-east-1:cactus-examples \
        evolverMammals.txt out.hal \
        > cactus.o 2>cactus.e

Can anyone point me to what the problem might be?

Thanks, Larry.


adamnovak commented 1 year ago

We've only ever used the AWS Batch system as part of running Toil under AWS Genomics CLI, which can use it to set up a WES server to run CWL workflows.

In that environment, the Toil WES server is run under this script, and that script sets up workflows to always run with --disableCaching to turn off the caching system.

We set it up that way because AWS Batch doesn't actually support scheduling jobs onto nodes while accounting for disk space as a resource; AGC deals with that by running a daemon on the nodes that tries to grow the nodes' local disks faster than they can fill up. The Toil caching system assumes that the amount of space on the filesystem when it starts up is the amount it is going to have available, and doesn't expect the disks to grow like that. We also haven't wanted to work out if and when things should be evicted from the cache when the disk just grows indefinitely.

If you add --disableCaching to the workflow command line, it should work around this particular problem. As long as you also have automatically-growing disks on your nodes, or your nodes' local disks are already big enough to handle all the jobs in your workflow that would otherwise be scheduled onto them, it ought to work.
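For example, something like this (a sketch only: the leading cactus invocation is my guess, since the start of your command got cut off in the snippet, and it assumes cactus passes the standard Toil options through, which it normally does):

        # Same options as posted above, with --disableCaching added.
        # The "cactus" entry point here is assumed; substitute your actual command.
        cactus \
            --consCores 8 \
            --awsBatchQueue ****** \
            --awsBatchRegion us-east-1 \
            --provisioner aws \
            --batchSystem aws_batch \
            --nodeStorage 150 \
            --cleanWorkDir onSuccess \
            --writeLogs /cactus/examples/logs \
            --defaultDisk 6G \
            --realTimeLogging \
            --disableCaching \
            aws:us-east-1:cactus-examples \
            evolverMammals.txt out.hal \
            > cactus.o 2>cactus.e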

If your disks aren't resizing themselves, I'm not sure why the caching system wouldn't work just because you are on AWS Batch. It looks like, at the end of the job, when the caching system goes to upload a particular file that the job wrote, the file has vanished and is no longer there to be copied to S3. But nothing in the log indicates where it went or who might have deleted it; for that we might need the debug-level logs for the whole workflow.
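If you want to capture those, something like this should do it (a sketch: --logDebug and --logFile are standard Toil logging options, and cactus normally passes them through; the log file path is just an example):

        # Added to the existing command line, before the positional arguments:
        # --logDebug turns on debug-level logging everywhere, and --logFile
        # keeps a copy of the leader log in addition to the per-job logs that
        # --writeLogs already saves.
        --logDebug \
        --logFile /cactus/examples/logs/leader-debug.log \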

@larryns How exactly are you setting up AWS Batch to use it with Toil, if you aren't going through AGC? Is there a particular AMI you are using as a host machine? Do you have some kind of disk-growing daemon? Are you using some kind of AWS-provided fancy elastic filesystem to back /var/lib/toil on the host?
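As a quick way to check that last question, something like this run on a worker host (plain df/lsblk, nothing Toil-specific) would show whether the filesystem backing /var/lib/toil grows while jobs are running:

        # Run on a Batch worker host while jobs are executing: report the
        # size and free space of the filesystem behind Toil's work directory
        # once a minute, plus the block devices, to see if anything resizes.
        while true; do
            date
            df -h /var/lib/toil
            lsblk
            sleep 60
        done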

larryparatus commented 1 year ago

Thanks for the info, @adamnovak. To give you a little more background: I'd gotten aws_batch to work with the examples, but then I upgraded from 5.8.0 to 5.9.2 and something broke. I downgraded back to 5.8.0 and couldn't get it to work again. For the life of me, I can't figure out what's different.

We're not using AGC; the batch environments and queues were set up pretty much through the console. We're using ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220912 for the AMI (ami-08c40ec9ead489470) on the host node. I don't believe there's anything out of the ordinary for the disk other than basic EBS. But I'll do some more digging.

Thanks again.