Closed: mr-c closed this issue 2 years ago.
I think that the relevant error may be:
cwltool.errors.WorkflowException: Singularity is not available for this tool, try --no-container to disable Singularity, or install a user space Docker replacement like uDocker with --user-space-docker-cmd.: Command '['singularity', 'pull', '--force', '--name', 'debian:stretch-slim.sif', 'docker://debian:stretch-slim']' returned non-zero exit status 255.
So singularity may need some extra tweaking.
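One way to narrow this down would be to rerun the failing pull by hand on a worker node and look at the exit status; this is just a reproduction sketch, with the pull command copied from the error message above and singularity --version added only to confirm which Singularity build is installed:

# Confirm which Singularity is installed on the node.
singularity --version
# Rerun the exact pull that Toil's Singularity wrapper attempted.
singularity pull --force --name debian:stretch-slim.sif docker://debian:stretch-slim
echo "singularity pull exited with status $?"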
I'm not sure where you got that message from, @DailyDreaming.
This looks like the same basic problem as I'm complaining about in https://github.com/DataBiosphere/toil/pull/3802#issuecomment-964646207
Some of the CWL tests are timing out:
Test 258 timed out: toil-cwl-runner --disableCaching=True --clean=always --logDebug --setEnv=SINGULARITY_DOCKER_HUB_MIRROR --batchSystem=kubernetes --outdir=/tmp/tmpbemuf758 --quiet tests/string-interpolation/bash-line-continuation.cwl
Line continuations in bash scripts should behave correctly
Terminating lingering process
Test 291 timed out: toil-cwl-runner --disableCaching=True --clean=always --logDebug --setEnv=SINGULARITY_DOCKER_HUB_MIRROR --batchSystem=kubernetes --outdir=/tmp/tmpnsm2kvzl --quiet tests/conditionals/cond-wf-003.1_nojs.cwl tests/conditionals/second-true.yml
pickValue: first_non_null second item is non null; no javascript
Test 80 timed out: toil-cwl-runner --disableCaching=True --clean=always --logDebug --setEnv=SINGULARITY_DOCKER_HUB_MIRROR --batchSystem=kubernetes --outdir=/tmp/tmp30r2oy9y --quiet 'tests/scatter-valuefrom-wf3.cwl#main' tests/scatter-valuefrom-job2.json
Test workflow scatter with two scatter parameters and flat_crossproduct join method and valueFrom on step input
Terminating lingering process
Test 191 timed out: toil-cwl-runner --disableCaching=True --clean=always --logDebug --setEnv=SINGULARITY_DOCKER_HUB_MIRROR --batchSystem=kubernetes --outdir=/tmp/tmpmlrcye6p --quiet tests/cat-from-dir.cwl tests/cat-from-dir-job.yaml
Pipe to stdin from user provided local File via a Directory literal
Terminating lingering process
It looks like different tests time out in different runs, and the test run as a whole is getting to what looks like the right total number of tests.
The only thing I can think of to blame here is how I recently changed the GI Kubernetes scheduler to put jobs on busier nodes instead of less busy nodes, to prepare for autoscaling. We could try reverting that change and see whether the Toil CI tests start passing. We could also watch Kubernetes while one of these test runs is in progress to see if jobs are sitting around in the queue.
@mr-c How long does the CWL test harness give each test to run, and how accurate are the resource requirements on all the CWL test jobs? Will they all finish in time if given the resources they ask for (with cores really being hyperthread half-cores) and no more?
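On the "watch Kubernetes while a run is in progress" idea, something along these lines should show whether jobs are just sitting in the queue; the toil namespace is the one the manual tests further down use, and the field selectors are standard kubectl ones:

# Pods that have been accepted but not scheduled onto a node yet (i.e. waiting in queue).
kubectl -n toil get pods --field-selector=status.phase=Pending -o wide

# Any scheduler complaints about why pending pods are not being placed.
kubectl -n toil get events --field-selector=reason=FailedScheduling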
We might need to remove the --quiet from the toil-cwl-runner invocations to actually get the logs for the timed-out tests, so we can see what, if anything, is going wrong. I'm not sure whether those logs will clobber each other, given that we're running several of these tests in parallel; I don't know if the CWL test harness does anything to de-interleave the logs from multiple tests.
@mr-c I think --quiet is coming from cwltest, but it shouldn't be, because we are passing --verbose to cwltest. Can you figure out if we're just using an old cwltest version or something?
@adamnovak The message came not from master but from a branch where I reverted the last commit (TES): https://ucsc-ci.com/databiosphere/toil/-/jobs/1310/raw
We might just be hitting Docker Hub rate limits (as a separate issue) there, then. Hopefully the mirror is working, but we'll see whether the error re-occurs when testing future commits.
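If it is rate limiting, the mirror can be sanity-checked on its own; mirror.gcr.io here is just the mirror address used in the manual job spec further down, not necessarily what SINGULARITY_DOCKER_HUB_MIRROR is set to on the CI runners:

# Confirm the mirror variable is actually present in the test environment.
env | grep SINGULARITY_DOCKER_HUB_MIRROR

# Pull the same image through the mirror instead of Docker Hub directly.
singularity pull --force --name debian-via-mirror.sif docker://mirror.gcr.io/library/debian:stretch-slim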
My PR #3891, where I break --quiet to get logs, didn't actually have any CWL Kubernetes timeouts. But the Google job store tests in the next stage also fail:
______________________ GoogleJobStoreTest.testBatchCreate ______________________
Traceback (most recent call last):
  File "/builds/TnycNQM8/2/databiosphere/toil/src/toil/test/jobStores/jobStoreTest.py", line 125, in setUp
    self.jobstore_initialized.initialize(self.config)
  File "/builds/TnycNQM8/2/databiosphere/toil/src/toil/jobStores/googleJobStore.py", line 81, in wrapper
    return f(*args, **kwargs)
  File "/builds/TnycNQM8/2/databiosphere/toil/src/toil/jobStores/googleJobStore.py", line 134, in initialize
    self.bucket = self.storageClient.create_bucket(self.bucketName)
  File "/builds/TnycNQM8/2/databiosphere/toil/venv/lib/python3.8/site-packages/google/cloud/storage/client.py", line 229, in create_bucket
    bucket.create(client=self)
  File "/builds/TnycNQM8/2/databiosphere/toil/venv/lib/python3.8/site-packages/google/cloud/storage/bucket.py", line 276, in create
    api_response = client._connection.api_request(
  File "/builds/TnycNQM8/2/databiosphere/toil/venv/lib/python3.8/site-packages/google/cloud/_http.py", line 293, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 POST https://www.googleapis.com/storage/v1/b?project=toil-dev: toil-498@toil-dev.iam.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project.
Looks like the new credentials we issued don't have the permissions they need.
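If the only problem is the missing storage.buckets.create permission, granting the service account from the traceback a role that includes it should be enough; roles/storage.admin is the broad option sketched here, and a narrower custom role would also work:

# Give the CI service account bucket-creation rights on the toil-dev project.
# roles/storage.admin includes storage.buckets.create (and more).
gcloud projects add-iam-policy-binding toil-dev \
    --member="serviceAccount:toil-498@toil-dev.iam.gserviceaccount.com" \
    --role="roles/storage.admin"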
@mr-c I think --quiet is coming from cwltest, but it shouldn't be, because we are passing --verbose to cwltest. Can you figure out if we're just using an old cwltest version or something?
Hmm.. toil is using the latest cwltest, 2.2.20210901154959, so I don't know where that is coming from.
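For completeness, whatever environment the conformance tests actually run in can be checked directly, e.g.:

# Show the cwltest (and cwltool) versions installed where the tests run.
pip show cwltest cwltool | grep -E '^(Name|Version)'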
Turning down the parallelism made this less likely to happen, I think, but it still happens.
@mr-c fixed the XML generation, so here's the log from "Simple scatter: Add conditional variable to scatter; no javascript" timing out in https://ucsc-ci.com/databiosphere/toil/-/jobs/2448:
I have some evidence from #3936 that this might be caused by jobs getting lost. I got a log that looks like this:
[2021-11-19T20:58:03+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2021-11-19T20:58:04+0000] [Thread-12 ] [D] [toil.jobStores.aws.jobStore] Starting sha1 checksum to match 9bf88d90e19feff90e116296b35668040cabac00
[2021-11-19T20:58:04+0000] [Thread-12 ] [D] [toil.jobStores.aws.jobStore] Completed checksum with hash 9bf88d90e19feff90e116296b35668040cabac00 vs. expected 9bf88d90e19feff90e116296b35668040cabac00
[2021-11-19T20:58:04+0000] [Thread-10 ] [D] [toil.statsAndLogging] Got message from job at time 11-19-2021 20:58:04: Job 'CWLJob' echo c9609463-4cf4-4433-aaeb-9537648405e3 v1 used 0.00% disk (4.0 KiB [4096B] used, 3.0 GiB [3221225472B] requested).
[2021-11-19T20:58:04+0000] [Thread-10 ] [D] [toil.statsAndLogging] Received Toil worker log. Disable debug level logging to hide this output
[2021-11-19T20:58:04+0000] [Thread-10 ] [D] [toil.statsAndLogging] Log from job "'CWLJob' echo c9609463-4cf4-4433-aaeb-9537648405e3 v1" follows:
=========>
...
<=========
[2021-11-19T20:58:13+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2021-11-19T20:58:23+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2021-11-19T20:58:34+0000] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
We have a job sent to run on Kubernetes; it returns an (apparently successful) log and stops showing up in the list of running jobs, but the Kubernetes batch system never emits it as an "updated" job when it finishes. I think something happens to the job on the Kubernetes side: it gets dropped from Kubernetes after it finishes, despite not having a TTL set on it, because it isn't showing up in my code that's supposed to show me a job that still exists.
I'm now trying to set the missing-job rescuer to run frequently enough that the CWL tests will actually exercise it, and I've expanded logging a bit more. Perhaps this will produce some new insights.
I did some experiments with manually created jobs, and they are definitely getting cleaned up after a few minutes on our cluster, with no TTL. Our admin Erich Weiler says we do have a job cleanup bot, but it is supposed to strike when the job has been sitting around completed for days.
rm -f jobs.yaml
for ITER in {1..1} ; do
    echo "Add ${ITER}"
    cat >>jobs.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: adamnovak-floodtest-
  labels:
    app: adamnovak-test
spec:
  template:
    spec:
      containers:
      - name: main
        image: mirror.gcr.io/library/ubuntu:20.04
        command: ["echo", "test"]
        resources:
          limits:
            memory: 100M
            cpu: 100m
            ephemeral-storage: 1G
          requests:
            memory: 100M
            cpu: 100m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 0
---
EOF
done
kubectl create -n toil -f jobs.yaml >create-log.txt 2>&1
cat create-log.txt | while read LINE ; do
    JOB_NAME=$(echo ${LINE} | grep -o "adamnovak-floodtest-[0-9a-zA-Z]*")
    TIMES_SEEN=1
    while true ; do
        kubectl -n toil get job "${JOB_NAME}" -o yaml >job-seen.tmp
        if [[ "${?}" != "0" ]] ; then
            echo "Job ${JOB_NAME} reported as created but is missing"
            rm job-seen.tmp
            break
        else
            echo "Job ${JOB_NAME} still exists; seen ${TIMES_SEEN} times"
            mv job-seen.tmp job-seen.yaml
            sleep 1
            TIMES_SEEN=$((TIMES_SEEN+1))
        fi
    done
done
Actually adding a ttlSecondsAfterFinished to the job spec seems to protect them from being cleaned up before then, so I'm trying to teach Toil to set a long enough TTL that it has time to get back to the finished jobs before they are cleaned up.
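As a sanity check that the TTL is what protects the jobs, the manual flood test above can be repeated with the one extra field; 3600 seconds and the -ttl- name are just illustration values, not what Toil will actually use:

# Same kind of throwaway Job as in the flood test, but with a TTL so Kubernetes
# waits (here an hour) after the Job finishes before garbage-collecting it.
kubectl create -n toil -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: adamnovak-floodtest-ttl-
  labels:
    app: adamnovak-test
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      containers:
      - name: main
        image: mirror.gcr.io/library/ubuntu:20.04
        command: ["echo", "test"]
        resources:
          limits:
            memory: 100M
            cpu: 100m
            ephemeral-storage: 1G
          requests:
            memory: 100M
            cpu: 100m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 0
EOF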
This is back again in https://ucsc-ci.com/databiosphere/toil/-/jobs/3545
I think adding the TTL helped, and our cluster just has a chaos gremlin that deletes some finished jobs no matter what we do. We could switch the whole Kubernetes batch system over to using a watcher thread and subscribing to updates, or we can just make the lost job rescue code run often enough that it can kick in in the CWL conformance tests.
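The watcher idea is essentially the programmatic version of leaving this running and reacting to each Job state change as it is reported, instead of polling:

# Stream Job updates from the cluster as they happen; Toil's batch system would do the
# equivalent through the Kubernetes client's watch API rather than the CLI.
kubectl -n toil get jobs --watch -o wide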
But not the normally run CWL conformance tests
Failed twice: https://ucsc-ci.com/databiosphere/toil/-/jobs/1159#L907 https://ucsc-ci.com/databiosphere/toil/-/jobs/1102#L907
Any thoughts, @adamnovak?
Issue is synchronized with this Jira Task. Issue Number: TOIL-1075