NCI-CGR / gatk-sv

A structural variation pipeline for short-read sequencing - modified to run on HPC
BSD 3-Clause "New" or "Revised" License

Single-sample pipeline testing on Biowulf #1

Open LauraEgolf opened 2 months ago

LauraEgolf commented 2 months ago

Independent user test of GATK-SV single-sample pipeline on Biowulf

Ben previously configured the GATK-SV single-sample pipeline for Biowulf and tested using a COVNET WGS sample. I ran an independent test using one of the osteosarcoma WGS samples, CCSS_1000278_A.

The single-sample pipeline processes a single test sample jointly with a reference panel. It reduces computational time somewhat by using certain precomputed inputs, but it is still much less computationally efficient than the cohort/batch mode (which is best suited to 100+ samples). Here we used the reference panel of 156 samples from 1000 Genomes that is provided with GATK-SV (the same panel used in the example Terra workspace).

Note that, in my experience, the single-sample pipeline generally only works for PCR-free WGS samples.

Working directory: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/

Build inputs

Help message for build_inputs.py:

positional arguments:
    input_values_directory  Directory containing input value map JSON files
    template_path           Path to template directory or file (directories will be processed recursively)
    output_directory        Directory to create output files in

optional arguments:
    -h, --help              show this help message and exit
    -a ALIASES              Aliases for input value bundles
    --log-info              Show INFO-level logging messages. Use for troubleshooting.

Example from GATK-SV documentation:

# Build test files for the single-sample workflow
python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test/GATKSVPipelineSingleSample \
    inputs/build/NA19240/test_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'

Ben's example:

BASE_DIR=/data/COVID_WGS/StructuralVariantCalling/gatk-sv 
scripts/inputs/build_inputs.py \
    ${BASE_DIR}/inputs/values \
    ${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
    ${BASE_DIR}/inputs/build/SC695914/test \
    -a '{ "single_sample" : "test_single_sample_SC695914.json", "ref_panel" : "ref_panel_1kg" }'

Setup for my test run:

BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv 
${BASE_DIR}/scripts/inputs/build_inputs.py \
    ${BASE_DIR}/inputs/values_20240405 \
    ${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
    ${BASE_DIR}/inputs/build_20240405/CCSS_1000278_A \
    -a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'

Run pipeline and troubleshoot

First test run:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
swarm run_gatk-sv_single_sample_no_melt.swarm

Got an error with the GATK jar. Added GATK to the modules list (in the swarm command) and re-ran, but got the same error.

Commented this line out of all GATK-SV WDLs (under /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv): export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}. This solved the GATK jar error.
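For reference, a minimal sketch of how that line could be commented out in bulk (assuming GNU sed and that the export line appears with this exact spelling in the WDLs):

# Find the WDLs that set GATK_LOCAL_JAR, then comment that line out in place
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv
grep -rl 'export GATK_LOCAL_JAR=' wdl/
sed -i 's|^\( *\)export GATK_LOCAL_JAR=|\1#export GATK_LOCAL_JAR=|' wdl/*.wdl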

Got an error pulling the ubuntu image from Google Container Registry; tested again after initializing gcloud:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405
sinteractive  # Can't load google-cloud-sdk on login nodes
module load google-cloud-sdk
gcloud init  # Follow prompts - login (egolfle@nih.gov), select project (nih-nci-dceg-covnet-wgs)
swarm run_gatk-sv_single_sample_no_melt.swarm

Same error.

Update Docker inputs to avoid using GCR ubuntu container

Updated values file to avoid the problematic container:

/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv/inputs/values_20240405/dockers.json

Replaced: "linux_docker": "marketplace.gcr.io/google/ubuntu1804", With: "linux_docker": "ubuntu:18.04",

Re-ran build_inputs.py and submitted a new test run, which succeeded.

Results

Runtime

Runtime estimate from the example Terra workspace (I assume this uses GCP preemptible instances):

[Image: runtime estimate from the example Terra workspace]

Note that the 18 GB input size in that estimate is based on a ~30x CRAM file; here we used a 36x BAM file (77 GB).

Runtime on Biowulf: 9 hours 13 minutes (based on the main swarm job runtime, and the timestamps on the output directory).

Output structure

An explanation of the output can be found on the Terra workspace for the GATK-SV single-sample pipeline:

[Image: output structure description from the Terra workspace]

Results from this run: /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/gatk-sv-results

QC checks

The GATK-SV single-sample pipeline has several built-in QC checks. This sample mostly passed them, which is a good sign (when I've run this pipeline in the past, my PCR+ samples failed the QC checks horribly). However, it was flagged because certain SV counts fell outside the 'normal' ranges defined by the pipeline developers:

[Image: QC check results showing flagged SV counts]

In particular, the count of deletions >100 kb is extremely high. I will investigate this further.
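As a starting point for that follow-up, here is one way to reproduce the flagged count from the final VCF (a sketch only; it assumes bcftools is available, that GATK-SV records positive INFO/SVLEN values, and the VCF path below is a placeholder to substitute from gatk-sv-results):

module load bcftools
# Placeholder path -- substitute the final single-sample VCF from gatk-sv-results
VCF=path/to/final_output.vcf.gz
# Count PASS deletions at least 100 kb long
bcftools view -H -i 'INFO/SVTYPE="DEL" && INFO/SVLEN>=100000 && FILTER="PASS"' $VCF | wc -l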

LauraEgolf commented 2 months ago

"Clean" test to check GitHub repo

I repeated the single-sample test to check that all the necessary changes are synced with GitHub, since some earlier changes had only been made locally. I used a new base directory, /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023, and cloned the repo from GitHub. I created a new branch (osteo-testing) with a few changes.

The example inputs require the Broad resource files stored in /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/broad-data. Other than that, I think all the required files are synced with GitHub.

Set up new test run:

BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
  ${BASE_DIR}/inputs/values \
  ${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
  ${BASE_DIR}/inputs/build_20240514/CCSS_1000278_A_newtest \
  -a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023
swarm run_gatk-sv_single_sample_no_melt_20240514.swarm

UPDATE: This test failed due to cromwell-related errors. I tested with a different config file (biowulf-swarm.conf instead of biowulf-core.conf), which resolved the problem.

The osteo-testing branch is up-to-date and succeeded for at least one test run - but other test runs failed, as described below.

LauraEgolf commented 2 months ago

Setting up inputs for multiple runs of the single-sample pipeline

This is an example of the process I followed to set up multiple runs of the single-sample pipeline for the 95 osteo WGS samples. We'd want to improve this process if we end up running the single-sample pipeline a lot (it's only applicable to small batches of samples; for batches of >100 we would use cohort mode).

Create a CSV file with sample names and BAM info:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json

ls /data/DCEG_Pediatric_Sarcomas/GenCompass/pediatric_sarcoma_analysis_id/workflow_results/fq2bam/*/*.bam > bam_list.txt

echo "NAME,SAMPLE_ID,BAM_CRAM" > bam_info.csv

cat bam_list.txt | while read -r bampath; do
  samplename=$(basename "$bampath" .bam)
  echo "${samplename}" >> sample_list.txt   # Will use this sample list in later commands
  echo "${samplename},${samplename},${bampath}" >> bam_info.csv
done

rm bam_list.txt
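A quick sanity check after building the lists (bam_info.csv should contain one header line plus one row per sample):

# Counts should match: samples in sample_list.txt vs. data rows in bam_info.csv
wc -l < sample_list.txt
tail -n +2 bam_info.csv | wc -l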

Run Ben's Python script (slightly modified) to create input json files:

# Create a copy of the values directory to output the new json files
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs
cp -r values values_20240514_multisample_test

cd test_create_multisample_json
python gatk-sv_batch_input.py --input bam_info.csv --template single_sample_input_template.json --output_directory ../values_20240514_multisample_test/
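A hedged check that the script produced what the next step expects, assuming it writes one <sample>_input.json per sample (as implied by the aliases used below):

# One values JSON per sample, matching the "<sample>_input" aliases used in the build step
ls ../values_20240514_multisample_test/*_input.json | wc -l
wc -l < sample_list.txt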

Run build_inputs.py (one command per sample) to create build files:

# Example for one sample: 
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
  ${BASE_DIR}/inputs/values_20240514_multisample_test \
  ${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
  ${BASE_DIR}/inputs/build_20240514_multisample_test/OSTE_OSETO0001579_A \
  -a '{ "single_sample" : "OSTE_OSETO0001579_A_input", "ref_panel" : "ref_panel_1kg" }'

###  Loop across multiple samples:

BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt 

cat $sample_list | while read -r samplename; do
  alias_string='{ "single_sample" : "'${samplename}'_input", "ref_panel" : "ref_panel_1kg" }'

  ${BASE_DIR}/scripts/inputs/build_inputs.py \
    ${BASE_DIR}/inputs/values_20240514_multisample_test \
    ${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
    ${BASE_DIR}/inputs/build_20240514_multisample_test/${samplename} \
    -a "$alias_string"
done
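To confirm the loop produced a build directory with the no-MELT input JSON for every sample (the file name comes from the swarm commands below):

# Expect one GATKSVPipelineSingleSample.no_melt.json per sample
find ${BASE_DIR}/inputs/build_20240514_multisample_test -name 'GATKSVPipelineSingleSample.no_melt.json' | wc -l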

And, finally, write a swarm command for each sample:

cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023

cat $sample_list | while read -r samplename; do
  echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-core.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test.swarm
done

## UPDATE: use biowulf-swarm.conf instead of biowulf-core.conf (avoids cromwell errors caused by multiple jobs trying to write to the same database at the same time)
cat $sample_list | while read -r samplename; do
  echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-swarm.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig.swarm
done

# Manually added header to swarm file

## Submit first 3 samples in a test swarm
swarm run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig_first3.swarm
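Standard Slurm commands are enough to keep an eye on the test swarm (swarm prints the parent job ID at submission; the log naming follows the --logdir setting in the swarm header):

# Monitor the swarm and the subjobs Cromwell spawns
squeue -u $USER
# Follow a swarm log once the job ID is known
tail -f swarm_logs/swarm_<jobid>_0.o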
LauraEgolf commented 2 months ago

Results from testing multiple single-sample runs

Using the swarm file I created (see previous comment), I ran two sets of samples on the single-sample pipeline. (I submitted the first set, then the second set several hours later).

Most of these samples failed (more details below). Of the 13 samples I submitted, only the first sample succeeded.

Also, Biowulf notified me of a "short job" warning (I think I would have received this warning even if the jobs had succeeded):

SHORT JOBS:  32451 [median runtime 0.1 minutes]
From:        2024-05-19 09:02:09
Until        2024-05-20 09:02:09

First set (n=3 samples)

26632587_0 - Succeeded (11 hr 11 min). Note this was the same sample from my original test.

26632587_1 - Canceled - Swarm job got stuck at AnnotateVcf and continued printing repetitive errors, even though no subjobs were running. There were no subjob failures.

This is the first error that appeared:

[2024-05-20 00:54:34,24] [error] WriteMetadataActor Failed to properly process data
java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.addBatch(Unknown Source)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.addBatch(HikariProxyPreparedStatement.java)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$19(JdbcActionComponent.scala:540)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$19$adapted(JdbcActionComponent.scala:538)
        at scala.collection.immutable.VectorStatics$.foreachRec(Vector.scala:1895)
        at scala.collection.immutable.Vector.foreach(Vector.scala:1901)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$18(JdbcActionComponent.scala:538)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement(JdbcBackend.scala:427)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement$(JdbcBackend.scala:422)
        at slick.jdbc.JdbcBackend$BaseSession.withPreparedStatement(JdbcBackend.scala:491)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl.preparedInsert(JdbcActionComponent.scala:511)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.run(JdbcActionComponent.scala:536)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:28)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:25)
        at slick.dbio.DBIOAction$$anon$1.$anonfun$run$1(DBIOAction.scala:186)
        at scala.collection.immutable.Vector.foreach(Vector.scala:1895)
        at slick.dbio.DBIOAction$$anon$1.run(DBIOAction.scala:186)
        at slick.dbio.DBIOAction$$anon$1.run(DBIOAction.scala:183)
        at slick.dbio.SynchronousDatabaseAction$$anon$7.run(DBIOAction.scala:486)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.liftedTree1$1(BasicBackend.scala:276)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.run(BasicBackend.scala:276)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.SessionData.allocateLobForResult(Unknown Source)
        at org.hsqldb.Session.allocateResultLob(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.performPreExecute(Unknown Source)
        ... 24 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.persist.LobStoreMem.setBlockBytes(Unknown Source)
        at org.hsqldb.persist.LobManager.setBytesISNormal(Unknown Source)
        ...
        ...
        (etc.)

26632587_2 - This sample failed at the gCNV step:

[2024-05-19 19:17:35,37] [info] WorkflowManagerActor: Workflow 444fb56c-59fe-4d77-b3a3-11b58932f369 failed (during ExecutingWorkflowState): Job CNVGermlineCaseWorkflow.GermlineCNVCallerCaseMode:64:2 exited with return code 79 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /vf/users/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/cromwell-executions/GATKSVPipelineSingleSample/444fb56c-59fe-4d77-b3a3-11b58932f369/call-GatherBatchEvidence/GatherBatchEvidence/dd07ecc8-88af-4664-8886-1ed0b25beb62/call-gCNVCase/CNVGermlineCaseWorkflow/ec39ed11-8a83-4a31-937e-ef81f84890d8/call-GermlineCNVCallerCaseMode/shard-64/attempt-2/execution/stderr.
 [First 3000 bytes]:INFO:    Using cached SIF image
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort

Second set (n=10 samples)

All 10 samples failed with similar errors. All stopped at the CombineBatches step of MakeCohortVcf. None of the subjobs failed, but the workflow stopped running.

Example error from /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/swarm_logs/swarm_26695501_0.o:

[2024-05-20 08:18:53,88] [info] WorkflowManagerActor: Workflow c4c13e9b-60b4-4680-bd27-f6a03930e2c3 failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs:
Bad output 'PullVcfShard.count': Failed to read_int("count.txt") (reason 1 of 1): Future timed out after [60 seconds]
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:1040)
        at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I reviewed the other swarm logs, which all had similar errors: grep "Future" swarm_26695501_* | less -S

LauraEgolf commented 2 months ago

Test multiple runs again with unique cromwell database

Ben developed a new config that creates a unique cromwell database per swarm job (cromwell-executions/cromwell-db/cromwell-db-$SLURM_JOB_ID), which should avoid the problem of clashing database writes without keeping the database in memory.

I copied this config to /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id.conf.
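Once jobs from the new swarm (set up below) start running, a quick way to confirm the config is taking effect (path from the description above; run from the working directory that holds cromwell-executions):

# Each running swarm job should get its own database directory
ls -d cromwell-executions/cromwell-db/cromwell-db-*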

Create a new swarm file:

sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt 
cat $sample_list | while read -r samplename; do
  echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID.swarm
done

I also increased the threads per swarm job to 32 because the swarm jobs have been using more than the 8 allocated CPUs. New swarm header:

#SWARM --logdir swarm_logs
#SWARM --threads-per-process 32
#SWARM --gb-per-process 50
#SWARM --time 24:00:00
#SWARM --module cromwell,singularity,GATK
#SWARM --sbatch "--export SINGULARITY_CACHEDIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/singularity_cache"

Tested two samples; both stalled due to a check-alive issue, addressed by the config change below.

Update config to use squeue to check if job is alive

Switched to a new config file, biowulf-cromwelldb-slurm-id-checkalive.conf, which uses squeue instead of dashboard_cli to check whether jobs are alive; this avoids the hanging issue.

cat $sample_list | while read -r samplename; do
  echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id-checkalive.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID_checkAlive.swarm
done

Test runs and errors

Submitted 3 samples, which all succeeded.

Submitted 10 additional samples. Six succeeded, but four failed during GenotypeBatch because the Docker Hub pull rate limit was exceeded.

Example (swarm_27104311_5.o):

Job GenotypePESRPart2.CatFilesPass:NA:2 exited with return code 79 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /vf/users/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/cromwell-executions/GATKSVPipelineSingleSample/f422ae72-bcea-42a1-b114-3d3d241aa0d9/call-GenotypeBatch/GenotypeBatch/3f7d6868-127d-4bc8-8a3e-df2d889e5ef5/call-GenotypePESRPart2/GenotypePESRPart2/58cdc2f0-ac3b-4a59-aa4d-c83ba377b9ed/call-CatFilesPass/attempt-2/execution/stderr.
 [First 3000 bytes]:FATAL:   Unable to handle docker://ubuntu:18.04 uri: failed to get checksum for docker://ubuntu:18.04: reading manifest 18.04 in docker.io/library/ubuntu: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
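One possible mitigation (untested here) is to pre-pull the image once into the shared Singularity cache so subsequent tasks reuse the cached SIF instead of hitting Docker Hub:

sinteractive   # avoid running this on a login node
module load singularity
export SINGULARITY_CACHEDIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/singularity_cache
singularity pull docker://ubuntu:18.04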

Additionally, I again received "Short job" warnings from Biowulf:

SHORT JOBS:  13872 [median runtime 0.1 minutes]
From:        2024-05-24 09:02:08
Until        2024-05-25 09:02:08

Scheduling each job has a fixed overhead, so all things being equal
Slurm works more efficiently if the same work is done in fewer jobs
with longer runtimes. Please adjust your workflows to generate jobs
that run for no less than 10-15 minutes, **if possible**.

(1) If your jobs are swarm jobs you can bundle future jobs to raise
the runtimes (swarm "-b" option).

(2) If your short jobs are running python or R it is better to
increase the work each job does rather than just bundling. For
example, if you are running simulations with different seeds, run
multiple simulations per subjob instead of a single simulation.

(3) If these jobs were submitted by a workflow manager consider
running the whole workflow or some of the tasks in local mode, group
tasks, or use other workflow-specific methods to reduce short
jobs.

Runtime

Of the 9 samples that succeeded, total runtimes were ~1-2 days (much longer than in previous tests). Runtime seems to vary with how busy Biowulf is, because Cromwell spins up many short jobs that have to wait in the queue. I recommend setting a walltime limit of at least 72 hours for the swarm.
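In swarm-header terms, that recommendation amounts to raising the --time line used earlier, e.g.:

#SWARM --time 72:00:00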

QC Checks

Will post results of QC review later.