LauraEgolf opened 6 months ago
I repeated the single-sample test to confirm that all the necessary changes are synced with GitHub, since some local changes hadn't been pushed previously. I used a new base directory, /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023, and cloned the repo from GitHub there. I created a new branch (osteo-testing) with a few changes.
The example inputs require the Broad resource files stored in /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/broad-data. Other than that, I think all the required files are synced with GitHub.
Set up new test run:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514/CCSS_1000278_A_newtest \
-a '{ "single_sample" : "test_single_sample_CCSS_1000278_A", "ref_panel" : "ref_panel_1kg" }'
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023
swarm run_gatk-sv_single_sample_no_melt_20240514.swarm
UPDATE: This test failed due to cromwell-related errors. I tested with a different config file (biowulf-swarm.conf instead of biowulf-core.conf), which resolved the problem.
The osteo-testing branch is up to date, and at least one test run succeeded on it - but other test runs failed, as described below.
Below is an example of the process I followed to set up multiple runs of the single-sample pipeline for the 95 osteo WGS samples. We'd want to streamline this process if we decide we'll be running the single-sample pipeline a lot (it's only applicable to small batches of samples - for batches of >100, we would use cohort mode).
Create csv file with sample names and BAM info:
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json
ls /data/DCEG_Pediatric_Sarcomas/GenCompass/pediatric_sarcoma_analysis_id/workflow_results/fq2bam/*/*.bam > bam_list.txt
echo "NAME,SAMPLE_ID,BAM_CRAM" > bam_info.csv
cat bam_list.txt | while read -r bampath; do
samplename=$(basename "$bampath" | sed 's/\.bam$//')
echo "${samplename}" >> sample_list.txt # Will use this sample list in later commands
echo "${samplename},${samplename},${bampath}" >> bam_info.csv
done
rm bam_list.txt
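For reference, sample_list.txt ends up with one sample name per line, and bam_info.csv has the header plus one row per sample, roughly like this (the BAM path shown is only illustrative of the fq2bam output layout, not an actual file):
$ head -n 2 bam_info.csv
NAME,SAMPLE_ID,BAM_CRAM
OSTE_OSETO0001579_A,OSTE_OSETO0001579_A,/data/DCEG_Pediatric_Sarcomas/GenCompass/pediatric_sarcoma_analysis_id/workflow_results/fq2bam/OSTE_OSETO0001579_A/OSTE_OSETO0001579_A.bam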
Run Ben's Python script (slightly modified) to create input json files:
# Create a copy of the values directory to output the new json files
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs
cp -r values values_20240514_multisample_test
cd test_create_multisample_json
python gatk-sv_batch_input.py --input bam_info.csv --template single_sample_input_template.json --output_directory ../values_20240514_multisample_test/
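A quick sanity check (not part of the original steps) is to confirm the script wrote one values file per sample; this assumes the files are named <sample>_input.json to match the aliases used below:
# Both counts should match
ls ../values_20240514_multisample_test/*_input.json | wc -l
wc -l sample_list.txt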
Run build_inputs.py (one command per sample) to create build files:
# Example for one sample:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values_20240514_multisample_test \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514_multisample_test/OSTE_OSETO0001579_A \
-a '{ "single_sample" : "OSTE_OSETO0001579_A_input", "ref_panel" : "ref_panel_1kg" }'
### Loop across multiple samples:
BASE_DIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv
sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt
cat $sample_list | while read -r samplename; do
alias_string='{ "single_sample" : "'${samplename}'_input", "ref_panel" : "ref_panel_1kg" }'
${BASE_DIR}/scripts/inputs/build_inputs.py \
${BASE_DIR}/inputs/values_20240514_multisample_test \
${BASE_DIR}/inputs/templates/test/GATKSVPipelineSingleSample \
${BASE_DIR}/inputs/build_20240514_multisample_test/${samplename} \
-a "$alias_string"
done
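A quick check that every build directory actually got the no_melt input JSON referenced by the swarm commands below (a small sketch, not part of the original process):
cat $sample_list | while read -r samplename; do
  f=${BASE_DIR}/inputs/build_20240514_multisample_test/${samplename}/GATKSVPipelineSingleSample.no_melt.json
  [ -s "$f" ] || echo "MISSING: ${samplename}"
done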
And, finally, write a swarm command for each sample:
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-core.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test.swarm
done
## UPDATE: use biowulf-swarm.conf instead of biowulf-core.conf (avoid cromwell errors caused by multiple jobs trying to write at the same time)
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-swarm.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig.swarm
done
# Manually added header to swarm file
## Submit first 3 samples in a test swarm
swarm run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig_first3.swarm
Using the swarm file I created (see previous comment), I ran two sets of samples on the single-sample pipeline. (I submitted the first set, then the second set several hours later).
run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig_first3.swarm (the first 3 samples)
run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfig_4to13.swarm (the next 10 samples)
Most of these samples failed (more details below). Of the 13 samples I submitted, only the first sample succeeded.
Also, Biowulf notified me of a "short job" warning (I think I would have received this warning even if the jobs had succeeded):
SHORT JOBS: 32451 [median runtime 0.1 minutes]
From: 2024-05-19 09:02:09
Until 2024-05-20 09:02:09
26632587_0 - Succeeded (11 hr 11 min). Note this was the same sample as in my original test.
26632587_1 - Canceled - Swarm job got stuck at AnnotateVcf and continued printing repetitive errors, even though no subjobs were running. There were no subjob failures.
This is the first error that appeared:
[2024-05-20 00:54:34,24] [error] WriteMetadataActor Failed to properly process data
java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.addBatch(Unknown Source)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.addBatch(HikariProxyPreparedStatement.java)
at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$19(JdbcActionComponent.scala:540)
at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$19$adapted(JdbcActionComponent.scala:538)
at scala.collection.immutable.VectorStatics$.foreachRec(Vector.scala:1895)
at scala.collection.immutable.Vector.foreach(Vector.scala:1901)
at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$18(JdbcActionComponent.scala:538)
at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement(JdbcBackend.scala:427)
at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement$(JdbcBackend.scala:422)
at slick.jdbc.JdbcBackend$BaseSession.withPreparedStatement(JdbcBackend.scala:491)
at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl.preparedInsert(JdbcActionComponent.scala:511)
at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.run(JdbcActionComponent.scala:536)
at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:28)
at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:25)
at slick.dbio.DBIOAction$$anon$1.$anonfun$run$1(DBIOAction.scala:186)
at scala.collection.immutable.Vector.foreach(Vector.scala:1895)
at slick.dbio.DBIOAction$$anon$1.run(DBIOAction.scala:186)
at slick.dbio.DBIOAction$$anon$1.run(DBIOAction.scala:183)
at slick.dbio.SynchronousDatabaseAction$$anon$7.run(DBIOAction.scala:486)
at slick.basic.BasicBackend$DatabaseDef$$anon$3.liftedTree1$1(BasicBackend.scala:276)
at slick.basic.BasicBackend$DatabaseDef$$anon$3.run(BasicBackend.scala:276)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.SessionData.allocateLobForResult(Unknown Source)
at org.hsqldb.Session.allocateResultLob(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.performPreExecute(Unknown Source)
... 24 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.hsqldb.persist.LobStoreMem.setBlockBytes(Unknown Source)
at org.hsqldb.persist.LobManager.setBytesISNormal(Unknown Source)
...
...
(etc.)
26632587_2 - This sample failed at the gCNV step:
[2024-05-19 19:17:35,37] [info] WorkflowManagerActor: Workflow 444fb56c-59fe-4d77-b3a3-11b58932f369 failed (during ExecutingWorkflowState): Job CNVGermlineCaseWorkflow.GermlineCNVCallerCaseMode:64:2 exited with return code 79 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /vf/users/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/cromwell-executions/GATKSVPipelineSingleSample/444fb56c-59fe-4d77-b3a3-11b58932f369/call-GatherBatchEvidence/GatherBatchEvidence/dd07ecc8-88af-4664-8886-1ed0b25beb62/call-gCNVCase/CNVGermlineCaseWorkflow/ec39ed11-8a83-4a31-937e-ef81f84890d8/call-GermlineCNVCallerCaseMode/shard-64/attempt-2/execution/stderr.
[First 3000 bytes]:INFO: Using cached SIF image
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
All 10 samples failed with similar errors. All stopped at the CombineBatches step of MakeCohortVcf. None of the subjobs failed, but the workflow stopped running.
Example error from /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/swarm_logs/swarm_26695501_0.o:
[2024-05-20 08:18:53,88] [info] WorkflowManagerActor: Workflow c4c13e9b-60b4-4680-bd27-f6a03930e2c3 failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs:
Bad output 'PullVcfShard.count': Failed to read_int("count.txt") (reason 1 of 1): Future timed out after [60 seconds]
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:1040)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I reviewed the other swarm logs, which all had similar errors: grep "Future" swarm_26695501_* | less -S
Ben developed a new config that creates a unique cromwell database per swarm job (cromwell-executions/cromwell-db/cromwell-db-$SLURM_JOB_ID), which should avoid the clashing database writes without relying on an in-memory database.
I copied this config to /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id.conf.
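I haven't pasted the new config here, but the relevant change is in the database block: instead of Cromwell's default in-memory HSQLDB, it points at a file-backed HSQLDB whose path includes the SLURM job ID. Based on Cromwell's documented file-backed HSQLDB setup, it presumably looks roughly like this (a sketch; the actual file may differ in the extra settings):
database {
  profile = "slick.jdbc.HsqldbProfile$"
  db {
    driver = "org.hsqldb.jdbcDriver"
    # One database per swarm job, so concurrent cromwell instances never write to the same DB
    url = "jdbc:hsqldb:file:cromwell-executions/cromwell-db/cromwell-db-"${SLURM_JOB_ID}";shutdown=false;hsqldb.tx=mvcc"
    connectionTimeout = 120000
  }
}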
Create a new swarm file:
sample_list=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/test_create_multisample_json/sample_list.txt
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID.swarm
done
I also increased the threads per swarm job to 32 because the swarm jobs have been using more than the 8 allocated CPUs. New swarm header:
#SWARM --logdir swarm_logs
#SWARM --threads-per-process 8
#SWARM --gb-per-process 50
#SWARM --time 24:00:00
#SWARM --module cromwell,singularity,GATK
#SWARM --sbatch "--export SINGULARITY_CACHEDIR=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/singularity_cache"
Tested two samples; both stalled due to a check-alive issue.
Switched to a new config file to avoid the job-hanging issue. It uses squeue instead of dashboard_cli to check job status:
biowulf-cromwelldb-slurm-id-checkalive.conf
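I haven't included the config contents, but the change amounts to overriding the SLURM backend's job-status commands. Following Cromwell's documented SLURM backend configuration, the relevant lines are presumably along these lines (a sketch; the provider name and surrounding block in the actual file may differ):
# Inside the SLURM backend provider's config block:
# poll SLURM directly rather than Biowulf's dashboard_cli, which was intermittently hanging
check-alive = "squeue -j ${job_id}"
kill = "scancel ${job_id}"
job-id-regex = "Submitted batch job (\\d+).*"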
cat $sample_list | while read -r samplename; do
echo "java -Dconfig.file=/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/config/biowulf-cromwelldb-slurm-id-checkalive.conf -jar \$CROMWELL_JAR run /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/wdl/GATKSVPipelineSingleSample.wdl -o /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/options.json -i /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/gatk-sv/inputs/build_20240514_multisample_test/"${samplename}"/GATKSVPipelineSingleSample.no_melt.json" >> run_gatk-sv_single_sample_no_melt_20240514_multisample_test_newConfigSlurmID_checkAlive.swarm
done
Submitted 3 samples, which all succeeded.
Submitted 10 additional samples. 6 succeeded, but 4 failed during GenotypeBatch because the Docker Hub pull rate limit was exceeded.
Example (swarm_27104311_5.o):
Job GenotypePESRPart2.CatFilesPass:NA:2 exited with return code 79 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /vf/users/DCEG_Pediatric_Sarcomas/StructuralVariantCalling_OsteoWGS2023/cromwell-executions/GATKSVPipelineSingleSample/f422ae72-bcea-42a1-b114-3d3d241aa0d9/call-GenotypeBatch/GenotypeBatch/3f7d6868-127d-4bc8-8a3e-df2d889e5ef5/call-GenotypePESRPart2/GenotypePESRPart2/58cdc2f0-ac3b-4a59-aa4d-c83ba377b9ed/call-CatFilesPass/attempt-2/execution/stderr.
[First 3000 bytes]:FATAL: Unable to handle docker://ubuntu:18.04 uri: failed to get checksum for docker://ubuntu:18.04: reading manifest 18.04 in docker.io/library/ubuntu: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
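One possible mitigation for future runs (not something I've tested here) would be to authenticate Singularity's docker:// pulls so they count against an authenticated Docker Hub rate limit rather than the anonymous one, e.g. by exporting credentials before submitting the swarm:
# Hypothetical: credentials for Singularity's docker:// pulls (raises the Docker Hub rate limit)
export SINGULARITY_DOCKER_USERNAME=<dockerhub-username>
export SINGULARITY_DOCKER_PASSWORD=<dockerhub-access-token>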
Additionally, I again received "Short job" warnings from Biowulf:
SHORT JOBS: 13872 [median runtime 0.1 minutes]
From: 2024-05-24 09:02:08
Until 2024-05-25 09:02:08
Scheduling each job has a fixed overhead, so all things being equal
Slurm works more efficiently if the same work is done in fewer jobs
with longer runtimes. Please adjust your workflows to generate jobs
that run for no less than 10-15 minutes, **if possible**.
(1) If your jobs are swarm jobs you can bundle future jobs to raise
the runtimes (swarm "-b" option).
(2) If your short jobs are running python or R it is better to
increase the work each job does rather than just bundling. For
example, if you are running simulations with different seeds, run
multiple simulations per subjob instead of a single simulation.
(3) If these jobs were submitted by a workflow manager consider
running the whole workflow or some of the tasks in local mode, group
tasks, or use other workflow-specific methods to reduce short
jobs.
For the 9 samples that succeeded, total runtimes were ~1-2 days (much longer than previous tests). Runtime seems to vary with how busy Biowulf is, because cromwell spins up many short jobs that each have to wait in the queue. I recommend setting a walltime limit of at least 72 hours for the swarm.
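In the swarm header that just means bumping the time limit, e.g.:
#SWARM --time 72:00:00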
Will post results of QC review later.
Independent user test of GATK-SV single-sample pipeline on Biowulf
Ben previously configured the GATK-SV single-sample pipeline for Biowulf and tested using a COVNET WGS sample. I ran an independent test using one of the osteosarcoma WGS samples, CCSS_1000278_A.
The single-sample pipeline can process a single test sample jointly with a reference panel. It reduces computational time somewhat by using certain precomputed inputs, but this mode is still much less computationally efficient than the cohort/batch mode (which is best for 100+ samples). Here we used the reference panel of 156 samples from 1000 Genomes that is provided with GATK-SV (the same panel is used in the example Terra workspace).
Note that, in my experience, the single-sample pipeline generally only works for PCR-free WGS samples.
Working directory:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/
Build inputs
Help message for build_inputs.py:
Example from GATK-SV documentation:
Ben's example:
Setup for my test run:
Run pipeline and troubleshoot
First test run:
Got an error with the GATK jar. Added GATK to the modules list (in the swarm command) and re-ran; still got the same error.
Commented this line out of all GATK-SV WDLs (under /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv): export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override}. This solved the GATK jar error.
Next got an error pulling the Google Container Registry ubuntu image; tested again after initializing gcloud, but got the same error.
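For reference, a bulk edit along these lines would comment that export out of every WDL (a sketch of one way to do it, not necessarily the exact commands used):
cd /data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv
# Prefix the GATK_LOCAL_JAR export with '#' in every WDL that contains it
grep -rl 'export GATK_LOCAL_JAR=' wdl/ | xargs sed -i 's/^\([[:space:]]*\)export GATK_LOCAL_JAR=/\1# export GATK_LOCAL_JAR=/'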
Update Docker inputs to avoid using GCR ubuntu container
Updated values file to avoid the problematic container:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/gatk-sv/inputs/values_20240405/dockers.json
Replaced: "linux_docker": "marketplace.gcr.io/google/ubuntu1804", With: "linux_docker": "ubuntu:18.04",
Re-ran build_inputs.py and submitted a new test run; the latest run succeeded.
Results
Runtime
Runtime estimate from the example Terra workspace (I assume this uses GCP preemptible instances):
Note the 18 GB file size is based on a ~30x CRAM file; here we used a 36x BAM file (77 GB).
Runtime on Biowulf: 9 hours 13 minutes (based on the main swarm job runtime, and the timestamps on the output directory).
Output structure
An explanation of the output can be found on the Terra workspace for the GATK-SV single-sample pipeline:
Results from this run:
/data/DCEG_Pediatric_Sarcomas/StructuralVariantCalling/run_gatksv_20240405/gatk-sv-results
QC checks
The GATK-SV single-sample pipeline has several built-in QC checks. This sample mostly passed the QC checks, which is a good sign (I've tried running this pipeline in the past, and my PCR+ samples failed the QC checks horribly). However, it was flagged for certain SV counts being outside the 'normal' ranges defined by the pipeline developers:
In particular, the count of deletions >100 kb is extremely high. I will investigate this further.
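As a first check, something like the following should reproduce that count directly from the output VCF, assuming the records carry SVTYPE and SVLEN INFO fields as GATK-SV VCFs normally do (if SVLEN is negative for deletions, compare against -100000 instead; the VCF path is illustrative):
module load bcftools
# Count deletion records longer than 100 kb in the final single-sample VCF (path shown is illustrative)
bcftools view -H -i 'INFO/SVTYPE="DEL" && INFO/SVLEN>100000' gatk-sv-results/CCSS_1000278_A.final.vcf.gz | wc -l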