Is the pipeline idempotent?

VJalili commented 3 years ago

There is a small discrepancy between multiple executions of GATKSVPipelineSingleSample on the same input with the same parameters. The discrepancy is observed in the following files:

FinalVCFCleanup.out
Module08Annotation.output_vcf
FilterSample.out
ResetPESRTGTOverdispersionFilter.out
ResetBothsidesSupportFilter.out
ResetHighSRBackgroundFilter.out
FilterVcfWithReferencePanelCalls.out
FilterVcfForCaseSampleGenotype.out
FilterVcfDepthLt5kb.out
GetUniqueNonGenotypedDepthCalls.out
Module0506.vcf
Module0506.complex_genotype_vcf
ConvertCNVsWithoutDepthSupportToBNDs.out_vcf
Module04.genotyped_pesr_vcf
Module04.genotyped_depth_vcf
RewriteSRCoords.annotated_vcf
FilterLargePESRCallsWithoutRawDepthSupport.out
MergePesrVcfs.merged_pesr_vcf
FilterWham.out
Module01.wham_vcf
Module00c.std_wham_vcf
Module00a.wham_vcf

It seems the discrepancy originates from the Whamg workflow. The workflow uses whamg (whamg source code), which seems to have some randomness involved (https://github.com/zeeev/wham/issues/51).

I ran the GATKSVPipelineSingleSample workflow multiple times without call-caching (i.e., setting "read_from_cache" : false and "write_to_cache" : false in the options files), and ran svtest vcf on the final VCFs of the separate executions comparing all to a common baseline. If the pipeline was idempotent, I would expect the same difference/similarity to the baseline in all the executions, however, the executions slightly differ from each other compared to the baseline. Some of the output metrics of svtest vcf comparing a test run with a given UUID vs. the baseline are as follows.

	5700c10e-8db3-477f-9ea0-2cef381a8108	829d3048-9a91-4fe3-ae5b-7549cd6e50f8	b6a0c977-8de6-481d-b58d-c9d6d182db2f
final_cleanup_vcf_DEL_tp	4236	4233	4242
final_cleanup_vcf_DEL_fp	9	13	10
final_cleanup_vcf_DEL_fn	21	26	18
final_cleanup_vcf_DUP_tp	1351	1356	1356
final_cleanup_vcf_DUP_fp	16	11	11
final_cleanup_vcf_DUP_fn	13	8	8
final_cleanup_vcf_INS_tp	2442	2446	2444
final_cleanup_vcf_INS_fp	5	5	3
final_cleanup_vcf_INS_fn	16	12	14
final_cleanup_vcf_INV_tp	9	9	9
final_cleanup_vcf_INV_fp	0	0	0
final_cleanup_vcf_INV_fn	0	0	0
final_cleanup_vcf_BND_tp	1397	1395	1397
final_cleanup_vcf_BND_fp	3	1	3
final_cleanup_vcf_BND_fn	5	7	4
final_cleanup_vcf_CPX_tp	83	84	84
final_cleanup_vcf_CPX_fp	14	13	12
final_cleanup_vcf_CPX_fn	1	0	0
final_cleanup_vcf_CNV_tp	198	198	198
final_cleanup_vcf_CNV_fp	0	0	0
final_cleanup_vcf_CNV_fn	0	0	0

mwalker174 commented 2 years ago

Also svtk's pe-test and sr-test tools do not set PRNG seeds.

epiercehoffman commented 1 year ago

Also missing random seed in svtk adjudicate for subsampling function https://github.com/broadinstitute/gatk-sv/blob/main/src/svtk/svtk/adjudicate/random_forest.py#L82 (and 86)

VJalili commented 1 year ago

The pipeline's Wham-related irreproducible output is related to using a random number generator with a floating seed in the Wham source code. The steps we took to fix this issue are:

We reported the issue on Wham's GitHub repository: https://github.com/zeeev/wham/issues/51
I made a pull request that sets a fixed seed to the random number generator: https://github.com/zeeev/wham/pull/62
I tried updating Wham's docker image to include a version of Wham with a fixed seed by updating Wham's Dockerfile:
- The current Docker image is built using a compiled version of Wham; hence I tried recompiling from https://github.com/VJalili/wham/tree/fix-seed, where the seed of the random number generator is fixed.
- The build process failed for reasons similar to those reported on Wham's GitHub page: https://github.com/zeeev/wham/issues/56
- The make process of Wham includes building bamtools from an older version, which fails to compile on Ubuntu 18 (and all the newer versions).
- The referenced version of bamtools seemingly depends on older versions of the built environment (e.g., gcc) that are not included in the Ubuntu 18 docker image. I explored options for installing older compiler versions, explicitly trying gcc-4.x and gcc-5, but bamtools failed to compile with these versions.
- I tried installing bamtools from apt (apt-get install -y libbamtools-dev) to avoid the build-related issues and removed bamtools from Wham's make process. Wham failed to compile with this approach.
- I tried replacing the cloned version of bamtools (fetched using git clone --recursive) with (a) current HEAD and (b) older commits (e.g., HEAD of the master branch in 2019). With this approach, bamtools compiles successfully, but Wham's make continues to fail.

broadinstitute / gatk-sv

Is the pipeline idempotent? #201