broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License
170 stars 70 forks source link

Is the pipeline idempotent? #201

Open VJalili opened 3 years ago

VJalili commented 3 years ago

There is a small discrepancy between multiple executions of GATKSVPipelineSingleSample on the same input with the same parameters. The discrepancy is observed in the following files:

FinalVCFCleanup.out
Module08Annotation.output_vcf
FilterSample.out
ResetPESRTGTOverdispersionFilter.out
ResetBothsidesSupportFilter.out
ResetHighSRBackgroundFilter.out
FilterVcfWithReferencePanelCalls.out
FilterVcfForCaseSampleGenotype.out
FilterVcfDepthLt5kb.out
GetUniqueNonGenotypedDepthCalls.out
Module0506.vcf
Module0506.complex_genotype_vcf
ConvertCNVsWithoutDepthSupportToBNDs.out_vcf
Module04.genotyped_pesr_vcf
Module04.genotyped_depth_vcf
RewriteSRCoords.annotated_vcf
FilterLargePESRCallsWithoutRawDepthSupport.out
MergePesrVcfs.merged_pesr_vcf
FilterWham.out
Module01.wham_vcf
Module00c.std_wham_vcf
Module00a.wham_vcf

It seems the discrepancy originates from the Whamg workflow. The workflow uses whamg (whamg source code), which seems to have some randomness involved (https://github.com/zeeev/wham/issues/51).

I ran the GATKSVPipelineSingleSample workflow multiple times without call-caching (i.e., setting "read_from_cache" : false and "write_to_cache" : false in the options files), and ran svtest vcf on the final VCFs of the separate executions comparing all to a common baseline. If the pipeline was idempotent, I would expect the same difference/similarity to the baseline in all the executions, however, the executions slightly differ from each other compared to the baseline. Some of the output metrics of svtest vcf comparing a test run with a given UUID vs. the baseline are as follows.

5700c10e-8db3-477f-9ea0-2cef381a8108 829d3048-9a91-4fe3-ae5b-7549cd6e50f8 b6a0c977-8de6-481d-b58d-c9d6d182db2f
final_cleanup_vcf_DEL_tp 4236 4233 4242
final_cleanup_vcf_DEL_fp 9 13 10
final_cleanup_vcf_DEL_fn 21 26 18
final_cleanup_vcf_DUP_tp 1351 1356 1356
final_cleanup_vcf_DUP_fp 16 11 11
final_cleanup_vcf_DUP_fn 13 8 8
final_cleanup_vcf_INS_tp 2442 2446 2444
final_cleanup_vcf_INS_fp 5 5 3
final_cleanup_vcf_INS_fn 16 12 14
final_cleanup_vcf_INV_tp 9 9 9
final_cleanup_vcf_INV_fp 0 0 0
final_cleanup_vcf_INV_fn 0 0 0
final_cleanup_vcf_BND_tp 1397 1395 1397
final_cleanup_vcf_BND_fp 3 1 3
final_cleanup_vcf_BND_fn 5 7 4
final_cleanup_vcf_CPX_tp 83 84 84
final_cleanup_vcf_CPX_fp 14 13 12
final_cleanup_vcf_CPX_fn 1 0 0
final_cleanup_vcf_CNV_tp 198 198 198
final_cleanup_vcf_CNV_fp 0 0 0
final_cleanup_vcf_CNV_fn 0 0 0
mwalker174 commented 2 years ago

Also svtk's pe-test and sr-test tools do not set PRNG seeds.

epiercehoffman commented 1 year ago

Also missing random seed in svtk adjudicate for subsampling function https://github.com/broadinstitute/gatk-sv/blob/main/src/svtk/svtk/adjudicate/random_forest.py#L82 (and 86)

VJalili commented 1 year ago

The pipeline's Wham-related irreproducible output is related to using a random number generator with a floating seed in the Wham source code. The steps we took to fix this issue are: