Closed cheehongsg closed 8 months ago
@cheehongsg Thank you very much! This is indeed an important update for long reads' pileup files. Previously ReadPileup mainly served as a debugging function in short reads realm.
I believe you encounter this issue from real case you encountered? Would you mind to add this problematic pileup line(or together with a few lines before and after) as an extra resource file(as well as your commandline if possible) under directory 'resource/test/' ?
Added resource/test/problematic.pileup
# command executed:
VerifyBamID \
--PileupFile resource/test/problematic.pileup \
--UDPath resource/1000g.phase3.100k.b38.vcf.gz.dat.UD \
--BedPath resource/1000g.phase3.100k.b38.vcf.gz.dat.bed \
--MeanPath resource/1000g.phase3.100k.b38.vcf.gz.dat.mu \
--Reference human_GRCh38_no_alt_analysis_set.fasta \
--Output resource/test/problematic.VerifyBamID \
; echo $?
# output produced:
VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using ancestry-agnostic method.
Version:2.0.1
Copyright (c) 2009-2020 by Hyun Min Kang and Fan Zhang
This project is licensed under the terms of the MIT license.
The following parameters are available. Ones with "[]" are in effect:
Available Options
Input/Output Files : --BamFile [Empty],
--PileupFile [resource/test/problematic.pileup],
--Reference [human_GRCh38_no_alt_analysis_set.fasta],
--SVDPrefix [Empty],
--Output [resource/test/problematic.VerifyBamID]
Model Selection Options : --WithinAncestry,
--DisableSanityCheck, --NumPC [2],
--FixPC [Empty],
--FixAlpha [-1.0e+00],
--KnownAF [Empty], --NumThread [4],
--Seed [12345], --Epsilon [1.0e-08],
--OutputPileup, --Verbose
Construction of SVD Auxiliary Files : --RefVCF [Empty]
Pileup Options : --min-BQ [13], --min-MQ [2],
--adjust-MQ [40], --max-depth [8000],
--no-orphans, --incl-flags [1040],
--excl-flags [1796]
Deprecated Options : --UDPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.UD],
--MeanPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.mu],
--BedPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.bed]
Initialize from FullLLKFunc(int dim, ContaminationEstimator* contPtr)
NOTICE - Number of marker in Reference Matrix:100000
NOTICE - Number of marker shared with input file:10048
NOTICE - Mean Depth:31.987261
NOTICE - SD Depth:-nan
NOTICE - 10048 SNP markers remained after sanity check.
NOTICE - Passing Marker Sanity Check...
Segmentation fault
139
After the fix:
# no segfault after the line "NOTICE - Passing Marker Sanity Check..."
# and the output is:
Estimation from OptimizeHeter:
Contaminating Sample PC1:-0.258868 PC2:1.22003
Intended Sample PC1:-0.00959451 PC2:-0.0093674
FREEMIX(Alpha):1.97957e-07
NOTICE - Success!
0
@cheehongsg Thanks you very much!
In order to verify this change didn't affect the existing short read bam processing function, I added another unit test into the master.
Would you mind to compare your PR against branch 'cheehongsg/dev' so that I can run a local test before merging your branch.
Or you can pull the master and update your local branch with:
@Griffan Thanks for setting up the test. It caught a bug that I have introduced and I has fixed it.
# command:
make test
# output:
Running tests...
Test project testMerged/VerifyBamID/build
Start 1: myTest1
1/4 Test #1: myTest1 .......................... Passed 0.72 sec
Start 2: myTest2
2/4 Test #2: myTest2 .......................... Passed 0.00 sec
Start 3: myTest3
3/4 Test #3: myTest3 .......................... Passed 0.14 sec
Start 4: myTest4
4/4 Test #4: myTest4 .......................... Passed 0.00 sec
100% tests passed, 0 tests failed out of 4
Total Test time (real) = 0.87 sec
Let me know if it is still breaking the CI.
Cheers!
@cheehongsg Thanks! It looks the CI is working properly.
One more minor suggestion if you don't mind: could you rename the pileup file to "test.LongRead.pileup" and subset it a bit to serve the testing purpose? For example, the unit test used the tiny pileup with only 15 markers.
VerifyBamID --DisableSanityCheck --PileupFile resource/test/expected/result.Pileup --SVDPrefix resource/test/hapmap_3.3.b37.dat --Reference resource/test/chr20.fa.gz --NumPC 2
Or you can compare your PR against branch 'cheehongsg/dev' so that I can do the local adjustments before merging your branch.
Thank you very much!
@Griffan I have renamed the pileup file and subset the entries following your suggestions and your unit tests.
I have updated CMakeList.txt to include "myTest5" and "myTest6".
# Testing output:
Running tests...
Test project testMerged/VerifyBamID/build
Start 1: myTest1
1/6 Test #1: myTest1 .......................... Passed 1.20 sec
Start 2: myTest2
2/6 Test #2: myTest2 .......................... Passed 0.01 sec
Start 3: myTest3
3/6 Test #3: myTest3 .......................... Passed 0.14 sec
Start 4: myTest4
4/6 Test #4: myTest4 .......................... Passed 0.00 sec
Start 5: myTest5
5/6 Test #5: myTest5 .......................... Passed 0.28 sec
Start 6: myTest6
6/6 Test #6: myTest6 .......................... Passed 0.00 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1.64 sec
@Griffan I have renamed the pileup file and subset the entries following your suggestions and your unit tests.
I have updated CMakeList.txt to include "myTest5" and "myTest6".
# Testing output: Running tests... Test project testMerged/VerifyBamID/build Start 1: myTest1 1/6 Test #1: myTest1 .......................... Passed 1.20 sec Start 2: myTest2 2/6 Test #2: myTest2 .......................... Passed 0.01 sec Start 3: myTest3 3/6 Test #3: myTest3 .......................... Passed 0.14 sec Start 4: myTest4 4/6 Test #4: myTest4 .......................... Passed 0.00 sec Start 5: myTest5 5/6 Test #5: myTest5 .......................... Passed 0.28 sec Start 6: myTest6 6/6 Test #6: myTest6 .......................... Passed 0.00 sec 100% tests passed, 0 tests failed out of 6 Total Test time (real) = 1.64 sec
Fantastic! Thank you very much!
For long-read data, read contains Insertion and Deletion cigar operations. Consequently, original code assumption that the bases and quals will have the same element is broken. This manifests as either no crash but incorrect computed values as the qual is offset to unexpected memory location(s) or a crash if the unexpected memory location has already been freed.
Fix by adapting original function ParsePileupSeq() to ParsePileupSeqBasesOnly() to generate a copy of matching bases and quals following https://www.htslib.org/doc/samtools-mpileup.html#Pileup_Format