Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/
92 stars 15 forks source link

incorrect value or segfault for long-read data #59

Closed cheehongsg closed 8 months ago

cheehongsg commented 8 months ago

For long-read data, read contains Insertion and Deletion cigar operations. Consequently, original code assumption that the bases and quals will have the same element is broken. This manifests as either no crash but incorrect computed values as the qual is offset to unexpected memory location(s) or a crash if the unexpected memory location has already been freed.

Fix by adapting original function ParsePileupSeq() to ParsePileupSeqBasesOnly() to generate a copy of matching bases and quals following https://www.htslib.org/doc/samtools-mpileup.html#Pileup_Format

Griffan commented 8 months ago

@cheehongsg Thank you very much! This is indeed an important update for long reads' pileup files. Previously ReadPileup mainly served as a debugging function in short reads realm.

I believe you encounter this issue from real case you encountered? Would you mind to add this problematic pileup line(or together with a few lines before and after) as an extra resource file(as well as your commandline if possible) under directory 'resource/test/' ?

cheehongsg commented 8 months ago

Added resource/test/problematic.pileup

# command executed:
VerifyBamID \
--PileupFile resource/test/problematic.pileup \
--UDPath resource/1000g.phase3.100k.b38.vcf.gz.dat.UD \
--BedPath resource/1000g.phase3.100k.b38.vcf.gz.dat.bed \
--MeanPath resource/1000g.phase3.100k.b38.vcf.gz.dat.mu \
--Reference human_GRCh38_no_alt_analysis_set.fasta \
--Output resource/test/problematic.VerifyBamID \
; echo $?

# output produced:
VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using ancestry-agnostic method.

 Version:2.0.1
 Copyright (c) 2009-2020 by Hyun Min Kang and Fan Zhang
 This project is licensed under the terms of the MIT license.

The following parameters are available.  Ones with "[]" are in effect:

Available Options
                    Input/Output Files : --BamFile [Empty],
                                         --PileupFile [resource/test/problematic.pileup],
                                         --Reference [human_GRCh38_no_alt_analysis_set.fasta],
                                         --SVDPrefix [Empty],
                                         --Output [resource/test/problematic.VerifyBamID]
               Model Selection Options : --WithinAncestry,
                                         --DisableSanityCheck, --NumPC [2],
                                         --FixPC [Empty],
                                         --FixAlpha [-1.0e+00],
                                         --KnownAF [Empty], --NumThread [4],
                                         --Seed [12345], --Epsilon [1.0e-08],
                                         --OutputPileup, --Verbose
   Construction of SVD Auxiliary Files : --RefVCF [Empty]
                        Pileup Options : --min-BQ [13], --min-MQ [2],
                                         --adjust-MQ [40], --max-depth [8000],
                                         --no-orphans, --incl-flags [1040],
                                         --excl-flags [1796]
                    Deprecated Options : --UDPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.UD],
                                         --MeanPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.mu],
                                         --BedPath [resource/1000g.phase3.100k.b38.vcf.gz.dat.bed]

Initialize from FullLLKFunc(int dim, ContaminationEstimator* contPtr)
NOTICE - Number of marker in Reference Matrix:100000
NOTICE - Number of marker shared with input file:10048
NOTICE - Mean Depth:31.987261
NOTICE - SD Depth:-nan
NOTICE - 10048 SNP markers remained after sanity check.
NOTICE - Passing Marker Sanity Check...
Segmentation fault
139

After the fix:

# no segfault after the line "NOTICE - Passing Marker Sanity Check..."
# and the output is:
Estimation from OptimizeHeter:
Contaminating Sample PC1:-0.258868      PC2:1.22003
Intended Sample PC1:-0.00959451 PC2:-0.0093674
FREEMIX(Alpha):1.97957e-07
NOTICE - Success!
0
Griffan commented 8 months ago

@cheehongsg Thanks you very much!

In order to verify this change didn't affect the existing short read bam processing function, I added another unit test into the master.

Would you mind to compare your PR against branch 'cheehongsg/dev' so that I can run a local test before merging your branch.

Or you can pull the master and update your local branch with:

``` git merge master ``` and then push to your branch to refresh this PR?
cheehongsg commented 8 months ago

@Griffan Thanks for setting up the test. It caught a bug that I have introduced and I has fixed it.

# command:
make test

# output:
Running tests...
Test project testMerged/VerifyBamID/build
    Start 1: myTest1
1/4 Test #1: myTest1 ..........................   Passed    0.72 sec
    Start 2: myTest2
2/4 Test #2: myTest2 ..........................   Passed    0.00 sec
    Start 3: myTest3
3/4 Test #3: myTest3 ..........................   Passed    0.14 sec
    Start 4: myTest4
4/4 Test #4: myTest4 ..........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 4

Total Test time (real) =   0.87 sec

Let me know if it is still breaking the CI.

Cheers!

Griffan commented 8 months ago

@cheehongsg Thanks! It looks the CI is working properly.

One more minor suggestion if you don't mind: could you rename the pileup file to "test.LongRead.pileup" and subset it a bit to serve the testing purpose? For example, the unit test used the tiny pileup with only 15 markers.

VerifyBamID --DisableSanityCheck --PileupFile resource/test/expected/result.Pileup --SVDPrefix resource/test/hapmap_3.3.b37.dat --Reference resource/test/chr20.fa.gz --NumPC 2

Or you can compare your PR against branch 'cheehongsg/dev' so that I can do the local adjustments before merging your branch.

Thank you very much!

cheehongsg commented 8 months ago

@Griffan I have renamed the pileup file and subset the entries following your suggestions and your unit tests.

I have updated CMakeList.txt to include "myTest5" and "myTest6".

# Testing output:
Running tests...
Test project testMerged/VerifyBamID/build
    Start 1: myTest1
1/6 Test #1: myTest1 ..........................   Passed    1.20 sec
    Start 2: myTest2
2/6 Test #2: myTest2 ..........................   Passed    0.01 sec
    Start 3: myTest3
3/6 Test #3: myTest3 ..........................   Passed    0.14 sec
    Start 4: myTest4
4/6 Test #4: myTest4 ..........................   Passed    0.00 sec
    Start 5: myTest5
5/6 Test #5: myTest5 ..........................   Passed    0.28 sec
    Start 6: myTest6
6/6 Test #6: myTest6 ..........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) =   1.64 sec
Griffan commented 8 months ago

@Griffan I have renamed the pileup file and subset the entries following your suggestions and your unit tests.

I have updated CMakeList.txt to include "myTest5" and "myTest6".

# Testing output:
Running tests...
Test project testMerged/VerifyBamID/build
    Start 1: myTest1
1/6 Test #1: myTest1 ..........................   Passed    1.20 sec
    Start 2: myTest2
2/6 Test #2: myTest2 ..........................   Passed    0.01 sec
    Start 3: myTest3
3/6 Test #3: myTest3 ..........................   Passed    0.14 sec
    Start 4: myTest4
4/6 Test #4: myTest4 ..........................   Passed    0.00 sec
    Start 5: myTest5
5/6 Test #5: myTest5 ..........................   Passed    0.28 sec
    Start 6: myTest6
6/6 Test #6: myTest6 ..........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) =   1.64 sec

Fantastic! Thank you very much!