arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery
MIT License

How does lumpy handle soft-clipped bases in bam file? #280

Open xyw1 opened 5 years ago

xyw1 commented 5 years ago

Hello, I tried your tool on both a UMI-tagged BAM file, in which the UMI sequences are soft-clipped, and a UMI-trimmed BAM file. After trimming the UMI sequences, both the number of supporting reads (SU tag) and the run time increase dramatically. Is there any step in lumpy that ignores reads based on their soft-clip length?

ryanlayer commented 5 years ago

How many bases are being trimmed?


xyw1 commented 5 years ago

21bp

ryanlayer commented 5 years ago

I am surprised that after trimming the run time increases. Can you explain to me how you trim?


xyw1 commented 5 years ago

I guess the run time increase is related to the increase in the number of split and discordant reads. The problem I care more about is why lumpy omits so many reads when they are tagged with a UMI.

Here are the split and discordant read counts and the run times:

SampleID              reads_n  discordant_reads_n  split_mapping_reads_n  run_time_seconds
sample_1_full_length  2789185              349618                1080174             29.16
sample_1_trim_UMI_21  2005583              585602                1713312         172558.96
sample_1_trim_UMI_39  1987282              599890                1676850         159510.22
sample_2_full_length  2298425              371700                 881921             37.71
sample_2_trim_UMI_21    97560               18449                  47248            194.84
sample_2_trim_UMI_39  1734054              538511                1427644         123412.39
sample_3_full_length   146236               16200                  44837              2.56
sample_3_trim_UMI_21   142262               31385                  88046            372.59
sample_3_trim_UMI_39   141628               34751                  86782            383.34
sample_4_full_length     7267                  46                     15              0.76
sample_4_trim_UMI_21     7220                  53                     28              0.68
sample_4_trim_UMI_39     7219                  53                     26              0.77
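
The counts above are easier to compare as fractions of all reads in each BAM file. The following is a minimal sketch of that arithmetic for two of the sample_1 rows; the numbers are copied from the table, and the helper name `signal_fraction` is made up for illustration.

```python
# Counts copied from the table above:
# (reads_n, discordant_reads_n, split_mapping_reads_n)
counts = {
    "sample_1_full_length": (2789185, 349618, 1080174),
    "sample_1_trim_UMI_21": (2005583, 585602, 1713312),
}


def signal_fraction(reads_n, discordant_n, split_n):
    """Return (discordant fraction, split fraction) relative to all reads."""
    return discordant_n / reads_n, split_n / reads_n


for name, row in counts.items():
    disc, split = signal_fraction(*row)
    print(f"{name}: discordant={disc:.1%} split={split:.1%}")
```

For sample_1, both fractions roughly double after trimming, even though the total read count drops.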


Here is how I trimmed the UMIs:

UMI-tagged bamfile

This is an excerpt from a UMI-tagged BAM file; the soft-clipped bases are the UMI.

FS10000223:4:BNT40301-1434:1:1105:6740:3440     99      chr1    26767   0       21S130M =       26786   150     GGTACCCACATAAGGCGAACTCTCTTAGCAGAATGTGTGCCTCTCGGCCGGGCGCAGCGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCGAAGGCAGGCAGATCACCTGAGGTCGGGAGTTTGAGACCAGTCTGACCAACATGGTGAA FFFFFFFFFFF,FFFFFFFF,:::,::F:FFFFFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF NM:i:1  MD:Z:103C26     MC:Z:131M20S    AS:i:125        XS:i:125        RG:Z:CC_iSeq_Nov15_701
FS10000223:4:BNT40301-1434:1:1105:6740:3440     147     chr1    26786   0       131M20S =       26767   -150    CTCTCGGCCGGGCGCAGCGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCGAAGGCAGGCAGATCACCTGAGGTCGGGAGTTTGAGACCAGTCTGACCAACATGGTGAAACTCCATCTCTACTAAAAATGTTCGCCTTAATAGGTGGAG :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:,,FF::,FFFFFFF,FFFFFFFFF:F NM:i:1  MD:Z:84C46      MC:Z:21S130M    AS:i:126        XS:i:126        RG:Z:CC_iSeq_Nov15_701
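
For reference, the soft-clip lengths in records like these can be read directly from the CIGAR column (column 6 of a SAM record). Below is a minimal sketch for extracting them, which could be used to tally how many reads carry a clip of a given length. It says nothing about how lumpy itself treats soft clips, and `soft_clip_lengths` is a hypothetical helper, not part of lumpy or samtools.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths of a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside the soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return 0, 0
    leading = ops[0][0] if ops[0][1] == "S" else 0
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return leading, trailing


# CIGAR strings taken from the two records above.
print(soft_clip_lengths("21S130M"))
print(soft_clip_lengths("131M20S"))
```

For the two records shown, the first read has a 21-base leading clip and its mate a 20-base trailing clip.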

In order to trim UMI sequences, I

lumpyexpress -B $bam -o $vcf

Taking one gene fusion as an example, the results are:

Result for the UMI-tagged BAM file

chr2    29223529        14_1    N       N]chr2:42301391]        .       .       SVTYPE=BND;STRANDS=++:5;EVENT=14;MATEID=14_2;CIPOS=-9,75;CIEND=-7,84;CIPOS95=-1,24;CIEND95=0,24;IMPRECISE;SU=5;PE=5;SR=0        GT:SU:PE:SR     ./.:5:5:0
chr2    42301391        14_2    N       N]chr2:29223529]        .       .       SVTYPE=BND;STRANDS=++:5;SECONDARY;EVENT=14;MATEID=14_1;CIPOS=-7,84;CIEND=-9,75;CIPOS95=0,24;CIEND95=-1,24;IMPRECISE;SU=5;PE=5;SR=0      GT:SU:PE:SR     ./.:5:5:0

The number of supporting reads (SU) is 5.

Result for the UMI-trimmed BAM file

chr2    29223530        13_1    N       N]chr2:42301392]        .       .       SVTYPE=BND;STRANDS=++:500166;EVENT=13;MATEID=13_2;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=500166;PE=136681;SR=363485     GT:SU:PE:SR     ./.:500166:136681:363485
chr2    42301392        13_2    N       N]chr2:29223530]        .       .       SVTYPE=BND;STRANDS=++:500166;SECONDARY;EVENT=13;MATEID=13_1;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=500166;PE=136681;SR=363485   GT:SU:PE:SR     ./.:500166:136681:363485

The number of supporting reads (SU) is 500166.
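
The support counts can be pulled out of the INFO column to compare the two calls side by side. This is a minimal sketch using plain string handling; the INFO strings are copied from the first record of each result above, and `support_counts` is a hypothetical helper name.

```python
def support_counts(info):
    """Extract SU, PE and SR from a VCF INFO string as integers."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip flags without a value, such as IMPRECISE or SECONDARY
            fields[key] = value
    return {key: int(fields[key]) for key in ("SU", "PE", "SR")}


# INFO columns copied from the first record of each result above.
tagged = ("SVTYPE=BND;STRANDS=++:5;EVENT=14;MATEID=14_2;CIPOS=-9,75;"
          "CIEND=-7,84;CIPOS95=-1,24;CIEND95=0,24;IMPRECISE;SU=5;PE=5;SR=0")
trimmed = ("SVTYPE=BND;STRANDS=++:500166;EVENT=13;MATEID=13_2;CIPOS=0,0;"
           "CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=500166;PE=136681;SR=363485")

for label, info in (("UMI-tagged", tagged), ("UMI-trimmed", trimmed)):
    c = support_counts(info)
    # In both records SU equals PE + SR.
    print(label, c, c["SU"] == c["PE"] + c["SR"])
```

In the UMI-tagged call all support comes from paired-end evidence (SR=0), whereas in the UMI-trimmed call most of it comes from split reads.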

I can't upload the BAM files here because of the file size limit. If you need them, please leave your email address.