How does lumpy handle soft-clipped bases in a bam file?

xyw1 commented 5 years ago

Hello, I tried your tool on both a UMI-tagged bam file, in which the UMI sequences are soft-clipped, and a UMI-trimmed bam file. After trimming the UMI sequences, both the number of supporting reads (SU tag) and the run time increased dramatically. Is there any procedure that ignores reads according to their soft-clip length?

ryanlayer commented 5 years ago

How many bases are being trimmed?


xyw1 commented 5 years ago

21bp

ryanlayer commented 5 years ago

I am surprised that after trimming the run time increases. Can you explain to me how you trim?


xyw1 commented 5 years ago

I guess the run time increase is related to the increase in the number of split and discordant reads. What I care about more is why lumpy omits so many reads when they carry a soft-clipped UMI.

Here are the split and discordant read counts and the run times:

SampleID	reads_n	discordant_reads_n	split_mapping_reads_n	run_time_second
sample_1_full_length	2789185	349618	1080174	29.16
sample_1_trim_UMI_21	2005583	585602	1713312	172558.96
sample_1_trim_UMI_39	1987282	599890	1676850	159510.22
sample_2_full_length	2298425	371700	881921	37.71
sample_2_trim_UMI_21	97560	18449	47248	194.84
sample_2_trim_UMI_39	1734054	538511	1427644	123412.39
sample_3_full_length	146236	16200	44837	2.56
sample_3_trim_UMI_21	142262	31385	88046	372.59
sample_3_trim_UMI_39	141628	34751	86782	383.34
sample_4_full_length	7267	46	15	0.76
sample_4_trim_UMI_21	7220	53	28	0.68
sample_4_trim_UMI_39	7219	53	26	0.77

The read counts were calculated with samtools view $bam | wc -l.
splt.bam and disc.bam were generated using lumpy_filter.
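To compare samples of different sizes, the counts in the table can be expressed as fractions of the total read count. The snippet below is only a sketch: the `fractions` helper and the `rows` list are illustrative names, the numbers are copied from the sample_1 rows above, and it assumes that the line counts of the split and discordant files are directly comparable to the line count of the full bam file.

```python
# Counts copied from the table above (sample_1 rows only, for illustration).
rows = [
    ("sample_1_full_length", 2789185, 349618, 1080174),
    ("sample_1_trim_UMI_21", 2005583, 585602, 1713312),
]


def fractions(reads_n, disc_n, split_n):
    """Return (discordant fraction, split fraction) relative to all reads."""
    if reads_n <= 0:
        raise ValueError("reads_n must be positive")
    return disc_n / reads_n, split_n / reads_n


for name, reads_n, disc_n, split_n in rows:
    disc_frac, split_frac = fractions(reads_n, disc_n, split_n)
    print(f"{name}\tdiscordant={disc_frac:.3f}\tsplit={split_frac:.3f}")
```

Looking at fractions rather than raw counts makes it easier to see whether trimming changed the share of reads that lumpy_filter extracts, independent of how many reads survived remapping.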

And here is how I trimmed the UMIs.

UMI-tagged bam file

This is a UMI-tagged bam file; the soft-clipped bases are the UMI.

FS10000223:4:BNT40301-1434:1:1105:6740:3440     99      chr1    26767   0       21S130M =       26786   150     GGTACCCACATAAGGCGAACTCTCTTAGCAGAATGTGTGCCTCTCGGCCGGGCGCAGCGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCGAAGGCAGGCAGATCACCTGAGGTCGGGAGTTTGAGACCAGTCTGACCAACATGGTGAA FFFFFFFFFFF,FFFFFFFF,:::,::F:FFFFFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF NM:i:1  MD:Z:103C26     MC:Z:131M20S    AS:i:125        XS:i:125        RG:Z:CC_iSeq_Nov15_701
FS10000223:4:BNT40301-1434:1:1105:6740:3440     147     chr1    26786   0       131M20S =       26767   -150    CTCTCGGCCGGGCGCAGCGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCGAAGGCAGGCAGATCACCTGAGGTCGGGAGTTTGAGACCAGTCTGACCAACATGGTGAAACTCCATCTCTACTAAAAATGTTCGCCTTAATAGGTGGAG :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:,,FF::,FFFFFFF,FFFFFFFFF:F NM:i:1  MD:Z:84C46      MC:Z:21S130M    AS:i:126        XS:i:126        RG:Z:CC_iSeq_Nov15_701
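The soft-clip length of each record can be read directly from the CIGAR column (21S130M and 131M20S in the two records above). The following is a minimal sketch for checking how many bases are soft-clipped at each end of an alignment. The function name `soft_clip_lengths` is made up for this example, and it says nothing about how lumpy itself treats these bases.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths of a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips; ignore them here.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return 0, 0
    leading = ops[0][0] if ops[0][1] == "S" else 0
    # A CIGAR with a single operation cannot have a separate trailing clip.
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return leading, trailing


for cigar in ("21S130M", "131M20S", "130M"):
    print(cigar, soft_clip_lengths(cigar))
```

Note that the reverse-strand mate (flag 147) carries its clip at the end of the CIGAR, because the bam file stores the sequence in reference orientation.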

To trim the UMI sequences, I:

1. converted the bam file into fastq,
2. cut off 21 bp at the 5' end of each read, and
3. remapped the fastq to the genome, generating the UMI-trimmed bam file.
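The trimming step described above could look roughly like the sketch below. This is not the script that was actually used (the thread does not say which tool did the trimming); `trim_5prime` and `trim_fastq` are hypothetical helpers, and the sketch assumes plain four-line FASTQ records in which reads are stored in their original sequencing orientation, so that the UMI occupies the first 21 bases.

```python
def trim_5prime(seq, qual, n=21):
    """Remove the first n bases and their qualities from one read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return seq[n:], qual[n:]


def trim_fastq(lines, n=21):
    """Yield FASTQ lines with the first n bases of every read removed.

    Assumes four lines per record; an incomplete final record is dropped.
    """
    it = (line.rstrip("\n") for line in lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        seq, qual = trim_5prime(seq, qual, n)
        yield from (header, seq, plus, qual)


# Start of the first read shown above: 21 UMI bases followed by CTCTTAGCAG.
record = ["@read1", "GGTACCCACATAAGGCGAACTCTCTTAGCAG", "+", "F" * 31]
print("\n".join(trim_fastq(record)))
```

In the example records, the trimmed read in the UMI-trimmed bam file does start with CTCTTAGCAG, which matches removing the first 21 bases of the original read.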

UMI-trimmed bam file

FS10000223:4:BNT40301-1434:1:1116:13990:1480    65      chr1    43426   0       130M    chr2    29223469        0       TCATCTCAATAGATGCAGAAAAAGCATTAACAAAAGTAAACATTCTTTCATAATAAGACATCAGATAAAACAAATTAGGAATAGAAGGAATGTACCGCAACACAATAAAGGCCATATATAACAAGCCCAC      FFFFFF,FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      NM:i:0  MD:Z:130        MC:Z:18S62M50S  AS:i:130        XS:i:130        RG:Z:CC_iSeq_Nov15_701.UMI.trim_UMI     XA:Z:chr19,+85037,130M,0;chr15,-101947408,130M,1;
FS10000223:4:BNT40301-1434:1:1105:6740:3440     99      chr1    197289  0       130M    =       197308  149     CTCTTAGCAGAATGTGTGCCTCTCGGCCGGGCGCAGCGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCGAAGGCAGGCAGATCACCTGAGGTCGGGAGTTTGAGACCAGTCTGACCAACATGGTGAA      :::,::F:FFFFFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF      NM:i:1  MD:Z:103C26     MC:Z:130M       AS:i:125        XS:i:125        RG:Z:CC_iSeq_Nov15_701.UMI.trim_UMI

Using these two kinds of bam files, I ran lumpy like this:

lumpyexpress -B $bam -o $vcf

Taking one gene fusion as an example, the results are:

result of UMI-tagged bam file

chr2    29223529        14_1    N       N]chr2:42301391]        .       .       SVTYPE=BND;STRANDS=++:5;EVENT=14;MATEID=14_2;CIPOS=-9,75;CIEND=-7,84;CIPOS95=-1,24;CIEND95=0,24;IMPRECISE;SU=5;PE=5;SR=0        GT:SU:PE:SR     ./.:5:5:0
chr2    42301391        14_2    N       N]chr2:29223529]        .       .       SVTYPE=BND;STRANDS=++:5;SECONDARY;EVENT=14;MATEID=14_1;CIPOS=-7,84;CIEND=-9,75;CIPOS95=0,24;CIEND95=-1,24;IMPRECISE;SU=5;PE=5;SR=0      GT:SU:PE:SR     ./.:5:5:0

The number of supporting reads is 5.

result of UMI-trimmed bam file

chr2    29223530        13_1    N       N]chr2:42301392]        .       .       SVTYPE=BND;STRANDS=++:500166;EVENT=13;MATEID=13_2;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=500166;PE=136681;SR=363485     GT:SU:PE:SR     ./.:500166:136681:363485
chr2    42301392        13_2    N       N]chr2:29223530]        .       .       SVTYPE=BND;STRANDS=++:500166;SECONDARY;EVENT=13;MATEID=13_1;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=500166;PE=136681;SR=363485   GT:SU:PE:SR     ./.:500166:136681:363485

The number of supporting reads is 500166.
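The support counts quoted above come from the INFO column of each VCF record. A small sketch for pulling them out when comparing many calls is shown below. `parse_info` and `support_counts` are illustrative helpers written for this thread, not part of lumpy, and the INFO strings are shortened copies of the records above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def support_counts(info):
    """Return the SU, PE and SR values of one record as integers."""
    fields = parse_info(info)
    return {key: int(fields[key]) for key in ("SU", "PE", "SR")}


tagged = "SVTYPE=BND;STRANDS=++:5;EVENT=14;MATEID=14_2;IMPRECISE;SU=5;PE=5;SR=0"
trimmed = "SVTYPE=BND;STRANDS=++:500166;EVENT=13;MATEID=13_2;SU=500166;PE=136681;SR=363485"

for label, info in (("UMI-tagged", tagged), ("UMI-trimmed", trimmed)):
    print(label, support_counts(info))
```

In both records SU equals PE + SR, so the jump in support after trimming comes from both paired-end and split-read evidence, with split reads contributing the larger share.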

I can't upload the bam files here because of the file size limit. If you need them, please leave your email address.
