fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
314 stars 67 forks source link

unwanted trimming of overhanging reads during CallMolecularConsensusReads #526

Closed tomSimonet closed 5 years ago

tomSimonet commented 5 years ago

Hello, I'am using GroupReadsByUmi followed by CallMolecularConsensusReads on personal paired-end data; in some amplicons, paired reads overlap and even overhang, eg the 3'end of the reverse read ends before the 5' end of the forward read : .....-----> <--------- this is potentially due to primer trimming on F1 read, in the fastq; in these cases, CallMolecularConsensusReads trim the reads to just the overlapping part: .....-----> .....<----- the pb is that this makes me loosing some wanted dna variant is the following variant calling;

thanks!!

cmd and example:

1) GroupReadsByUmi : java -jar ./fgbio-1.0.0.jar GroupReadsByUmi --input=S1_setMQ.bam --output=S1_umiGrp.bam --family-size-histogram=S1_hist.txt --raw-tag=RX --assign-tag=MI --strategy=adjacency --edits=1 --min-map-q=20 samtools view S1_umiGrp.bam | grep "RX:Z:CGAAGGGCTATG" NB501480:383:HK2MCAFXY:1:11306:24370:4112 99 chr1 120465271 60 79M = 120465226 79 CAAGAGGGTATGACAGGGTCCCCTGAATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEE6EE//AEAE/E/EEE/E/EEE/A/EEA/EE/E/EEE<A///EEE/EE6EAA<E/EEEEAEAEEAA<A/A/EEEE// MC:Z:4S124M MD:Z:25T53 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:1 MQ:i:60 AS:i:74 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:11306:24370:4112 147 chr1 120465226 60 4S124M = 120465271 -79 TTTGGCCACCTTTCCCCTTTACACCAGTGCCACTCACTGTCGACTGACACAAGAGGGTATGACAGGGTCCCCTGTTGGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG </A//A<6<EEEE/<</</</EE/EAEEA/EA/</AE/E///6/AAEEA//E</E/EAEEE<<EE/E//EEE/AA//EA/EEAEA//E/EAEE<AAE6EE//EA/EA</E/E/E/66EEEE/E/EEEE MC:Z:79M MD:Z:35A4A30A0T51 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:4 MQ:i:60 AS:i:104 XS:i:20 RX:Z:CGAAGGGCTATG

2) CallMolecularConsensusReads: java -jar ./fgbio-1.0.0.jar CallMolecularConsensusReads --input=S1_umiGrp.bam --output=S1_umiCss.bam --tag=MI --read-name-prefix=csr --error-rate-pre-umi=30 --error-rate-post-umi=30 --min-reads=1 --min-input-base-quality=10 --min-consensus-base-quality 0 samtools view S1_umiCss.bam | grep "RX:Z:CGAAGGGCTATG" csr:350549 77 0 0 0 0 CAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG ???>??>>????>?>???>?>???>$>???>??>?>?????>>>???>??>?????>?????????????>?>????>> cD:i:2 cE:f:0.00632911 RG:Z:19SL00999_S11_1 MI:Z:350549 cM:i:2 RX:Z:CGAAGGGCTATG csr:350549 141 0 0 0 0 CACTGCTTCAAGAACACGGATGCAGCAGCAGCTCTCCTGGCCTCTCACGCCATACAGGGGACCCTGTCATACCCTCTTG ????>?>????>>>?>?>?>???>??>>??>????????>?>>?????>??00??>??>>>?>?????????>?<??>> cD:i:2 cE:f:0.0126582 RG:Z:19SL00999_S11_1 MI:Z:350549 cM:i:2 RX:Z:CGAAGGGCTATG

nh13 commented 5 years ago

@tomSimonet can you share a SAM/BAM that's been subsetted to just the reads with the MI tag 350549 from the GroupReadsByUmi BAM (please included the SAM header lines too)? This would help us debug the issue.

samtools view -H <in.bam> > test.sam
samtools view <in.bam> | grep "MI:Z:350549" >> test.sam
tomSimonet commented 5 years ago

Hello,

Here is the sam for the group 350549, and , in case it might help, another sam for the whole UMI CGAAGGGCTATG it is derived from.

Many thanks

Thomas

group:350549 sam

@HD VN:1.5 SO:unsorted GO:query SS:unsorted:template-coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chrM LN:16571 @RG ID:19SL00999_S11_1 LB:na PL:na SM:19SL00999_S11 PU:na @PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa mem -t 3 -R @RG\tID:19SL00999_S11_1\tLB:na\tPL:na\tSM:19SL00999_S11\tPU:na hg19_min_oldM bbduk_fastq/19SL00999_S11_bbduk_R1.fastq.gz bbduk_fastq/19SL00999_S11_bbduk_R2.fastq.gz NB501480:383:HK2MCAFXY:1:11306:24370:4112 99 chr1 120465271 60 79M = 120465226 79 CAAGAGGGTATGACAGGGTCCCCTGAATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEE6EE//AEAE/E/EEE/E/EEE/A/EEA/EE/E/EEE<A///EEE/EE6EAA<E/EEEEAEAEEAA<A/A/EEEE// MC:Z:4S124M MD:Z:25T53 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:1 MQ:i:60 AS:i:74 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:11306:24370:4112 147 chr1 120465226 60 4S124M = 120465271 -79 TTTGGCCACCTTTCCCCTTTACACCAGTGCCACTCACTGTCGACTGACACAAGAGGGTATGACAGGGTCCCCTGTTGGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG </A//A<6<EEEE/<</</</EE/EAEEA/EA/</AE/E///6/AAEEA//E</E/EAEEE<<EE/E//EEE/AA//EA/EEAEA//E/EAEE<AAE6EE//EA/EA</E/E/E/66EEEE/E/EEEE MC:Z:79M MD:Z:35A4A30A0T51 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:4 MQ:i:60 AS:i:104 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:21311:11227:6707 99 chr1 120465271 60 79M = 120465226 79 CAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEAAEEEEEEEEEEAE MC:Z:4S124M MD:Z:79 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:0 MQ:i:60 AS:i:79 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:21311:11227:6707 147 chr1 120465226 60 4S124M = 120465271 -79 TTTGGCCACCTTTCCCCTTTACACCAGTGCCACTCACTGACGACAGACACAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG AAEEEAAAAEEAEEAAEEEE<A<AA<AEEEEEA<AAEEEEEEEEAEAEEEEEE/EEEEAAAEEEEAEAE/EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAE MC:Z:79M MD:Z:124 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:0 MQ:i:60 AS:i:124 XS:i:0 RX:Z:CGAAGGGCTATG

UMI:CGAAGGGCTATG sam

@HD VN:1.5 SO:unsorted GO:query SS:unsorted:template-coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chrM LN:16571 @RG ID:19SL00999_S11_1 LB:na PL:na SM:19SL00999_S11 PU:na @PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa mem -t 3 -R @RG\tID:19SL00999_S11_1\tLB:na\tPL:na\tSM:19SL00999_S11\tPU:na hg19_min_oldM bbduk_fastq/19SL00999_S11_bbduk_R1.fastq.gz bbduk_fastq/19SL00999_S11_bbduk_R2.fastq.gz NB501480:383:HK2MCAFXY:1:11306:24370:4112 99 chr1 120465271 60 79M = 120465226 79 CAAGAGGGTATGACAGGGTCCCCTGAATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEE6EE//AEAE/E/EEE/E/EEE/A/EEA/EE/E/EEE<A///EEE/EE6EAA<E/EEEEAEAEEAA<A/A/EEEE// MC:Z:4S124M MD:Z:25T53 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:1 MQ:i:60 AS:i:74 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:11306:24370:4112 147 chr1 120465226 60 4S124M = 120465271 -79 TTTGGCCACCTTTCCCCTTTACACCAGTGCCACTCACTGTCGACTGACACAAGAGGGTATGACAGGGTCCCCTGTTGGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG </A//A<6<EEEE/<</</</EE/EAEEA/EA/</AE/E///6/AAEEA//E</E/EAEEE<<EE/E//EEE/AA//EA/EEAEA//E/EAEE<AAE6EE//EA/EA</E/E/E/66EEEE/E/EEEE MC:Z:79M MD:Z:35A4A30A0T51 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:4 MQ:i:60 AS:i:104 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:21311:11227:6707 99 chr1 120465271 60 79M = 120465226 79 CAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEAAEEEEEEEEEEAE MC:Z:4S124M MD:Z:79 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:0 MQ:i:60 AS:i:79 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:21311:11227:6707 147 chr1 120465226 60 4S124M = 120465271 -79 TTTGGCCACCTTTCCCCTTTACACCAGTGCCACTCACTGACGACAGACACAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG AAEEEAAAAEEAEEAAEEEE<A<AA<AEEEEEA<AAEEEEEEEEAEAEEEEEE/EEEEAAAEEEEAEAE/EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAE MC:Z:79M MD:Z:124 RG:Z:19SL00999_S11_1 MI:Z:350549 NM:i:0 MQ:i:60 AS:i:124 XS:i:0 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:11203:21426:13989 99 chr1 120465301 60 49M = 120465259 49 CGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG AEEEEEEEEEEEEAEEEEEAEEAEE<E<EEEAEAEEEEEAEEE<EEE6< MC:Z:91M MD:Z:49 RG:Z:19SL00999_S11_1 MI:Z:353556 NM:i:0 MQ:i:60 AS:i:49 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:1:11203:21426:13989 147 chr1 120465259 60 91M = 120465301 -49 TGACGACAGACACAAGAGGGTATGACAGGGTCCCCTGTATGGCGTGAGAGGCCAGGAGAGCTGCTGCTGCATCCGTGTTCTTGAAGCAGTG EEE/A<E<A66///EEEE/AEEE<EEEEEEEEE/EEEEAEAEEEAEEEEEEAE/6EEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MC:Z:49M MD:Z:91 RG:Z:19SL00999_S11_1 MI:Z:353556 NM:i:0 MQ:i:60 AS:i:91 XS:i:0 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:3:11610:13036:10842 99 chrX 39933303 60 105M = 39933380 204 AACGGTCTGCTTCTCCAACAGAGGAGGTGAGCTGCCGTCTTTTCTGTCTTGAACCGCTGTCTTCCGGGCATGCCCGGGCACTGGCTGGGCACCTCCGCCCCCTTC E/EEEEEEEEEEEEEEAEEEEEA<EEAEAEEEEE/EEAAAEEEEEEEEEEEAEEEEEEEEEEEAAAE<EEEEEA<AEAEEEA</A<AAEA<AEA66A<<AA<A<A MC:Z:127M MD:Z:36A57T10 RG:Z:19SL00999_S11_1 MI:Z:3694996 NM:i:2 MQ:i:60 AS:i:95 XS:i:0 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:3:11610:13036:10842 147 chrX 39933380 60 127M = 39933303 -204 GCACTGGCTGGGCACCTCCGCCCCCTTCCGGAGCCTTGGGATACTTGCCATTGGAGAGCCTGGCCGCGGGGAACTCGCTGCTAACTGTCATGTATGGCTTTGACAGGGCAACTGAAGGAGAGGTGGA <AAAEEEAEAEAEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AEEEEEEEEEEEE<EEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEE MC:Z:105M MD:Z:17T109 RG:Z:19SL00999_S11_1 MI:Z:3694996 NM:i:1 MQ:i:60 AS:i:122 XS:i:0 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:4:11612:25668:19647 99 chrX 39933303 60 105M = 39933380 204 AACGGTCTGCTTCTCCAACAGAGGTGGTGAGCTGCCGTCTTTTGTGTCTTGAACCGCTGTCTTCCGGGCACGCCAGGGCACTGGCTGGGCACCTTCGCCCCCTTC A//EEE/AEA/</EEEEA<EEEEE//E/A<E//6//AA/EE/E///</EE</A/E/6//EEEE<EE//E///E//<AEAE<E/EE<6A<</AAE<A<A/<<6AE/ MC:Z:127M MD:Z:24A11A6C26T3C30 RG:Z:19SL00999_S11_1 MI:Z:3694996 NM:i:5 MQ:i:60 AS:i:80 XS:i:20 RX:Z:CGAAGGGCTATG NB501480:383:HK2MCAFXY:4:11612:25668:19647 147 chrX 39933380 60 127M = 39933303 -204 GCACTGGCTGGGCACCTTCGCCCCCTTCCGGAGCCTTGGGATACTTGCCATTGGAGAGCCTGGCCGCGGGGAACTCGCTGCTAACTGTCATGTATGGCTTTGACAGGGCAACTGAAGGAGAGGTGGA /</A//A</A/EEA<A/E//A<EEEA<AEEA<6<<//6EA//EE/<A<EEE/6//<<AEE/EAEEEEEE<///EEE/A/<//EEE<<EA/EA//EAAE///E/E/E<EEE/E/AE6EEE////E/<< MC:Z:105M MD:Z:127 RG:Z:19SL00999_S11_1 MI:Z:3694996 NM:i:0 MQ:i:60 AS:i:127 XS:i:0 RX:Z:CGAAGGGCTATG

nh13 commented 5 years ago

@tomSimonet I see the place where it is trimming: https://github.com/fulcrumgenomics/fgbio/blob/master/src/main/scala/com/fulcrumgenomics/umi/UmiConsensusCaller.scala#L247-L256. I could add an option to skip this step, but I am genuinely curious about your experiment. My intuition is that if the 3' end of R2 reads beyond the 5' end of R1 (forward/reverse read pairs), then the 3' end of R2 sequences into the UMI/adapter/primer of R1. Could you tell us more about your experiment where this wouldn't be the case?

tomSimonet commented 5 years ago

You're right, it falls within the R1 specific primer, which I have trimmed only from the R1 read, in order to keep this region (amplicons are in some place not sufficiently overlapping) and to get a less biased estimation of the allelic depth of any variant embedded within it (so only based on the R2 read).

qiaseq

but may be I was wrong to do so? If no, an option to disable this trimming would be great, in this case

nh13 commented 5 years ago

@tomSimonet if your amplifying products based on the gene specific primer, then all products will contain the complementary sequence of the GSP . So when you sequence from read 2, you'll read through the template (sample DNA fragment) and into the primer sequence. If there's a variant under primer, you get worse amplification (i.e. allele amplification bias), but the products themselves will always have the same sequence as the primer. So you'll need to trim it off on both ends of the read: for read one, it's at the start of the read, and for read two, it can be variably at the end of the read. It's the same principle when you have short inserts and reading into adapters. I would not use any of the synthetic (i.e. primer) bases as part of my allele depth for the sample (or for variant calling).