Clip primer sequences post alignment

finswimmer commented 5 years ago

Hello,

we are doing targeted sequencing using a Qiagen Kit. During library prep the dna is fragmented and tagged with UMI and a universal primer sequence at one end of the strand. So for the enrichment only one specific primer per amplicon is needed.

I would like to clip the primer sequences after alignment. fgbio provides the tools TrimPrimers and ClipBam. The first one expect a pair of primers, where the second one clippes a fixed number of bases.

Qiagen provides a file like this:

chr3    178928203   1   AAAAGCTATTATATATACATAAGAGAGAAGGTTTGACTGCC
chr3    178928223   0   TCCATGCTTAGAGTTGGAGTTTGACTGGTTC
chr3    178928286   0   GATTGAAGAGCATGCCAATTGGTCTGTATCC

The first two columns describes the start position of the primer, the third whether the read is a forward (0) or reverse (1) primer and in the 5th column the sequence.

What I wish is a tools that takes the bam and the above tab separated text file as input and clippes (soft and/or hard) the primers from the reads after alignment. It should take care if the opposite strand overlaps the clipped region and clip those bases as well.

Could fgbio provide such a function?

Thanks!

fin swimmer

FYI: I asked for such a tool before on biostars.

nh13 commented 5 years ago

@finswimmer I haven't tried it, but you might try running TrimPrimers as follows.

Assume a forward primer. Set the coordinates chrom/left_start/left_end (as described in the usage) to the primer coordinates, and set right_start to one base past the end of the amplicon, and right_end as one minus right_start.

Assume a reverse primer. Set the coordinates chrom/right_start/right_end (as described in the usage) to the primer coordinates, and set left_start to one base before the start of the amplicon, and left_end as one minus left_start.

This may cause an exception in TrimPrimers but the idea is that the opposite primer has effective length zero so no bases will be trimmed, and the input primer file will accurately define the amplicon (for matching the read pair to an amplicon).

If that doesn't work, likely we'd need to modify or build a new tool for your purpose. Given our current client commitments we likely will not be able to get to this any time soon. Therefore, if you're serious about needing this, feel free to reach out to us and inquire about working together: http://www.fulcrumgenomics.com/contact/

finswimmer commented 5 years ago

Hello @nh13,

unfortunately this doesn't work. The reads are trimmed to much. I guess the problem is:

Paired end reads that map to a given amplicon position are trimmed so that the alignment no-longer includes the primer sequences. All other aligned reads have the maximum primer length trimmed!

Another work around I could imagine, is to extract the mapping position of the read paires and build a new primer file based on that (Looking which read starts a a primer position and choose the primer position and length and of the second strand in that way, that it becomes 0). I could not test this approach until now.

fin swimmer

nh13 commented 5 years ago

What reads are trimmed too much and why? That doesn’t make sense to me.

finswimmer commented 5 years ago

I was afraid that my description is confusing. So next try ;)

I have paired end data. So TrimPrimers need to know where an amplicon starts and where it ends. If it finds an amplicon, for which no coordinates are provided, it will trim the 5' end of both reads by the maximum primer length provided.

Due to my library prep, where one end is randomly fragmented and the other one is determined by the primer, I only now for sure one end of my amplicon and I cannot provide easily the second end.

This said, if I provide chrom/left_start/left_end and set right_start/right_end to something invalid or that result in 0 length (in case of forward primer) as you described, TrimPrimers cannot identify the amplicon correct and it will trim both 5' ends by the maximum primer length it can find.

What I expect is to just trim the 5' end, which contain the primer sequence and any overlapping bases from the opposite strand.

nh13 commented 5 years ago

I understand now. Let me see what I can do. It may be a little while as I have to prioritize client work but it shouldn’t be that hard.

nh13 commented 4 years ago

Sorry, this never bubbled up to priority list so I could get time to work on it. Moving to help-wanted in case others want to take a shot.

rspreafico-vir commented 4 years ago

Would it be sufficient to provide a flag to disable at will specifically this part:

All other aligned reads have the maximum primer length trimmed!

That is, when the user enables this new flag, TrimPrimers only trims what the tab-delimited text file provides?

Such a flag would be generally useful beyond the motivating Qiagen protocol.

In addition, it would enable the workaround suggested by @nh13 for immediate resolution, without having to handle the specifics of the Qiagen assay.

rspreafico-vir commented 4 years ago

Somehow connected: could TrimPrimers provide a list of trimmed positions, trimmed reads and summaries? This would help towards understanding what is being cut as diagnostics. For the use case in this thread, having such output would immediately reveal whether only the 5' end is being trimmed as intended, or whether there is additional unwanted trimming.

nh13 commented 4 years ago

@rspreafico-vir I think your suggestions are reasonable to investigate. For the short-term, we're prioritizing paid client work given our limited work capacity in these trying times. Feel free to reach out to us if you'd like to work with us.

rspreafico-vir commented 4 years ago

@nh13 Sure thing, very understandable!

We are using TrimPrimers as recommended by your friends at Paragon Genomics. The current implementation works for our needs and for that kit without any modification.

This thread caught my attention as I can imagine such a flag being useful to extend the tool to remove primers for example from long, partially overlapping amplicons broken down with Nextera. In that case, only the two ends of each amplicon would require trimming, whereas all other "internal" fragments should not be trimmed.

Very ok for me to put this in back-burner. Thanks!

fulcrumgenomics / fgbio

Clip primer sequences post alignment #532