fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

Check for UMI swapping #837

Open blackbeerd opened 2 years ago

blackbeerd commented 2 years ago

I have observed cases where my dual UMIs can get swapped during the amplification steps in library prep, resulting in two consensus sequences that align to the same start/stop position but share only half of the dual UMI for a given read pair. Note that my UMIs are two random 6-mers, one on each of the i5/i7 indices, adjacent to the fixed-base multiplexing index sequence, and they are read out in the index reads (I manually append them to the reads prior to running DemuxFastqs).

For example: two pairs of consensus sequences map to the same start/stop position and have UMIs "TAGCTA-ATCGAT" and "GCCGAT-ATCGAT", respectively (the i5 UMIs are different and the i7 UMIs are the same). The probability of this occurring by chance is very, very low, particularly when the two consensus pairs carry the same alternate SNV while the rest of the consensus sequences at that position carry the reference allele, as in my data.
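For a rough sense of scale, here is my back-of-envelope (assuming the UMI halves are uniform random 6-mers):

```python
# Back-of-envelope (assumes uniform random 6-mers): probability that two
# independent molecules at the same start/stop share one given UMI half.
p_one_half = (1 / 4) ** 6        # one specific 6-mer matching: ~2.44e-04
p_either_half = 2 * p_one_half   # roughly, either the i5 or the i7 half
print(f"one half: {p_one_half:.2e}, either half: {p_either_half:.2e}")
```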

The solution to this would be to add a step after CallMolecularConsensusReads that iterates through each chr/pos in a BAM and looks for half-matching UMIs with the same start/stop position (rough sketch below). You could either flag those reads as duplicates or "fix" one set of UMIs to match the other?
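A rough sketch of what I mean (not an existing fgbio feature; assumes the RX tag holds "umi1-umi2" as in my data):

```python
from collections import defaultdict
import pysam

# Group consensus read pairs by template coordinate, then report pairs that
# share exactly one half of the dual UMI (candidate swaps).
by_pos = defaultdict(list)
with pysam.AlignmentFile("consensus.bam", "rb") as bam:
    for rec in bam.fetch(until_eof=True):
        if not (rec.is_paired and rec.is_read1):
            continue
        key = (rec.reference_name, rec.reference_start,
               rec.reference_start + rec.template_length)
        umi1, umi2 = rec.get_tag("RX").split("-")
        by_pos[key].append((rec.query_name, umi1, umi2))

for key, reads in by_pos.items():
    for i, (name_a, a1, a2) in enumerate(reads):
        for name_b, b1, b2 in reads[i + 1:]:
            if (a1 == b1) != (a2 == b2):  # exactly one half matches
                print(key, name_a, name_b, "candidate UMI swap")
```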

Please consider adding this feature.

Happy to provide some test reads!

nh13 commented 2 years ago

This is super interesting! Do you have more information on which prep you're using?

blackbeerd commented 2 years ago

Sure - we've got a couple of papers, but the data pipelines are old. The fgbio tools are working much better ;)

(publication links: https://pubmed.ncbi.nlm.nih.gov/33466369/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8861971/, https://pubmed.ncbi.nlm.nih.gov/30833418/)

Basically the prep is as follows:

1. Input cfDNA or sonicated DNA (~150 bp)
2. End-repair/A-tail with the KAPA Hyper Prep kit
3. Ligate custom dual-indexed Y-tailed (Illumina-compatible) adapters (including flow cell binding sequences P5/P7, etc.)
4. PCR-amplify using flow-cell binding sites (Illumina primers)
5. Hybrid capture (IDT xGen custom targets)
6. Clean up and amplify again
7. Sequence PE150

Dual-Indexed Degenerate-Adaptor (DIDA) sequences are as follows:

X = index, N = degenerate barcode:

```
i5 Adapter: AATGATACGGCGACCACCGAGATCTACACXXXXNNNNNNXXXXACACTCTTTCCCTACACGACGCTCTTCCGATC*T
i7 Adapter: /5Phos/GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXNNNNNNXXXXATCTCGTATGCCGTCTTCTGCTTG
```

Adapter schematic from Butler et al. 2019: [image]
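To illustrate the layout, extracting the degenerate 6-mer from an index read is a simple slice. A hypothetical sketch (the 4/6/4 split of a 14 bp index read and the split_index_read helper are just for illustration):

```python
# Hypothetical sketch for the XXXXNNNNNNXXXX layout above: each index read
# carries 4 index bases, a 6-mer UMI, then 4 more index bases.
def split_index_read(seq: str) -> tuple[str, str]:
    """Return (index_bases, umi) from a 14 bp index read."""
    return seq[:4] + seq[10:14], seq[4:10]

# Combine the i5/i7 UMIs into the RX-style "umi1-umi2" string.
_, umi1 = split_index_read("ACGTTTAGCAACGT")  # i5 index read
_, umi2 = split_index_read("TGCAATCGATTGCA")  # i7 index read
rx = f"{umi1}-{umi2}"                         # -> "TTAGCA-ATCGAT"
```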

nh13 commented 2 years ago

Some obvious questions:

  1. is the UMI ATCGAT just more common in general?
  2. could this be a result of chimerism?

blackbeerd commented 2 years ago

1. There are multiple examples of this in a given library, and the UMIs involved don't seem to have any features in common (the example I gave was made up).
2. I assume this is the same switchback "mechanism" that makes chimeras, and I was hoping that using FindSwitchbackReads in my pipeline would eliminate these. Maybe I need to adjust the params? I will say that otherwise the reads align well and don't appear to be clipped, so there's no evidence of chimerism in the way they align.

blackbeerd commented 2 years ago

I have considered several mechanisms for these events, and since there isn't a significant number of chimeras and the entire read typically maps fine, I'm thinking it is most likely the result of residual adaptor present during the amplification steps.

I'm trying to figure out the best way to flag these consensus read pairs and either mask or omit one of the two in a given set (keeping the consensus pair with the most raw-read support and the least error, or something like that). I've written a pysam script that can identify them, but it is prohibitively slow. I wonder if there is a way to modify CallDuplexConsensusReads to find and mark these read pairs?

nh13 commented 2 years ago

It's probably better to put it into FilterConsensusReads if it is flagging/masking/omitting consensus reads. Do you want to share your script?

blackbeerd commented 2 years ago

I'm realizing that it's not going to work as-is (even if it weren't prohibitively slow). I was doing a simple iteration through the position-sorted (filtered) consensus BAM, comparing tags between adjacent reads that have the same start/stop. This catches consensus sequences that have discordant umi1 and matching umi2 (tag format I previously described, i.e., RX:Z:umi1-umi2), but misses matching umi1 and discordant umi2 unless the umi2s happen to sort next to each other alphabetically.
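One way to catch both directions in a single pass, without relying on sort adjacency, would be to index each template coordinate by each UMI half separately. A rough sketch (find_half_matches is illustrative, not an existing tool; the input tuples are assumed to be parsed from the consensus BAM as described):

```python
from collections import defaultdict

# reads: iterable of (read_name, chr, start, stop, umi1, umi2) tuples.
def find_half_matches(reads):
    by_umi2 = defaultdict(list)  # same coord + umi2 -> possible umi1 swap
    by_umi1 = defaultdict(list)  # same coord + umi1 -> possible umi2 swap
    for name, chrom, start, stop, umi1, umi2 in reads:
        by_umi2[(chrom, start, stop, umi2)].append((name, umi1))
        by_umi1[(chrom, start, stop, umi1)].append((name, umi2))
    swaps = []
    for key, group in by_umi2.items():
        if len({u for _, u in group}) > 1:  # discordant umi1, matching umi2
            swaps.append(("umi1 discordant", key, group))
    for key, group in by_umi1.items():
        if len({u for _, u in group}) > 1:  # discordant umi2, matching umi1
            swaps.append(("umi2 discordant", key, group))
    return swaps
```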

To try to solve this, I wrote a script that just stores the template position and some tag information (first read only) for all read pairs in the BAM, and then goes through the list of unique read pair coordinates to group the read information by coordinate. My hope was to take the output of this (the "matches" dataframe), generate a list of readnames that have either matching umi1 or umi2, and then use SAMsift or a similar tool to filter them out of my final BAM (see script below). A bit of a hack, but I am a biologist by training and my computational skills are "in-progress" ;)

Output example: you can see the read pairs that have only one occurrence at a given start/stop coordinate (match_count = 1) at the top, and two start/stop coordinates that each have two sets of read pairs (rows 7649-7652 below; the bold formatting was lost, sorry). Both sets have matching umi2 and discordant umi1.

```
        chr     start      stop                read_name    umi1    umi2  consensus_depth  consensus_error  match_count
0     chr17  31325171  31325891  sbrt220303_B12:28270053  CATCCA  ACGTAA                6              0.0            1
1     chr17  31325177  31325889  sbrt220303_B12:28270061  CGGTAA  CGCGAG                4              0.0            1
2     chr17  31325231  31324907  sbrt220303_B12:28270005  AGAGCG  ATTCCT                4              0.0            1
3     chr17  31325234  31325973  sbrt220303_B12:28270118  CGCCCA  TCTTTC                5              0.0            1
4     chr17  31325247  31325960  sbrt220303_B12:28270120  TATTCG  GTGTGA                7              0.0            1
...     ...       ...       ...                      ...     ...     ...              ...              ...          ...
7648  chr17  31325882  31326020  sbrt220303_B12:28306661  AAGGGG  CTCCGT                3              0.0            1
7649  chr17  31325882  31326024  sbrt220303_B12:28306674  CCGTGG  ATACCA                3              0.0            1
7650  chr17  31325882  31326024  sbrt220303_B12:28306675  AGGTTT  ATACCA                3              0.0            2
7651  chr17  31325882  31326028  sbrt220303_B12:28306678  GGAGAG  TCCTGG                3              0.0            1
7652  chr17  31325882  31326028  sbrt220303_B12:28306679  AATCGG  TCCTGG                3              0.0            2
```

This is, of course, too slow as it is, and that's before I've even started trying to filter the BAM...

```python
import pysam
import pandas as pd


def main():
    bam = pysam.AlignmentFile("input.bam", "rb")
    id = "testrun"
    dirout = "./"

    # Generate a dataframe of read name, coordinates, UMIs, and consensus
    # tags for the first read of each pair.
    print("Getting readnames, pos, and UMIs from bam file...")
    rows = []
    for rq1 in bam.fetch(until_eof=True):
        if rq1.is_paired and rq1.is_read1:
            umi1_cur, umi2_cur = rq1.get_tag("RX").split("-")
            rows.append({
                'read_name': rq1.query_name,
                'chr': rq1.reference_name,
                'start': rq1.reference_start,
                'stop': rq1.reference_start + rq1.template_length,
                'umi1': umi1_cur,
                'umi2': umi2_cur,
                'consensus_depth': rq1.get_tag("cD"),
                'consensus_error': rq1.get_tag("cE"),
            })
    df = pd.DataFrame(rows, columns=['read_name', 'chr', 'start', 'stop', 'umi1',
                                     'umi2', 'consensus_depth', 'consensus_error'])

    # Unique read pair coordinates (chr/start/stop).
    df_chr_pos_uniq = df[['chr', 'start', 'stop']].drop_duplicates()

    # For each unique coordinate, collect all read pairs mapping there so
    # their UMIs can be compared (this nested scan is the slow part).
    match_rows = []
    for _, row1 in df_chr_pos_uniq.iterrows():
        match_cnt = 0
        for _, row2 in df.iterrows():
            if row2['chr'] != row1['chr']:
                continue
            if row2['start'] != row1['start'] or row2['stop'] != row1['stop']:
                continue
            match_cnt += 1
            match_rows.append({
                'chr': row1['chr'], 'start': row1['start'], 'stop': row1['stop'],
                'read_name': row2['read_name'], 'umi1': row2['umi1'],
                'umi2': row2['umi2'],
                'consensus_depth': row2['consensus_depth'],
                'consensus_error': row2['consensus_error'],
                'match_count': match_cnt,
            })
    matches = pd.DataFrame(match_rows, columns=['chr', 'start', 'stop', 'read_name',
                                                'umi1', 'umi2', 'consensus_depth',
                                                'consensus_error', 'match_count'])
    matches.to_csv('{0}/{1}.all_umis.txt'.format(dirout, id), index=False)


if __name__ == "__main__":
    main()
```
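For what it's worth, the quadratic inner loop could probably be replaced by a single pandas groupby over the same dataframe. A sketch, assuming df is built exactly as in main() above (the output filename is just a placeholder):

```python
# One-pass equivalent of the nested loops: group by template coordinate.
grp = df.groupby(['chr', 'start', 'stop'])
df['match_count'] = grp.cumcount() + 1                     # running count, as in the loop
df['pairs_at_coord'] = grp['read_name'].transform('size')  # total pairs at this coordinate
# Coordinates with more than one consensus pair are the candidates to inspect.
candidates = df[df['pairs_at_coord'] > 1]
candidates.to_csv('{0}/{1}.umi_candidates.txt'.format(dirout, id), index=False)
```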