calacademy-research / minibar

Dual barcode and primer demultiplexing for MinION sequenced reads
BSD 2-Clause "Simplified" License
35 stars 5 forks source link

Issues with reverse complemented reads #11

Open meganduval opened 11 months ago

meganduval commented 11 months ago

Hello!

I'm working with a portion of the COI gene in ants, and have been using minibar to demultiplex my reads. We've developed our own indexes, here's an example of what they look like from my input barcode file: FwIndex: aagcactgcgttcgttatcggagt RvIndex: cgtgccaatgaaaccttgactaag

I've been running the following line in Python 3.9.12, with -e 10 based on the barcode edit distance values of my indexes: python minibar.py primers_pool_universal.txt concat_sequence.fastq -e 10 -T -F

Although this command gives me demultiplexed reads for each of my samples, many of the reads are reverse complemented, which is causing problems when I try to align the reads and create a consensus sequence. I've tried testing different parameters and options (for example, including reverse complemented primers and indexes in my barcode file), but haven't been able to get around this problem. Do you know what could be going on and how to fix the issue of reverse complemented reads?

Thank you for your help, I really appreciate this tool!

Megan

jbh-cas commented 11 months ago

The molecule can go through the pore either way so there is no controlling that, but the information about which barcode was found is in those cryptic letters at the end of the header. For example:

H-(3,2),H+(0,1) FI1 H+(0,3),h-(4,-1) FI1 h-(4),h+(4) FI1

The H- or h- as the first item means that the reverse barcode or primer was found at the beginning.

A few things come to mind.

One is you could have the multiple sequence aligner handle that for you. For example, mafft has --adjustdirection or --adjustdirectionaccurately options to do a reverse complement of the sequence if necessary and marks "_R_" at the head of sequence title when it revcomps. BTW, I find it useful to use --preservecase with mafft so it does not fold everything to lowercase.

I'm not sure what portion of COI you capture but the whole gene should be around 1550bp so you will have most to all of it in ONT long sequences. You can use blast against a representative COI and it will show minus strand matches if the sequence should be revcomped.

Also you could do it with a script for sequences that have H- or h- as first match indicators. I find bioawk from Heng Li (who also created bwa and minimap2) to be very useful. We have added extensions to his bioawk in a version named bioawk_cas and either will work for this.

You could make a script with this contents:

#!/bin/bash

bioawk -c fastx '

   # see if reverse barcode or primer found at seq start
   $comment ~ " H-" || $comment ~ " h-" {  # if so reverse comp the sequence and append _RC to record name
      revcomp($seq); $name = $name "_RC"
   }

   # print the fastq record
   {
      print "@" $name " " $comment
      print $seq
      print "+"
      print $qual
   }
' $1  # $1 is first arg or if no arg than read from stdin

Say script is named revcomp_if_needed.sh and your input file is named COI.fastq. Then this should do it:

bash revcomp_if_needed.sh COI.fastq > COI_w_RC.fastq

Personally, I like the first option of having the MSA program handle it but advantages to them all.

meganduval commented 11 months ago

I tried using --adjustdirection on mafft and that worked. Thank you so much, your suggestions were very helpful!

Callithrix-omics commented 10 months ago

thanks a lot for this thread. I also had this issue and found the suggestions very helpful.