adnaniazi / tailfindr

An R package for estimating poly(A)-tail lengths in Oxford Nanopore RNA and DNA reads.
https://www.cbu.uib.no/valen/
GNU General Public License v3.0

0.1.0 branch questions #67

Open haannguyen opened 8 months ago

haannguyen commented 8 months ago

Hi Adnan,

Thank you so much for the great software! I have been trying to use it to look at the content of the polyA tails in my reaction. I have a few questions:

  1. The polya_tail_fasta_seq column doesn't match the length in the tail_length column. For example, tail_length is 95.6 but polya_tail_fasta_seq is 'UAAAAAAGU'. How should I interpret this?
  2. Are there any confidence/quality metrics for the estimates? I.e., how confident are we in the non-A calls in 'UAAAAAAGU'?
  3. Do you have any stats on the baseline or expected noise for the outputs? I have a sample that is supposed to be all As, but the polya_tail_fasta_seq output contains only ~71% As. Is there a way to set thresholds for tail content estimates?

Thank you so much.

Yours, Ha An

adnaniazi commented 7 months ago

Hi,

Thank you for using tailfindr.

Here are answers to your questions:

  1. When tail lengths are greater than 10-15 nt, the basecaller is not very accurate in outputting the corresponding number of bases. For example, a tail might actually be 30 nt long, but the basecalled sequence for that region might contain only 12 As, because the basecaller struggles to output the correct number of homopolymer bases. That is the whole reason tools like tailfindr exist: to predict the tail length accurately where the basecaller cannot.
  2. No, we don't have any confidence metric for the non-A calls. I just find a monotonous signal region that is supposed to correspond to the polyA tail, and then I output the bases predicted by the basecaller in this region.
  3. If a read is just the adapter, followed by all As, and nothing else, then tailfindr output would not be reliable. For tailfindr to work properly, you must have the sequencing adapter, followed by the polyA tail, followed by some non-polyA sequence.

Best, Adnan

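
To make the point in answers 1 and 2 concrete, here is a minimal Python sketch that puts the two columns side by side for the read from the question. The helper `summarize_tail` is hypothetical and not part of tailfindr; it only assumes the `tail_length` and `polya_tail_fasta_seq` values mentioned in this thread.

```python
from collections import Counter


def summarize_tail(tail_length, tail_seq):
    """Compare the signal-based length estimate with the basecalled tail bases.

    Hypothetical helper, not a tailfindr function.
    tail_length: value from the tail_length column (in nt).
    tail_seq: value from the polya_tail_fasta_seq column.
    """
    counts = Counter(tail_seq.upper())
    n_called = len(tail_seq)
    return {
        "tail_length": tail_length,
        # Number of bases the basecaller emitted in the tail region
        "called_bases": n_called,
        # How much of the estimated tail the basecaller actually emitted
        "called_over_estimated": n_called / tail_length if tail_length else None,
        # Share of the basecalled tail bases that are A
        "a_fraction": counts["A"] / n_called if n_called else None,
    }


# Values quoted in the question above
summary = summarize_tail(95.6, "UAAAAAAGU")
print(summary)
```

A small `called_over_estimated` value means the basecaller emitted far fewer bases than the signal-based estimate, so `a_fraction` is computed from only a handful of calls and carries no confidence score.
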
haannguyen commented 7 months ago

Hi Adnan,

Thanks for the detailed reply! I see. I did not realize that polya_tail_fasta_seq is pulled from the basecalled sequence rather than being recomputed from the data by tailfindr.

> If a read is just the adapter, followed by all As, and nothing else, then tailfindr output would not be reliable. For tailfindr to work properly, you must have the sequencing adapter, followed by polyA tail, followed by some non-polyA sequence.

Yes, I do have a non-polyA sequence attached to the polyA tail (which is supposed to be A only), and I get a polya_tail_fasta_seq output that is only ~70% A. Is nanopore basecalling this noisy/error-prone? For clarity, I am trying to compare a tailing reaction with ATP against one with mixed nucleotides to assess tail content.

Thank you for all your help!!
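
One way to frame the ATP vs mixed-nucleotide comparison described above is to treat the ATP-only sample as an empirical noise baseline and look at the excess non-A fraction in the mixed sample. The Python sketch below is only an illustration under that assumption: the function `pooled_non_a_fraction` and the toy sequences are hypothetical, this is not a tailfindr feature, and the approach is not validated. Since there is no per-base confidence metric for the tail region, any difference should be read cautiously.

```python
from collections import Counter


def pooled_non_a_fraction(tail_seqs):
    """Fraction of non-A bases pooled over all basecalled tail sequences.

    Hypothetical helper; tail_seqs is a list of polya_tail_fasta_seq values.
    """
    counts = Counter()
    for seq in tail_seqs:
        counts.update(seq.upper())
    total = sum(counts.values())
    return (total - counts["A"]) / total if total else None


# Toy sequences invented for illustration, not real tailfindr output
atp_only = ["AAAAAAGA", "UAAAAAAGU", "AAAAACAAAA"]
mixed = ["AAGAAGGA", "AGAAUAAG", "AAGGAAAGAA"]

# Assumption: non-A calls in the ATP-only control reflect basecalling noise
background = pooled_non_a_fraction(atp_only)
observed = pooled_non_a_fraction(mixed)
excess = observed - background

print(f"background non-A fraction: {background:.3f}")
print(f"mixed-sample non-A fraction: {observed:.3f}")
print(f"excess over background: {excess:.3f}")
```

Pooling bases across reads weights long basecalled tails more heavily than short ones; a per-read average would be an equally reasonable choice with different trade-offs.
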