adnaniazi / tailfindr

An R package for estimating poly(A)-tail lengths in Oxford Nanopore RNA and DNA reads.
https://www.cbu.uib.no/valen/
GNU General Public License v3.0

invalid read type and tail_is_valid FALSE #27

Closed jon-xu closed 2 years ago

jon-xu commented 2 years ago

Hi Adnaniazi,

Could you please help me work out why I am getting an invalid read type and tail_is_valid FALSE, and therefore no tail length estimate? It's a cDNA sample.

The result file: https://cloudstor.aarnet.edu.au/plus/s/ml02t0AAm3MdSWO
One of the fast5 files: https://cloudstor.aarnet.edu.au/plus/s/wgsMP0NQs4pfTix

Many thanks, Jon

adnaniazi commented 2 years ago

Hi Jon,

Thanks for using tailfindr.

I assume that you are using tailfindr to find polyA/polyT lengths in cDNA. May I ask which protocol/kit you used to generate this cDNA?

Best, Adnan

jon-xu commented 2 years ago

Hi Adnan,

You are right, I used the wrong configuration file for basecalling.

Will try again and let you know.

Cheers, Jon

jon-xu commented 2 years ago

Hi Adnan,

After applying the correct configuration file, tailfindr works fine.

Thanks! Jon

adnaniazi commented 2 years ago

Great! Thanks for the update.

jon-xu commented 2 years ago

Hi Adnan,

Sorry, I was looking at the wrong result.

After using the correct configuration file for basecalling, there still seems to be a problem in the result: https://cloudstor.aarnet.edu.au/plus/s/u45OicPnx2ro2Og

And here is the sample fast5: https://cloudstor.aarnet.edu.au/plus/s/0oXhg7vO1tQnDRE

Thanks! Jon

jon-xu commented 2 years ago

Some reads have type "polyA" and some are still invalid. And for the polyA ones, tail_is_valid is FALSE... We used the SQK-PCB109 kit for the cDNA.

adnaniazi commented 2 years ago

Hi Jon,

SQK-PCB109 is not suitable for polyA/polyT profiling. This is because the polyT primer can anneal anywhere in the polyA stretch of the RNA (see this figure), so the estimated polyA/polyT length would mostly be an underestimate of the true polyA tail length.
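
To get a feel for the size of this effect, here is a minimal Python sketch. It assumes, purely for illustration, that a polyT primer of a hypothetical length `primer_len` anneals uniformly at random within the polyA stretch, and that the cDNA keeps only the part of the tail from the transcript body up to the far end of the primer. Neither the uniform model nor the 20 nt primer length comes from the kit documentation or from tailfindr.

```python
import random


def captured_tail_lengths(true_len, primer_len=20, trials=10_000, seed=0):
    """Simulate the tail length retained in cDNA under random primer annealing.

    Tail positions are numbered 1..true_len, starting at the transcript body.
    The primer covers positions start..start + primer_len - 1, so the cDNA
    retains start + primer_len - 1 tail bases (assumed toy model).
    """
    if primer_len > true_len:
        raise ValueError("primer cannot be longer than the tail")
    rng = random.Random(seed)
    lengths = []
    for _ in range(trials):
        # Assumption: every possible annealing position is equally likely.
        start = rng.randint(1, true_len - primer_len + 1)
        lengths.append(start + primer_len - 1)
    return lengths


lengths = captured_tail_lengths(true_len=150)
mean_len = sum(lengths) / len(lengths)
print(f"true tail: 150 nt, mean captured: {mean_len:.1f} nt, "
      f"longest captured: {max(lengths)} nt")
```

Under this toy model the captured length can never exceed the true length, and it is usually shorter, which is the sense in which the estimate is an underestimate.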

If you want correct estimates of polyA/polyT tails, then you have to use the SQK-PCS111 kit. This kit uses a special primer with an overhang, which ensures that the full polyA tail is amplified when the cDNA is created. tailfindr only works with this kit because there is no other Nanopore kit that can successfully amplify the cDNA with full-length polyA/T tails.

Best, Adnan

jon-xu commented 2 years ago

Understood! Thanks Adnan!

But will tailfindr still report a length estimate for SQK-PCB109 data, even though it might not be accurate?

Cheers, Jon

adnaniazi commented 2 years ago

Yes, it should work provided you specify the correct front and end primer sequences when calling tailfindr.

Please see section 5 of the tailfindr readme (5. Specifying custom cDNA primers). Just change the front primer (FP) and end primer (EP) sequences to whatever SQK-PCB109 uses and it will work, although the estimates won't be accurate.

jon-xu commented 2 years ago

Hi Adnan,

For the PCR-cDNA Barcoding Kit (SQK-PCB109), the top and bottom strands of the primer carry different flanking sequences:
5' - ATCGCCTACCGTGAC - barcode - ACTTGCCTGTCGCTCTATCTTC - 3'
5' - ATCGCCTACCGTGAC - barcode - TTTCTGTTGGTGCTGATATTGC - 3'

Which one is the FP and which one is the EP? I have tried using "ATCGCCTACCGTGAC" as the FP and "ACTTGCCTGTCGCTCTATCTTC" as the EP, but the result is the same as when I don't specify them.

Thanks! Jon

adnaniazi commented 2 years ago

Hi Jon,

I am not very familiar with this kit, but please refer to the diagram below to find the positions of the FP and EP:

[Image: cdna_construct]

So the FP is the sequence located immediately to the right of the 5' end of the mRNA-oriented strand. The EP is the sequence located immediately to the left of the reverse-complement cDNA. FP and EP are just names; what matters is the actual sequence of these two elements for your experiment/kit.

I have highlighted the FP and EP positions in green in the SQK-PCB109 protocol below.

[Image: Screenshot 2022-05-06 at 08 51 41]

Based on the information I have provided above, please find the correct sequences for FP and EP and then use those.
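
When comparing candidate sequences against the protocol diagram, it can help to have both orientations of each one at hand. The following is only a small Python helper sketch: it prints each flanking sequence quoted earlier in this thread alongside its reverse complement. It does not decide which sequence is the FP or the EP, and the labels in `candidates` are made up for readability.

```python
# Translation table for DNA complements (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Flanking sequences quoted in this thread; the labels are hypothetical.
candidates = {
    "shared 5' flank": "ATCGCCTACCGTGAC",
    "3' flank, first oligo": "ACTTGCCTGTCGCTCTATCTTC",
    "3' flank, second oligo": "TTTCTGTTGGTGCTGATATTGC",
}

for label, seq in candidates.items():
    print(f"{label}: {seq} | reverse complement: {reverse_complement(seq)}")
```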

Adnan

jon-xu commented 2 years ago

Thanks Adnan! After using the correct FP/EP, we got some results. But about 20% of the reads were marked as FALSE for "tail_is_valid". In your experience, is that too many, or is it normal?

adnaniazi commented 2 years ago

Yes, such a high number of invalid tails is normal, for two reasons:

  1. If the primer is a short sequence and contains too many indels, we cannot find it with high enough confidence, and in that case we consider the tail invalid.
  2. In poly(A) reads, the polyA tail is found at the very end of the read. Very often we do not find the primer that comes after the polyA tail, because the read terminates prematurely when the motor protein drops off as it reaches the end. In that case we classify the read as invalid, because the motor protein could have dropped off in the middle of the polyA stretch. Without a primer sequence next to the polyA tail, we cannot be sure whether we have captured the full-length tail. So any read whose polyA tail does not have a primer sequence next to it is classified as having an invalid tail.
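
To check how a run compares with the roughly 20% mentioned above, the share of invalid tails can be tallied per read type from the result CSV. This is a minimal Python sketch, not part of tailfindr. It assumes the CSV has columns named `read_type` and `tail_is_valid` with R-style `TRUE`/`FALSE`/`NA` values, and the sample rows below are made up for illustration.

```python
import csv
import io
from collections import Counter


def invalid_tail_fraction(csv_file):
    """Return {read_type: fraction of reads whose tail is not marked valid}.

    Anything other than TRUE in tail_is_valid (FALSE, NA, empty) is counted
    as not valid. Column names are assumptions about the result file.
    """
    totals, invalid = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        read_type = row["read_type"]
        totals[read_type] += 1
        if row["tail_is_valid"].strip().upper() != "TRUE":
            invalid[read_type] += 1
    return {rt: invalid[rt] / totals[rt] for rt in totals}


# Made-up sample standing in for a real result file.
sample = io.StringIO(
    "read_id,read_type,tail_is_valid,tail_length\n"
    "r1,polyA,TRUE,85.2\n"
    "r2,polyA,FALSE,40.1\n"
    "r3,polyT,TRUE,120.7\n"
    "r4,polyT,TRUE,63.0\n"
    "r5,invalid,NA,NA\n"
)
fractions = invalid_tail_fraction(sample)
for read_type, frac in fractions.items():
    print(f"{read_type}: {frac:.0%} not valid")
```

For a real run, pass an open file handle instead of `sample`, for example `open("results.csv", newline="")`, where the file name is a placeholder.
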
jon-xu commented 2 years ago

Thank you very much!!