PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

identify transcripts with polyA tails #689

Closed ZachLW closed 3 months ago

ZachLW commented 4 months ago

Hi all,

Thank you very much for developing these helpful packages! At the 'isoseq refine' step I set the 'require-polyA' parameter, so I assume the refined molecules should have polyA tails. At the 'pigeon classify' step I also passed '--poly-a polyA.list' (polyA.list downloaded from the link provided at https://isoseq.how/), so there is a 'polyA motif' column in the transcript annotation file generated by pigeon. There are many NAs in the polyA motif column, so I'm wondering whether this suggests that these transcripts don't have polyA tails, or that they have non-canonical polyA tails that are not included in the polyA list provided. If they don't have polyA tails, how did these molecules pass the 'require-polyA' parameter in the 'isoseq refine' step? BTW, does the name of these molecules, full-length tagged non-concatemer reads (FLTNC reads), mean they are full-length transcripts? Thanks in advance for any help!

Kind regards, Zach

Magdoll commented 3 months ago

Hi @ZachLW - sorry for the confusion. The two polyA options mean different things here.

In isoseq refine, --require-polyA keeps only FL reads that have a polyA tail (a stretch of As at the 3' end) and then removes that tail. So after isoseq refine, the reads (now called FLNC reads) have been stripped of the 5' / 3' cDNA primers and the polyA tail, leaving just the transcript insert itself.
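To make the trimming idea concrete, here is a minimal Python sketch of "require a tail, then strip it". It is only an illustration, not the actual isoseq refine code: the function name `trim_poly_a` and the 20-base minimum are assumptions made for this example, and a real implementation would also need to tolerate sequencing errors inside the tail.

```python
from typing import Optional


def trim_poly_a(seq: str, min_len: int = 20) -> Optional[str]:
    """Strip a trailing run of A's from an upper-case read sequence.

    Returns the trimmed sequence, or None when the trailing run is
    shorter than min_len (i.e. the read would be discarded).
    min_len=20 is an assumption chosen for this sketch.
    """
    stripped = seq.rstrip("A")
    tail_len = len(seq) - len(stripped)
    if tail_len < min_len:
        return None  # no usable polyA tail: drop the read
    return stripped


# A read with a 25-base tail keeps only its insert.
print(trim_poly_a("ACGT" + "A" * 25))
# A read without a tail is rejected.
print(trim_poly_a("ACGTACGT"))
```

The point is that the output reads no longer carry the tail, so nothing downstream can "see" it in the read sequence itself.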

In pigeon classify, --poly-a actually looks for a polyadenylation signal: a 6-mer motif in the genomic region just upstream of the 3' end of an mRNA transcript. This is the paper that best describes it: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310884/

The common human 6-mer polyA list is here: https://downloads.pacbcloud.com/public/dataset/Kinnex-single-cell-RNA/REF-pigeon_ref_sets/Human_hg38_Gencode_v39/polyA.list.txt
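As a rough illustration of the motif lookup described above, the sketch below scans a stretch of genomic sequence that ends at the transcript's 3' end for any 6-mer from a motif list. It is not pigeon's implementation: the function name `find_polya_motif`, the rule that the first motif in list order wins, and the choice of how much upstream sequence to pass in are all assumptions for this example.

```python
from typing import Optional, Sequence, Tuple


def find_polya_motif(
    upstream: str, motifs: Sequence[str]
) -> Optional[Tuple[str, int]]:
    """Search genomic sequence just upstream of a transcript's 3' end.

    upstream: genomic bases in transcript orientation, ending at the 3' end.
    motifs:   candidate 6-mers, tried in list order (assumed priority).
    Returns (motif, offset), where offset is the motif start relative to
    the 3' end (negative = upstream), or None if no listed motif is found.
    """
    upstream = upstream.upper()
    for motif in motifs:
        pos = upstream.rfind(motif.upper())  # occurrence closest to the 3' end
        if pos != -1:
            return motif, pos - len(upstream)
    return None


motifs = ["AATAAA", "ATTAAA"]  # two well-known signals, for illustration
window = "GGGCCC" + "AATAAA" + "GC" * 10
print(find_polya_motif(window, motifs))      # motif found upstream of the end
print(find_polya_motif("GCGCGCGC", motifs))  # None: no listed motif present
```

Under this reading, an NA in the polyA motif column would simply mean that none of the listed 6-mers was found near the transcript end in the genome, which says nothing about whether the original read had a polyA tail.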

Hope this helps!