Question about exon counting

weib3 commented 2 weeks ago

Hello，

Thanks for developing such a great tool. I am using IsoQuant to analyze ONT single-cell full-length transcriptome data and would like to obtain exon count information to calculate PSI. In the *.exon_grouped_counts.tsv file, I only find counts of known exons and no novel exons. Where can I obtain such information, or is there a parameter I can use to control this?

My command is below: isoquant.py --reference GRCh38.fa \ --genedb gencode.v46.primary_assembly.annotation.gtf \ --bam $bam \ --read_group tag:BC \ --count_exons \ --complete_genedb \ --threads 1 \ --data_type nanopore -o $outdir

Additionally, my data comes from ScNaUmi-seq. I analysis raw ONT data with SiCeLoRe to get consensus reads, and IsoQuant to get exon count. However, in the IsoQuant log files, there has an warning: polyA tail detected in 381095 (4.0%). PolyA percentage is suspiciously low. Is this normal? Or I need to rerun with --polya_requirement never?

Wei

andrewprzh commented 2 weeks ago

Dear @weib3

Thank you for your feedback!

Yes, --count_exons only counts annotated exons. Unfortunately, at the moment there is no functionality that would include novel exons. Since you have already processed the data, I recommend to simply run IsoQuant again, providing OUT.extended_annotation.gtf from the previous run as an input. You can also set --no_model_construction since you have already constructed novel isoforms.

Regarding consensus reads. IsoQuant can process raw reads, there is no need to search for consensus in general. I'm not really aware of the protocol you have mentioned, but I guess it is designed to get consensus reads. If your consensus reads are accurate enough you may try using -data_type pacbio as well (although I don't expect dramatic changes).

As to polyA tails, of course, it's nice to keep them as they support transcript end positions. However, by default IsoQuant automatically decides whether to use them or not. In this case, polyA requirement should be set to "never" automatically. If you are sure consensus reads represent full molecules, you can also try using --fl_data flag. Overall, I think your analysis is fine anyway.

Best Andrey

weib3 commented 2 weeks ago

Hi @andrewprzh ,

Very thanks for your quick and detailed reply. Run IsoQuant twice seems to be a good solution. In addition, I will seriously consider your suggestions about the -data_type pacbio and --fl_data parameters.

Best Wei

ablab / IsoQuant

Question about exon counting #205