ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/
Other
144 stars 13 forks source link

Interpreting number of novel and known counts #124

Closed sparthib closed 4 months ago

sparthib commented 9 months ago

Hi @andrewprzh , based on a previous issue I understand that:

OUT.transcript_counts.tsv contains count of all known transcripts OUT.transcript_model_counts.tsv contains counts of all novel transcripts and OUT.transcript_models.gtf contains information of novel+known transcripts in the sample.

Now, when I look up number of transcripts in my OUT.transcript_models.gtf, (I tried awk '$3 == "transcript"{ count++ } END { print count }' OUT.transcript_models.gtf ), the number of transcripts is equal to the number of novel transcripts, i.e. information of known transcripts is missing, I am not sure if this is a bug or an update, or if I am misunderstanding something. Would appreciate your help with this. Attached my log file here.

Thanks! isoquant-1.log

andrewprzh commented 9 months ago

Dear @sparthib

OUT.transcript_counts.tsv contains count of all known transcripts

This is right!

OUT.transcript_model_counts.tsv contains counts of all novel transcripts

Not exactly, it contains both novel and expressed known transcripts.

OUT.transcript_models.gtf contains information of novel+known transcripts in the sample.

This GTF contains novel + expressed known. All known + novel are stored in OUT.extended_annoatation.gtf.

What worries me is that you log says you have 0 novel transcripts reported. This is quite weird considering the number of reads you have. By the default, IsoQuant does not report novel mono-exonic transcripts for ONT data (there is an --report_novel_unspliced option for that). However, I'd still expect some novel spliced isoforms to be reported. It also looks a bit suspicious that the gene database was converted so quickly, it usually takes hours.

Could you send me the input and output GTFs files as well?

Best Andrey

sparthib commented 9 months ago

thank you for your quick response @andrewprzh ,

hmm, the number of transcripts under OUT.transcript_model_counts.tsv is lesser than OUT.transcript_counts.tsv for some reason. they don't overlap OUT.transcript_counts.csv OUT.transcript_model_counts.csv I'm attaching them here as well.

Attaching my output gtf here in txt format, my input gtf is too big even compressed, so I'll attach a subset in a following comment. OUT.transcript_models_gtf.txt

Thanks!

sparthib commented 9 months ago

this is my reference gtf file

It's the basic gene annotation on the primary assembly (chromosomes and scaffolds) here https://www.gencodegenes.org/human/

sparthib commented 9 months ago

@andrewprzh , I checked the .db to see if there was something weird with it, the features table seems okay,

Checking the top 2 rows:

sqlite> select * from features limit 2;
ENSG00000290825.1|chr1|HAVANA|gene|11869|14409|.|+|.|{"gene_id":["ENSG00000290825.1"],"gene_type":["lncRNA"],"gene_name":["DDX11L2"],"level":["2"],"tag":["overlaps_pseudogene"]}|[]|4681
ENST00000456328.2|chr1|HAVANA|transcript|11869|14409|.|+|.|{"gene_id":["ENSG00000290825.1"],"transcript_id":["ENST00000456328.2"],"gene_type":["lncRNA"],"gene_name":["DDX11L2"],"transcript_type":["lncRNA"],"transcript_name":["DDX11L2-202"],"level":["2"],"transcript_support_level":["1"],"tag":["basic","Ensembl_canonical"],"havana_transcript":["OTTHUMT00000362751.1"]}|[]|4681

Getting the number of rows, which is the same as my input gtf:

sqlite> select count(*) from features;
1999109

Let me know if there is anything else I can look into in the db.

Thanks,

Sowmya

sparthib commented 9 months ago

Hi @andrewprzh ,

I am trying to understand the isoquant.log file better. What does it exactly mean when it says processing chromosome ... For the above reads aligned with a different fasta file (GENCODE transcript sequences) using minimap2, my current run looks something like this:

2023-12-03 10:28:20,430 - INFO - Collecting read alignments
2023-12-03 10:29:04,256 - INFO - Processing chromosome ENST00000674361.1|ENSG00000241743.4|OTTHUMG00000022220.5|OTTHUMT00000530527.1|XACT-203|XACT|347561|lncRNA|
2023-12-03 10:29:04,529 - INFO - Processing chromosome ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|HELLPAR-201|HELLPAR|205012|lncRNA|

These are clearly not chromosomes but rather transcript IDs, so I don't know if it makes sense.

andrewprzh commented 9 months ago

Dear @sparthib

As you pointed out in another thread, IsoQuant needs a reference genome, so the results with transcriptome can be really weird and may not make a lot of sense.

Probably I should've caught that earlier :)

Best Andrey

sparthib commented 9 months ago

Hi @andrewprzh ,

Thank you for your response, the initial log file I shared was run on a reference genome, I wasn't sure if that was the issue so later I tested it out with the reference transcriptome. So the initial issue still exists.

andrewprzh commented 9 months ago

Dear @sparthib

Yes, indeed, the counts overlap, but also contain a lot of distinct isoforms. Also, not novel isoforms were reported during the run, which sounds weird too.

Could you possibly send me a small part of the bam file, e.g. some small chromosome? I guess it's quite hard to debug this without initial data.

Best Andrey

sparthib commented 7 months ago

Hi @andrewprzh sorry about the delayed response, attaching a sub sample bam file here. Thanks! sample.bam.gz

biozhangzhou commented 7 months ago

Dear @sparthib

OUT.transcript_counts.tsv contains count of all known transcripts

This is right!

OUT.transcript_model_counts.tsv contains counts of all novel transcripts

Not exactly, it contains both novel and expressed known transcripts.

OUT.transcript_models.gtf contains information of novel+known transcripts in the sample.

This GTF contains novel + expressed known. All known + novel are stored in OUT.extended_annoatation.gtf.

What worries me is that you log says you have 0 novel transcripts reported. This is quite weird considering the number of reads you have. By the default, IsoQuant does not report novel mono-exonic transcripts for ONT data (there is an --report_novel_unspliced option for that). However, I'd still expect some novel spliced isoforms to be reported. It also looks a bit suspicious that the gene database was converted so quickly, it usually takes hours.

Could you send me the input and output GTFs files as well?

Best Andrey

Which kind of "expressed known transcripts" would be included in the OUT.transcript_model_counts.tsv file ? And, could you please provide more details about each result file ? Thanks a lot.

andrewprzh commented 7 months ago

@biozhangzhou

Which kind of "expressed known transcripts" would be included in the OUT.transcript_model_counts.tsv file ?

Transcripts that are have enough read support are reported in OUT.transcript_models.gtf. OUT.transcript_model_counts.tsv contains count information for all the transcript in this GTF.

And, could you please provide more details about each result file ?

Details on IsoQuant output is provided in the user manual https://github.com/ablab/IsoQuant?tab=readme-ov-file#sec3.3 Please open new issue if something is unclear.

Best Andrey

biozhangzhou commented 7 months ago

Thanks.

andrewprzh commented 4 months ago

New IsoQuant 3.4.1 is released. Reference transcript counts and transcript model counts should way more similar in this version, so closing this issue for now.