Closed sparthib closed 4 months ago
Dear @sparthib
OUT.transcript_counts.tsv contains count of all known transcripts
This is right!
OUT.transcript_model_counts.tsv contains counts of all novel transcripts
Not exactly, it contains both novel and expressed known transcripts.
OUT.transcript_models.gtf contains information of novel+known transcripts in the sample.
This GTF contains novel + expressed known. All known + novel are stored in OUT.extended_annoatation.gtf
.
What worries me is that you log says you have 0 novel transcripts reported. This is quite weird considering the number of reads you have. By the default, IsoQuant does not report novel mono-exonic transcripts for ONT data (there is an --report_novel_unspliced
option for that). However, I'd still expect some novel spliced isoforms to be reported.
It also looks a bit suspicious that the gene database was converted so quickly, it usually takes hours.
Could you send me the input and output GTFs files as well?
Best Andrey
thank you for your quick response @andrewprzh ,
hmm, the number of transcripts under OUT.transcript_model_counts.tsv
is lesser than OUT.transcript_counts.tsv
for some reason. they don't overlap
OUT.transcript_counts.csv
OUT.transcript_model_counts.csv
I'm attaching them here as well.
Attaching my output gtf here in txt format, my input gtf is too big even compressed, so I'll attach a subset in a following comment. OUT.transcript_models_gtf.txt
Thanks!
It's the basic gene annotation on the primary assembly (chromosomes and scaffolds) here https://www.gencodegenes.org/human/
@andrewprzh ,
I checked the .db to see if there was something weird with it, the features
table seems okay,
Checking the top 2 rows:
sqlite> select * from features limit 2;
ENSG00000290825.1|chr1|HAVANA|gene|11869|14409|.|+|.|{"gene_id":["ENSG00000290825.1"],"gene_type":["lncRNA"],"gene_name":["DDX11L2"],"level":["2"],"tag":["overlaps_pseudogene"]}|[]|4681
ENST00000456328.2|chr1|HAVANA|transcript|11869|14409|.|+|.|{"gene_id":["ENSG00000290825.1"],"transcript_id":["ENST00000456328.2"],"gene_type":["lncRNA"],"gene_name":["DDX11L2"],"transcript_type":["lncRNA"],"transcript_name":["DDX11L2-202"],"level":["2"],"transcript_support_level":["1"],"tag":["basic","Ensembl_canonical"],"havana_transcript":["OTTHUMT00000362751.1"]}|[]|4681
Getting the number of rows, which is the same as my input gtf:
sqlite> select count(*) from features;
1999109
Let me know if there is anything else I can look into in the db.
Thanks,
Sowmya
Hi @andrewprzh ,
I am trying to understand the isoquant.log
file better. What does it exactly mean when it says processing chromosome ...
For the above reads aligned with a different fasta file (GENCODE transcript sequences) using minimap2, my current run looks something like this:
2023-12-03 10:28:20,430 - INFO - Collecting read alignments
2023-12-03 10:29:04,256 - INFO - Processing chromosome ENST00000674361.1|ENSG00000241743.4|OTTHUMG00000022220.5|OTTHUMT00000530527.1|XACT-203|XACT|347561|lncRNA|
2023-12-03 10:29:04,529 - INFO - Processing chromosome ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|HELLPAR-201|HELLPAR|205012|lncRNA|
These are clearly not chromosomes but rather transcript IDs, so I don't know if it makes sense.
Dear @sparthib
As you pointed out in another thread, IsoQuant needs a reference genome, so the results with transcriptome can be really weird and may not make a lot of sense.
Probably I should've caught that earlier :)
Best Andrey
Hi @andrewprzh ,
Thank you for your response, the initial log file I shared was run on a reference genome, I wasn't sure if that was the issue so later I tested it out with the reference transcriptome. So the initial issue still exists.
Dear @sparthib
Yes, indeed, the counts overlap, but also contain a lot of distinct isoforms. Also, not novel isoforms were reported during the run, which sounds weird too.
Could you possibly send me a small part of the bam file, e.g. some small chromosome? I guess it's quite hard to debug this without initial data.
Best Andrey
Hi @andrewprzh sorry about the delayed response, attaching a sub sample bam file here. Thanks! sample.bam.gz
Dear @sparthib
OUT.transcript_counts.tsv contains count of all known transcripts
This is right!
OUT.transcript_model_counts.tsv contains counts of all novel transcripts
Not exactly, it contains both novel and expressed known transcripts.
OUT.transcript_models.gtf contains information of novel+known transcripts in the sample.
This GTF contains novel + expressed known. All known + novel are stored in
OUT.extended_annoatation.gtf
.What worries me is that you log says you have 0 novel transcripts reported. This is quite weird considering the number of reads you have. By the default, IsoQuant does not report novel mono-exonic transcripts for ONT data (there is an
--report_novel_unspliced
option for that). However, I'd still expect some novel spliced isoforms to be reported. It also looks a bit suspicious that the gene database was converted so quickly, it usually takes hours.Could you send me the input and output GTFs files as well?
Best Andrey
Which kind of "expressed known transcripts" would be included in the OUT.transcript_model_counts.tsv file ? And, could you please provide more details about each result file ? Thanks a lot.
@biozhangzhou
Which kind of "expressed known transcripts" would be included in the OUT.transcript_model_counts.tsv file ?
Transcripts that are have enough read support are reported in OUT.transcript_models.gtf. OUT.transcript_model_counts.tsv contains count information for all the transcript in this GTF.
And, could you please provide more details about each result file ?
Details on IsoQuant output is provided in the user manual https://github.com/ablab/IsoQuant?tab=readme-ov-file#sec3.3 Please open new issue if something is unclear.
Best Andrey
Thanks.
New IsoQuant 3.4.1 is released. Reference transcript counts and transcript model counts should way more similar in this version, so closing this issue for now.
Hi @andrewprzh , based on a previous issue I understand that:
OUT.transcript_counts.tsv
contains count of all known transcriptsOUT.transcript_model_counts.tsv
contains counts of all novel transcripts andOUT.transcript_models.gtf
contains information of novel+known transcripts in the sample.Now, when I look up number of transcripts in my
OUT.transcript_models.gtf
, (I triedawk '$3 == "transcript"{ count++ } END { print count }' OUT.transcript_models.gtf
), the number of transcripts is equal to the number of novel transcripts, i.e. information of known transcripts is missing, I am not sure if this is a bug or an update, or if I am misunderstanding something. Would appreciate your help with this. Attached my log file here.Thanks! isoquant-1.log