Closed houlinyu closed 6 months ago
@houlinyu
In case you provided the reference annotation (which I presume you did), it seems that the problem can be in chromosome naming. Do chromosome names in your fasta and gtf files match? Could you send me the isoquant.log file?
Best Andrey
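A quick way to check the naming question above is to compare the sequence names in the FASTA headers with the first column of the GTF. The following is a minimal sketch, not part of IsoQuant; the function names are hypothetical, and it assumes a plain-text (uncompressed) FASTA and a tab-separated GTF.

```python
def fasta_names(lines):
    """Return the set of sequence names (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_seqnames(lines):
    """Return the set of values in column 1 of GTF lines, skipping comments and blanks."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def compare_names(fasta_lines, gtf_lines):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_seqnames(gtf_lines)
    return sorted(gtf - fa), sorted(fa - gtf)
```

Both arguments can be open file handles, for example `compare_names(open("genome.fasta"), open("annotation.gtf"))` with placeholder file names. Names that appear only in the GTF are the ones to look at first, since annotated genes on them have no matching reference sequence.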
isoquant.log @andrewprzh many thanks
@andrewprzh the 'chromosome' names match, but as you may already know, in the SIRVs (SIRV-Set4) some genes are recorded as their own chromosome.
thanks, Houlin
@houlinyu
Thanks for the data, now I see the problem.
You have used the --complete_genedb option, but the annotation contains only exon entries. --complete_genedb should only be used when the GTF file has transcript and gene features too.
To fix that, I suggest removing the existing gene database
/home/jupyter/projects/iso_quantification_benchmark/sirvs_from_raw/processed_reads/Isoquant_OUT/td_tso/SIRV_ERCC_longSIRV_multi-fasta_20210507.db
and re-running IsoQuant without the --complete_genedb option.
Hope that helps!
Best Andrey
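To see in advance whether an annotation contains only exon entries, one can count the feature types in the third GTF column. This is a minimal sketch with hypothetical function names; the "gene" and "transcript" labels follow the description above, and the check is an illustration rather than what IsoQuant does internally.

```python
from collections import Counter


def feature_counts(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def has_gene_and_transcript_features(gtf_lines):
    """True only if the GTF has at least one 'gene' and one 'transcript' record."""
    counts = feature_counts(gtf_lines)
    return counts["gene"] > 0 and counts["transcript"] > 0
```

If `has_gene_and_transcript_features` returns False, as it would for an exon-only annotation, the advice above suggests running without --complete_genedb.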
One more note: we made the simplest possible simulation and ran IsoQuant, feeding in the input (SIRV_ERCC_longSIRV_multi-fasta_20210507.transcript.fa, transcript sequences extracted with the tool gffread from SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta and SIRV_ERCC_longSIRV_multi-fasta_20210507.gtf) and the genome reference (SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta). There are 176 SIRV reads corresponding to 176 SIRV transcripts (1 read per transcript). The result again showed all reads classified as intergenic.
@andrewprzh Got it! thanks for the suggestions!
Thanks, Houlin
@andrewprzh the results after removing the existing gene database and not using --complete_genedb make much more sense now. However, in our simplest simulation exercise, IsoQuant still failed to quantify any isoforms from the LongSIRV category. I ran the --no_model_construction mode, and here is the end of the IsoQuant log. I would appreciate further discussion.
848 - INFO - Read assignment statistics
848 - INFO - ambiguous: 1
848 - INFO - intergenic: 15
849 - INFO - unique: 160
856 - INFO - Processed sample OUT
856 - INFO - Processed 1 sample
856 - INFO - === IsoQuant pipeline finished ===
The "intergenic: 15" count corresponds to the 15 long SIRVs.
One thought: since those long SIRVs are longer than the others, does IsoQuant require a minimum read support for them, even in the --no_model_construction mode?
@houlinyu
Read assignment is performed individually for each read, so no minimal support is required. Previously, I have not seen any problems with long isoforms.
Could you please send me the entire output folder and the BAM files so I can investigate the issue?
Best Andrey
@andrewprzh
Many thanks for your quick responses.
Here are the outputs from the folder, which also include the bam files. Isoquant Output from a simple-simulation data (1 full-length read/ transcript, no sequence errors)
Thanks, Houlin
@houlinyu
As I anticipated, the problem relates to gffutils, which IsoQuant uses for GTF processing. Somehow it does not create gene entries for long SIRV exons. IsoQuant then iterates over genes and cannot find any genes on long SIRV "chromosomes", hence all alignments are marked as intergenic.
I'll dig a bit more to figure out whether this can be easily fixed without reporting the problem to gffutils authors.
Best Andrey
@andrewprzh Andrey, thanks for the response- it makes sense. Hope this can be easily fixed. Thanks, Houlin
@houlinyu
Now I got it. I think I might have seen this before. Gffutils freaks out when gene_id and transcript_id are equal (I guess because it creates an SQL database and all keys must be unique). This was exactly the case for long SIRV entries.
I attached a slightly modified GTF that should work just fine: SIRV_ERCC_longSIRV_multi-fasta_20210507.zip. Just don't forget to remove the old db file :)
I checked on your BAM file and got:
2024-02-21 00:29:45,487 - INFO - ambiguous: 1
2024-02-21 00:29:45,487 - INFO - unique: 175
I'll add some additional checks in future releases to warn users in case something like this happens.
Best Andrey
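Until such checks are available, records where gene_id and transcript_id are equal can be found with a few lines of Python before building the database. This is a minimal sketch under stated assumptions: the function name is hypothetical, attributes are parsed with a simple regular expression that expects the usual GTF form key "value";, and it only reports the clashes rather than fixing them.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def equal_gene_transcript_ids(gtf_lines):
    """Return the sorted IDs for which gene_id equals transcript_id on some record."""
    clashes = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id == attrs.get("transcript_id"):
            clashes.add(gene_id)
    return sorted(clashes)
```

Any ID returned here is a candidate for renaming, for instance by giving the transcript a distinct ID, which is one possible way to produce a modified GTF like the one mentioned above.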
@andrewprzh
Sounds great! Many thanks for your help.
Best, Houlin
Would the gffutils key issue also be a problem when a reference sequence has the same name as a gene? The IsoQuant traceback is below.
This line is taken from the genome .fai file to show that the ERCC spike-in sequence is present in the genome fasta file:
ERCC-00033 2014 2228483004 100 101
and here is the matching GFF3 record:
ERCC-00033 ERCC gene 1 2013 . + . ID=ERCC-00033;Note="RNA spike-in"
ERCC-00033 ERCC mRNA 1 2013 . + . ID=ERCC-00033.m1;Parent=ERCC-00033;Note="RNA spike-in"
ERCC-00033 ERCC exon 1 2013 . + . ID=ERCC-00033.m1.e1;Parent=ERCC-00033.m1
Traceback:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "python3.8/concurrent/futures/process.py", line 198, in
@petersbrC
The error message looks odd. I don't think identical IDs for a FASTA record and a gene name should cause a problem. Could you send me the BAM file and the reference files if possible?
Best Andrey
I added GTF consistency checks in IsoQuant 3.4, so that such issues could be detected easily.
We have been running IsoQuant in reference-based mode on PacBio-sequenced SIRV samples and encountered two issues: 1. all isoforms are reported as novel, although gffcompare shows that most of them match transcripts in the reference, and 2. all reads are classified as intergenic against the reference. We also ran the data in reference-based quantification-only mode, and it does not assign any reads to the reference transcripts. We would appreciate any hints or further discussion. We are using the latest version, 3.3.1.