SIRVs data - Githubissues

houlinyu commented 9 months ago

we have been running Isoquant, the reference-based mode on Pacbio sequenced SIRVs samples, and encountered an issue that 1. isoform reported all as novel isoforms, but we can find most of them match the transcripts in the reference using gffcompare, and 2. all the reads are classified as intergenic against the reference. We also ran the data in the reference-based quantification-only mode, and it does not count any reads into the transcript reference. Appreciate any hint and/or further discussion. We are using the most latest version 3.3.1.

andrewprzh commented 9 months ago

@houlinyu

In case you provided the reference annotation (which I presume you did), it seems that the problem can be in chromosome naming. Do chromosome names in your fasta and gtf files match? Could you send me the isoquant.log file?

Best Andrey

houlinyu commented 9 months ago

isoquant.log @andrewprzh many thanks

houlinyu commented 9 months ago

@houlinyu

In case you provided the reference annotation (which I presume you did), it seems that the problem can be in chromosome naming. Do chromosome names in your fasta and gtf files match? Could you send me the isoquant.log file?

Best Andrey

@andrewprzh the 'chromosome' names match, but as you may already know, in SIRVs SIRV-Set4 there are records of one gene as one chromosome.

thanks, Houlin

andrewprzh commented 9 months ago

@houlinyu Thanks for the data, now I see the problem. You have used --complete_genedb option, but the annotation contains only exon entries. --complete_genedb should only be used when the GTF file has transcript and gene features too.

To fix that I suggest to remove existing gene database

/home/jupyter/projects/iso_quantification_benchmark/sirvs_from_raw/processed_reads/Isoquant_OUT/td_tso/SIRV_ERCC_longSIRV_multi-fasta_20210507.db

and re-run IsoQuant without --complete_genedb option.

Hope that helps!

Best Andrey

houlinyu commented 9 months ago

Add another note: we made the simplest simulation and ran Isoquant- feeding in the input (SIRV_ERCC_longSIRV_multi-fasta_20210507.transcript.fa, transcript sequences extracted by tool gffread using SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta and SIRV_ERCC_longSIRV_multi-fasta_20210507.gtf) and genome reference (SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta). there are 176 SIRV reads corresponding to 176 SIRV transcripts (1 read per transcript). the result showed again all reads were classified as intergenic.

houlinyu commented 9 months ago

@houlinyu Thanks for the data, now I see the problem. You have used --complete_genedb option, but the annotation contains only exon entries. --complete_genedb should only be used when the GTF file has transcript and gene features too.

To fix that I suggest to remove existing gene database
/home/jupyter/projects/iso_quantification_benchmark/sirvs_from_raw/processed_reads/Isoquant_OUT/td_tso/SIRV_ERCC_longSIRV_multi-fasta_20210507.db
and re-run IsoQuant without --complete_genedb option.

Hope that helps!

Best Andrey

@andrewprzh Got it! thanks for the suggestions!

Thanks, Houlin

houlinyu commented 9 months ago

Add another note: we made the simplest simulation and ran Isoquant- feeding in the input (SIRV_ERCC_longSIRV_multi-fasta_20210507.transcript.fa, transcript sequences extracted by tool gffread using SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta and SIRV_ERCC_longSIRV_multi-fasta_20210507.gtf) and genome reference (SIRV_ERCC_longSIRV_multi-fasta_20210507.fasta). there are 176 SIRV reads corresponding to 176 SIRV transcripts (1 read per transcript). the result showed again all reads were classified as intergenic.

@andrewprzh the results after removing the existing gene database and not using --complete_genedb make much more sense now. however, in our simplest simulation exercise. Isoquant still failed to quantify any isoforms from the LongSIRV category. I ran the --no_model_construction mode and here is the end of the Isoquant Log. Appreciate further discussion.

848 - INFO - Read assignment statistics 848 - INFO - ambiguous: 1 848 - INFO - intergenic: 15 849 - INFO - unique: 160 856 - INFO - Processed sample OUT 856 - INFO - Processed 1 sample 856 - INFO - === IsoQuant pipeline finished ===

intergenic: 15 - this belongs to 15 long SIRVs.

One thought would be those long SIRVs are longer than others and Isoquant requires the minimum read support for these long ones, even in the --no_model_construction mode?

andrewprzh commented 9 months ago

@houlinyu

Read assignment is performed individually for each read, so no minimal support is required. Previously, I have not seen any problems with long isoforms.

Could you please send me the entire output folder and the BAM files so I can investigate the issue?

Best Andrey

houlinyu commented 9 months ago

@andrewprzh

Many thanks for your quick responses.

Here are the outputs from the folder, which also include the bam files. Isoquant Output from a simple-simulation data (1 full-length read/ transcript, no sequence errors)

Thanks, Houlin

andrewprzh commented 9 months ago

@houlinyu

As I anticipated, the problem relates to gffutils, which IsoQuant uses for GTF processing. Somehow it does not create gene entries for long SIRV exons. IsoQuant then iterates over genes and cannot find any genes on long SIRV "chromosomes", hence all alignments are marked as intergenic.

I'll dig a bit more to figure out whether this can be easily fixed without reporting the problem to gffutils authors.

Best Andrey

houlinyu commented 9 months ago

@andrewprzh Andrey, thanks for the response- it makes sense. Hope this can be easily fixed. Thanks, Houlin

andrewprzh commented 9 months ago

@houlinyu

Now I got it. I think I might have seen this before. Gffutils freaks out when gene_id and transcript_id are equal (I guess because it creates an SQL database and all keys must be unique). This was exactly the case for long SIRV entries.

I attached a slightly modified GTF that should work just fine. SIRV_ERCC_longSIRV_multi-fasta_20210507.zip Just don't forget to remove old db file :)

I checked on your BAM file and got 2024-02-21 00:29:45,487 - INFO - ambiguous: 1 2024-02-21 00:29:45,487 - INFO - unique: 175

I'll add some additional checks in the future releases to warn users in case something like this happens.

Best Andrey

houlinyu commented 9 months ago

@andrewprzh

Sounds great! Many thanks for your help.

Best, Houlin

petersbrC commented 9 months ago

Would the gff3utils key issue be the problem as well when a reference sequence has the same name as a gene? The isoquant traceback is below.

Taken from the genome .fai file to show the ERCC spike in sequence is present in the genome fasta file ERCC-00033 2014 2228483004 100 101 and here is the matching GFF3 record ERCC-00033 ERCC gene 1 2013 . + . ID=ERCC-00033;Note="RNA spike-in" ERCC-00033 ERCC mRNA 1 2013 . + . ID=ERCC-00033.m1;Parent=ERCC-00033;Note="RNA spike-in" ERCC-00033 ERCC exon 1 2013 . + . ID=ERCC-00033.m1.e1;Parent=ERCC-00033.m1

Traceback:

concurrent.futures.process._RemoteTraceback: 3866 """ 3867 Traceback (most recent call last): 3868 File "python3.8/concurrent/futures/process.py", line 239, in _process_worker 3869 r = call_item.fn(*call_item.args, *call_item.kwargs) 3870 File "python3.8/concurrent/futures/process.py", line 198, in _process_chunk 3871 return [fn(args) for args in chunk] 3872 File "python3.8/concurrent/futures/process.py", line 198, in 3873 return [fn(*args) for args in chunk] 3874 File "isoquant-3.3.1-0/src/dataset_processor.py", line 109, in collect_reads_in_parallel 3875 AlignmentCollector(chr_id, bam_file_pairs, args, gffutils_db, current_chr_record, read_grouper) 3876 File "isoquant-3.3.1-0/src/alignment_processor.py", line 228, in init 3877 self.bam_pairs[0][0].get_reference_length(self.chr_id), 3878 File "pysam/libcalignmentfile.pyx", line 1914, in pysam.libcalignmentfile.AlignmentFile.get_reference_length 3879 File "pysam/libcalignmentfile.pyx", line 507, in pysam.libcalignmentfile.AlignmentHeader.get_reference_length 3880 KeyError: 'unknown reference ERCC-00033' 3881 """ 3882 3883 The above exception was the direct cause of the following exception: 3884 3885 Traceback (most recent call last): 3886 File "isoquant.py", line 698, in 3887 main(sys.argv[1:]) 3888 File "isoquant.py", line 692, in main 3889 run_pipeline(args) 3890 File "isoquant.py", line 645, in run_pipeline 3891 dataset_processor.process_all_samples(args.input_data) 3892 File "isoquant-3.3.1-0/src/dataset_processor.py", line 368, in process_all_samples 3893 self.process_sample(sample) 3894 File "isoquant-3.3.1-0/src/dataset_processor.py", line 392, in process_sample 3895 self.collect_reads(sample) 3896 File "isoquant-3.3.1-0/src/dataset_processor.py", line 449, in collect_reads 3897 for storage, read_groups in results: 3898 File "python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists 3899 for element in iterable: 3900 File "python3.8/concurrent/futures/_base.py", line 619, in result_iterator 3901 yield fs.pop().result() 3902 File "python3.8/concurrent/futures/_base.py", line 437, in result 3903 return self.get_result() 3904 File "python3.8/concurrent/futures/_base.py", line 389, in get_result 3905 raise self._exception 3906 KeyError: 'unknown reference ERCC-00033' 3907

andrewprzh commented 9 months ago

@petersbrC

The error message looks odd, I don't thing identical IDs of a FASTA record and a gene name should cause the problem... Could you send me the BAM file and the reference files if possible?

Best Andrey

andrewprzh commented 6 months ago

I added GTF consistency checks in IsoQuant 3.4, so that such issues could be detected easily.

ablab / IsoQuant

SIRVs data #155