ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

GFF3 cannot be recognized #240

Open sanyalab opened 1 month ago

sanyalab commented 1 month ago

Hi,

The tool says that it can work with GFF3. But it only works with GTF. Can we get GFF3 support?

This is the error I get when I provide a GFF3-formatted file with the --genedb option:

2024-09-19 11:35:13,297 - ERROR - Input GTF seems to be corrupted (see warnings above).
2024-09-19 11:35:13,297 - ERROR - An attempt to correct this GTF was made, the result is written to dummy.corrected.gff3
2024-09-19 11:35:13,297 - ERROR - NB! some transcript / gene ids in the corrected annotation are modified.
2024-09-19 11:35:13,297 - ERROR - Provide a correct GTF by fixing the original input GTF or checking the corrected one.

Do you consume the gene annotations in GTF format or BED12 format? Is it OK to provide a BED12 file directly?

Thanks, Abhijit

andrewprzh commented 1 month ago

Dear @sanyalab

IsoQuant does support both GTF and GFF, but not BED. Could you send me the entire isoquant.log file? Also, you can try running IsoQuant with --no_gtf_check.

Best, Andrey

sanyalab commented 1 month ago

Hi Andrey,

I actually went ahead and converted the GFF3 to a geneDB format using gffutils, as a preprocessing step. It seems to be running fine now. The isoquant.log file is 152 MB, so I cannot upload it. But here are the first 10 lines and the last 10.

FIRST:

Command line: isoquant.py --reference genome.fa --genedb Annotation.gff3 --fastq Sample1.flnc.fastq Sample2.flnc.fastq Sample3.flnc.fastq Sample4.flnc.fastq --output FL_ALL --prefix OUT --data_type pacbio_ccs --fl_data --threads 24 --check_canonical --sqanti_output --matching_strategy precise --splice_correction_strategy default_pacbio --model_construction_strategy fl_pacbio
2024-09-19 11:34:28,180 - INFO - Running IsoQuant version 3.5.0
2024-09-19 11:34:28,222 - INFO -  === IsoQuant pipeline started ===
2024-09-19 11:34:28,222 - INFO - gffutils version: 0.13
2024-09-19 11:34:28,223 - INFO - pysam version: 0.22.1
2024-09-19 11:34:28,223 - INFO - pyfaidx version: 0.8.1.1
2024-09-19 11:34:28,228 - INFO - Checking input gene annotation
2024-09-19 11:34:29,316 - WARNING - Malformed GTF line 2 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,316 - WARNING - Chr00   GSAP    gene    151 2235    .   +   .   ID=dummy1;Name=dummy1
2024-09-19 11:34:29,316 - WARNING - Malformed GTF line 3 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00   GSAP    mRNA    151 2235    .   +   ID=dummy1.1;Parent=dummy1;Name=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 4 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00   GSAP    exon    151 2235    .   +   .   ID=dummy1.1.exon1;Parent=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 5 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00   GSAP    CDS 151 2235    .   +   0   ID=dummy1.1.cds1;Parent=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 6 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00   GSAP    gene    2412    4316    .   +   .   ID=dummy2;Name=dummy2
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 7 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00   GSAP    mRNA    2412    4316    .   +   .   ID=dummy2.1;Parent=dummy2;Name=dummy2.1

LAST:

2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638230 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26   GSAP    exon    1450283 1450513 .   +   .   ID=dummy6432.1.exon1;Parent=dummy6432.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638231 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26   GSAP    CDS 1450283 1450513 .   +   0   ID=dummy6432.1.cds1;Parent=dummy6432.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638232 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26   GSAP    gene    1465536 1465607 .   -   .   ID=dummy6433;Name=dummy6433
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638233 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26   GSAP    mRNA    1465536 1465607 .   -   .   ID=dummy6433.1;Parent=dummy6433;Name=dummy6433.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638234 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26   GSAP    exon    1465536 1465607 .   -   .   ID=dummy6433.1.exon1;Parent=dummy6433.1
2024-09-19 11:35:13,297 - ERROR - Input GTF seems to be corrupted (see warnings above).
2024-09-19 11:35:13,297 - ERROR - An attempt to correct this GTF was made, the result is written to /Path/FL_ALL/Annotation.corrected.gff3
2024-09-19 11:35:13,297 - ERROR - NB! some transcript / gene ids in the corrected annotation are modified.
2024-09-19 11:35:13,297 - ERROR - Provide a correct GTF by fixing the original input GTF or checking the corrected one.

It's not recognizing the GFF3 file.
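For readers hitting the same warnings, here is a minimal sketch of why a GTF-oriented check would flag every line of this file. It is not IsoQuant's actual checker; the function names are hypothetical and the parsing is deliberately simplistic. GTF stores attributes as `key "value";` pairs and expects a `gene_id`, whereas the GFF3 lines in the log use `key=value` pairs with `ID` and `Parent`.

```python
import re

# GTF-style attribute: key "value";
GTF_ATTR = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(column9: str) -> dict:
    """Parse column 9 assuming GTF syntax, e.g. gene_id "g1"; transcript_id "t1";"""
    return dict(GTF_ATTR.findall(column9))


def parse_gff3_attributes(column9: str) -> dict:
    """Parse column 9 assuming GFF3 syntax, e.g. ID=t1;Parent=g1"""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def has_gene_id(column9: str) -> bool:
    """True if a GTF-style gene_id attribute can be found (what a GTF check looks for)."""
    return "gene_id" in parse_gtf_attributes(column9)


# Column 9 of the first reported line from the log above (GFF3 syntax).
gff3_col9 = "ID=dummy1;Name=dummy1"
# The same gene written in GTF syntax, for comparison.
gtf_col9 = 'gene_id "dummy1"; gene_name "dummy1";'

print(has_gene_id(gff3_col9))            # no gene_id in GFF3 syntax
print(has_gene_id(gtf_col9))             # gene_id present in GTF syntax
print(parse_gff3_attributes(gff3_col9))  # GFF3 carries the identifier under ID
```

Read as GTF, the GFF3 lines have no `gene_id` at all, which is consistent with the "gene_id attribute value cannot be found" warning on every line.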

andrewprzh commented 1 month ago

@sanyalab

Thanks a lot! I will add GFF3 support to the internal checker. Since gffutils was able to convert it, you can also run IsoQuant with --no_gtf_check.
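The gffutils preprocessing step mentioned above could look roughly like the sketch below. The thread does not show the exact call that was used, so the parameter choices here (`merge_strategy="error"`, `keep_order=True`) and the helper names `default_db_path` and `build_genedb` are assumptions for illustration only.

```python
from pathlib import Path
from typing import Optional


def default_db_path(annotation_path: str) -> str:
    """Hypothetical helper: place the database next to the annotation, with a .db suffix."""
    return str(Path(annotation_path).with_suffix(".db"))


def build_genedb(annotation_path: str, db_path: Optional[str] = None) -> str:
    """Convert a GFF3/GTF annotation into a gffutils SQLite database and return its path."""
    # gffutils is a third-party package; it is imported here so that the
    # path helper above can be used even when gffutils is not installed.
    import gffutils

    db_path = db_path or default_db_path(annotation_path)
    gffutils.create_db(
        annotation_path,
        dbfn=db_path,
        force=True,              # overwrite an existing database file
        keep_order=True,         # preserve attribute order from the input
        merge_strategy="error",  # assumption: fail on duplicate feature IDs
    )
    return db_path


if __name__ == "__main__":
    # File name taken from the command line quoted in this thread.
    print(build_genedb("Annotation.gff3"))
```

The resulting database file would then be passed to `--genedb` in place of the GFF3 file, which is what the preprocessing workaround in this thread amounts to.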

andrewprzh commented 1 month ago

GFF3 should work in IsoQuant 3.6.1 without warnings.