Closed rsalz closed 1 year ago
Dear @rsalz
Could you please provide an example of the error messages and the GTF itself? I ran AGAT gff2bed on a simple IsoQuant output and it worked just fine.
Best Andrey
I used the "transcript_models.gtf" that was output of the following command:
isoquant.py --reference gencode/GRCh38.primary_assembly.genome.fa.gz --genedb gencode.v39.annotation.gtf --complete_genedb --fastq_list sample_names.txt --read_group file_name --data_type pacbio_ccs --fl_data --check_canonical --count_exons --threads 20 --transcript_quantification with_ambiguous --gene_quantification unique_only --splice_correction_strategy default_pacbio --model_construction_strategy fl_pacbio -o five_conditions
So it starts with these warnings:
There is a problem we found several formats in this file:
2,3
Let's see what we can do...
=> GFF version parser used: 3
gff3 reader error level1: No ID attribute found @ for the feature: chr1 IsoQuant gene 14360 29570 . - .
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True ; ID "transcript-1"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the feature is:
chr1 IsoQuant transcript 14360 29373 . - . Canonical True ; ID "transcript-1"
WARNING level3: No Parent attribute found @ for the feature: chr1 IsoQuant exon 29321 29373 . - . ID "exon-1"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the feature is:
chr1 IsoQuant exon 29321 29373 . - . ID "exon-1"
and then there are a bunch of these:
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True ; ID "transcript-2"
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True ; ID "transcript-3"
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant transcript 14360 29373 . - . Canonical True
In summary:
48470 warning messages: WARNING level2: No Parent attribute found
473055 warning messages: WARNING level3: No Parent attribute found
48470 warning messages: gff3 reader error level2: No ID attribute found
521525 warning messages: WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus.
13897 warning messages: gff3 reader error level1: No ID attribute found
I found the format violations in the GTF.
The GTF attributes field should always be in the format attribute1 "value1"; attribute2 "value2";
The output instead contains:
transcripts: 4; in the gene lines, to indicate how many transcripts there are for a gene
Canonical=True; in the transcript lines, to say whether the transcript is canonical
I recommend you fix these violations to adhere to the GTF formatting guidelines. After fixing these, I got AGAT to function properly.
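To make the two violations above concrete, here is a minimal Python sketch that flags attribute fragments not written as key "value"; and rewrites them. The function names are hypothetical, the regular expression is a simplification of the GTF convention rather than a full parser, and this is not how IsoQuant or AGAT handle attributes internally.

```python
import re

# One well-formed GTF attribute: key, a space, a double-quoted value, a semicolon.
ATTRIBUTE = re.compile(r'\s*([A-Za-z_][\w.]*) "([^"]*)";')


def attribute_violations(attributes: str) -> list[str]:
    """Return fragments of an attributes field that are not `key "value";` pairs."""
    # Strip every well-formed pair; whatever is left over is malformed.
    remainder = ATTRIBUTE.sub("", attributes)
    return [frag.strip() for frag in remainder.split(";") if frag.strip()]


def normalize_fragment(fragment: str) -> str:
    """Rewrite `key: value` or `key=value` as `key "value";` (illustrative only)."""
    key, value = re.split(r"\s*[:=]\s*", fragment, maxsplit=1)
    return f'{key} "{value}";'


if __name__ == "__main__":
    for attrs in ('gene_id "G1"; transcripts: 4;',
                  'gene_id "G1"; transcript_id "T1"; Canonical=True;'):
        for bad in attribute_violations(attrs):
            print(bad, "->", normalize_fragment(bad))
```

Running a check like this over the ninth column of each line before passing the file to AGAT would show whether any non-conforming attributes remain.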
Dear @rsalz
Thanks a lot! Will fix and update the release.
Regarding point 3 - I find it quite useful to keep this information in GTF (as people may move/send it where the log files are not available). I guess comments are supported by GTF at any line (https://agat.readthedocs.io/en/latest/gxf.html#gtf), am I wrong?
Best Andrey
Released 3.0.2 with attribute fixes https://github.com/ablab/IsoQuant/releases/tag/v3.0.2
Thanks! A couple more suggestions:
Could you please add exon IDs to the GTF output? The exons are numbered relative to each other within each transcript, but they cannot easily be compared between different transcripts without exon IDs. Thanks.
Dear @rsalz
Yes, this is possible. Although I'm currently busy with other projects, I will return to IsoQuant maintenance soon.
For now I suggest combining chromosome, coordinates and strand, e.g. chr1_100007034_100007156_+, which should uniquely identify the exon.
Best Andrey
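The workaround suggested above can be sketched in Python as follows. This is a minimal illustration that assumes standard tab-separated GTF columns (chromosome in column 1, feature type in column 3, start and end in columns 4 and 5, strand in column 7). The function names and the underscore separator are assumptions, not part of IsoQuant.

```python
from typing import Iterable, Iterator


def exon_id(chrom: str, start: int, end: int, strand: str) -> str:
    """Build an exon identifier from chromosome, coordinates and strand."""
    return f"{chrom}_{start}_{end}_{strand}"


def exon_ids_from_gtf(lines: Iterable[str]) -> Iterator[str]:
    """Yield one identifier per exon line of a GTF, skipping comments and other features."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        yield exon_id(fields[0], int(fields[3]), int(fields[4]), fields[6])
```

Because the identifier depends only on the genomic location, the same exon shared by two transcripts gets the same ID, which is what makes exons comparable across transcripts.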
I noticed that in some novel genes, the strand field in the GTF is ".". Why?
Could you send me an example? I presume it may happen for mono-exon genes, since IsoQuant detects strands based on canonical splice sites.
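To illustrate why a mono-exon gene can end up with "." as its strand, here is a minimal Python sketch of strand inference from canonical splice sites. It is not IsoQuant's actual implementation. It assumes each intron is summarized by its first two and last two bases as read on the forward reference strand, and it considers only the canonical GT-AG motif, which appears as CT-AC when the transcript lies on the reverse strand. The function name is hypothetical.

```python
def infer_strand(intron_motifs: list[tuple[str, str]]) -> str:
    """Guess a transcript's strand from its intron boundary dinucleotides.

    Each motif is (first two bases, last two bases) of an intron, read on
    the forward reference strand. Returns "+", "-" or "." if undetermined.
    """
    plus = sum(1 for motif in intron_motifs if motif == ("GT", "AG"))
    minus = sum(1 for motif in intron_motifs if motif == ("CT", "AC"))
    if plus > minus:
        return "+"
    if minus > plus:
        return "-"
    # No introns (mono-exon) or no majority: the strand stays unknown.
    return "."
```

A transcript without introns supplies no motifs at all, so under this scheme there is nothing to vote on and the strand is reported as ".".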
Have you been able to do this? It would be great if you could enrich the GTF with even more additional information such as gene name and gene type, like TALON's output.
Hi @rsalz
Monoexonic gene strand detection should be fixed in version 3.1.
Exon ids and gene information will be added in 3.2. Hope to get my hands on this soon.
Best Andrey
@rsalz
Finally, 3.2 is released; exon IDs and additional gene information are now in the output annotation.
Best Andrey
Is there a way you could make the output GTF file semantically compliant? I get multiple error messages when using AGAT gff2bed that I never encountered when using TALON gtf output with the same tool. The transcript ids are not preserved when changing formats. If you could make the GTF compatible with AGAT, then it is likely good enough for any other downstream tools others would like to use after IsoQuant. Thanks in advance for any help or suggestions on this!