output GTF format compliance

rsalz commented 2 years ago

Is there a way you could make the output GTF file semantically compliant? I get multiple error messages when using AGAT gff2bed that I never encountered when using TALON gtf output with the same tool. The transcript ids are not preserved when changing formats. If you could make the GTF compatible with AGAT, then it is likely good enough for any other downstream tools others would like to use after IsoQuant. Thanks in advance for any help or suggestions on this!

andrewprzh commented 2 years ago

Dear @rsalz

Could you please provide an example of the error messages and the GTF itself? I ran AGAT gff2bed on an simple IsoQuant output and it worked just fine.

Best Andrey

rsalz commented 2 years ago

I used the "transcript_models.gtf" that was output of the following command: isoquant.py --reference gencode/GRCh38.primary_assembly.genome.fa.gz --genedb gencode.v39.annotation.gtf --complete_genedb --fastq_list sample_names.txt --read_group file_name --data_type pacbio_ccs --fl_data --check_canonical --count_exons --threads 20 --transcript_quantification with_ambiguous --gene_quantification unique_only --splice_correction_strategy default_pacbio --model_construction_strategy fl_pacbio -o five_conditions

So it starts with these warnings

There is a problem we found several formats in this file:
2,3
Let's see what we can do...
=> GFF version parser used: 3
gff3 reader error level1: No ID attribute found @ for the feature: chr1 IsoQuant        gene    14360   29570   .     -
        .
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant        transcript      14360   29373 .
        -       .       Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1       IsoQuant        transcript      14360   29373 .
        -       .       Canonical True ; ID "transcript-1"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will han
dle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the f
eature is:
chr1    IsoQuant        transcript      14360   29373   .       -       .       Canonical True ; ID "transcript-1"
WARNING level3: No Parent attribute found @ for the feature: chr1       IsoQuant        exon    29321   29373   .     -
        .       ID "exon-1"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will han
dle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the f
eature is:
chr1    IsoQuant        exon    29321   29373   .       -       .       ID "exon-1"

and then there are a bunch of these:


gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant        transcript      14360   29373 .
        -       .       Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1       IsoQuant        transcript      14360   29373 .
        -       .       Canonical True ; ID "transcript-2"
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant        transcript      14360   29373 .
        -       .       Canonical True
WARNING level2: No Parent attribute found @ for the feature: chr1       IsoQuant        transcript      14360   29373 .
        -       .       Canonical True ; ID "transcript-3"
gff3 reader error level2: No ID attribute found @ for the feature: chr1 IsoQuant        transcript      14360   29373 .
        -       .       Canonical True

in summation:


48470 warning messages: WARNING level2: No Parent attribute found
473055 warning messages: WARNING level3: No Parent attribute found
48470 warning messages: gff3 reader error level2: No ID attribute found
521525 warning messages: WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus.
13897 warning messages: gff3 reader error level1: No ID attribute found

rsalz commented 2 years ago

i found the format violations in the GTF. GTF format attributes field should always be in the format attribute1 "value1"; attribute2 "value2";

you have transcripts: 4; in the gene lines to indicate how many transcripts there are for a gene
you also have Canonical=True; in the transcript lines to say whether transcript is canonical
you include a header in the GTF file that starts with # to include the command used - better to remove this

I recommend you fix these violations to adhere to the GTF formatting guidelines. After fixing these, i got AGAT to function properly.

andrewprzh commented 2 years ago

Dear @rsalz

Thanks a lot! Will fix and update the release.

Regarding point 3 - I find it quite useful to keep this information in GTF (as people may move/send it where the log files are not available). I guess comments are supported by GTF at any line (https://agat.readthedocs.io/en/latest/gxf.html#gtf), am I wrong?

Best Andrey

andrewprzh commented 2 years ago

Released 3.0.2 with attribute fixes https://github.com/ablab/IsoQuant/releases/tag/v3.0.2

rsalz commented 2 years ago

Thanks! A couple more suggestions:

in the transcript_models files, known transcripts are still given identifiers such as 'transcript_238.chr1.known'. I wonder why you would make up identifiers for known transcripts when identifiers are already present e.g. ENST12345678 (and then you put those in another attribute called reference_transcript_id). This is making things a bit difficult for downstream analysis... i would recommend removing 'reference_transcript_id', putting known IDs in the regular transcript_id field instead of inventing new ones, and making another attribute if desired like type_novelty with 'nic' and 'nnc' and 'known' as values
the underscore in the transcript id field (between transcript and 238 in transcript_238.chr1.known) is a bit annoying for parsing. I would like to see the underscore removed

andrewprzh commented 2 years ago

Dear @rsalz

The IDs were initially made for LRGASP challenge that had specific requirements for GTF, so kind of an old legacy. I agree that inventing new names for known transcripts decreases usability and can be removed.
No specific reasons for using underscore :) Will do.

rsalz commented 2 years ago

Could you please add exon ids to the gtf output? The exons have numbers relative to each other per transcript but they cannot be compared between different transcripts so easily without adding exon ids. thanks

andrewprzh commented 1 year ago

Dear @rsalz

Yes, this is possible. Although I'm currently busy with other projects, I will return to IsoQuant maintainence soon.

For now I suggest to combine chromosome, coordinates and strand, e.g. chr1_100007034100007156+, that should uniquiely identify the exon.

Best Andrey

rsalz commented 1 year ago

I noticed that in some novel genes, the strand field in the gtf is . . Why?

andrewprzh commented 1 year ago

Could you send me example? I presume it may happen for mono-exon genes since IsoQuant detects strands based on canonical splice sites.

rsalz commented 1 year ago

Dear @rsalz

Yes, this is possible. Although I'm currently busy with other projects, I will return to IsoQuant maintainence soon.

For now I suggest to combine chromosome, coordinates and strand, e.g. chr1_100007034100007156+, that should uniquiely identify the exon.

Best Andrey

Have you been able to do this? It would be great if you could enrich the GTF with even more additional information such as gene name and gene type, like TALON's output.

andrewprzh commented 1 year ago

Hi @rsalz

Monoexonic gene strand detection should be fixed in version 3.1.

Exon ids and gene information will be added in 3.2. Hope to get my hands on this soon.

Best Andrey

andrewprzh commented 1 year ago

@rsalz

Finally, 3.2 is released, exon ids and additional gene information is now in the output annotation.

Best Andrey

ablab / IsoQuant

output GTF format compliance #42