Closed by Juke34 9 months ago
The problem is not related to the conversion itself but to how the input file is processed, and to the fact that the file is poorly formatted. No worries, it is possible to correct that with AGAT, but let's first explain what is going on:
In theory there are 3 types of features: level1 (e.g. gene), level2 (e.g. mRNA) and level3 (CDS, exon, start codon, stop codon). Features can be linked to each other to represent a record (e.g. the exon/CDS, mRNA and gene of a locus). The relationship between the features of a record is expressed via the ID/Parent attributes in GFF and the gene_id/transcript_id attributes in GTF. Your file does not follow these specifications.
1) Your input file contains only level3 features. 2) You do not have proper relationships (ID/Parent attributes in GFF or gene_id/transcript_id attributes in GTF).
i) While parsing, AGAT first tries to use the ID/Parent or gene_id/transcript_id attributes, but it fails. ii) It then tries to use a shared attribute that allows features to be grouped. Which attribute? It is specified at the beginning of the log:
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
* locus_tag
* gene_id
Unfortunately none of these attributes is in your file, so this fails too. iii) AGAT then parses the file sequentially: each level3 feature is attached to the last level2 feature encountered, and each level2 feature is attached to the last level1 feature encountered. Unfortunately you do not have any level1 or level2 features. So what happens? AGAT creates a level1 feature (gene) and a level2 feature (mRNA) on the fly and attaches everything to them, which is not what you want to achieve.
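To make the three fallback steps concrete, here is a minimal Python sketch of the logic as described above. It is not AGAT's actual code (AGAT is written in Perl); the function name `group_level3`, the dict-based feature representation and the `generated_gene_1` key are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_level3(features, common_tags=("locus_tag", "gene_id")):
    """Group level3 features into records, following the fallback order above.

    `features` is a list of dicts holding each feature's attributes
    (a simplified, hypothetical representation).
    Returns (strategy, groups), where groups maps a record key to its features.
    """
    # Step i: explicit relationships (Parent in GFF, transcript_id in GTF).
    if features and all("Parent" in f or "transcript_id" in f for f in features):
        groups = defaultdict(list)
        for f in features:
            groups[f.get("Parent", f.get("transcript_id"))].append(f)
        return "relationship", dict(groups)

    # Step ii: a shared attribute (common tag) present on every feature.
    for tag in common_tags:
        if features and all(tag in f for f in features):
            groups = defaultdict(list)
            for f in features:
                groups[f[tag]].append(f)
            return "common_tag:" + tag, dict(groups)

    # Step iii: sequential parsing. With no level1/level2 features in the file,
    # everything lands under a single gene/mRNA created on the fly.
    return "sequential", {"generated_gene_1": list(features)}


# Invented features that only carry transcriptId and name attributes.
exons = [
    {"type": "exon", "transcriptId": "t1", "name": "abc"},
    {"type": "exon", "transcriptId": "t2", "name": "abc"},
]

default_strategy, default_groups = group_level3(exons)
custom_strategy, custom_groups = group_level3(exons, common_tags=("transcriptId",))
```

With the default common tags the sketch falls through to the sequential step and yields a single group, which mirrors the "everything under one gene" symptom. Passing a tag that is actually present in the data gives one group per value.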
Solution:
If you know your species and your file, you should know whether your file contains isoforms. This information is important.
Now, by modifying the config file you can tell AGAT to use another attribute to group features properly, e.g. transcriptId.
Using this attribute will work well but will not allow isoforms. Why? Because you will get one gene per transcript.
There is no way to collect several transcripts under the same gene, because this information is missing.
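The effect of the grouping attribute on isoforms can be illustrated with a small Python sketch. The GTF lines below are invented for the example (the original file is not shown here); only the attribute names `transcriptId` and `name` come from the discussion, and `genes_from_tag` is a hypothetical helper, not part of AGAT.

```python
import re
from collections import defaultdict

# Matches GTF-style attribute pairs such as: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(column9):
    """Turn 'key "value"; key "value";' into a dict."""
    return dict(ATTR_RE.findall(column9))


def genes_from_tag(lines, tag):
    """Map each value of `tag` (one future gene) to the transcripts it would hold."""
    genes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        attrs = parse_gtf_attributes(fields[8])
        genes[attrs[tag]].add(attrs["transcriptId"])
    return {gene: sorted(transcripts) for gene, transcripts in genes.items()}


# Invented sample: two transcripts that share the same name.
SAMPLE = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tname "abc"; transcriptId "1001";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tname "abc"; transcriptId "1002";',
    'chr1\tsrc\texon\t100\t250\t.\t+\t.\tname "abc"; transcriptId "1001";',
]

by_transcript = genes_from_tag(SAMPLE, "transcriptId")
by_name = genes_from_tag(SAMPLE, "name")
```

Grouping on `transcriptId` gives two genes with one transcript each, so no isoforms can exist. Grouping on `name` gives one gene holding both transcripts.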
You might use the name attribute to group several transcripts with the same name under the same gene feature. But in that case you must be sure that no name is shared between transcripts that are unrelated (e.g. on different chromosomes, etc.).
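Before relying on name, a quick sanity check along these lines can flag values that appear on more than one chromosome or strand. This is a rough Python sketch under the assumption that the file is tab-separated GTF with `key "value";` attributes; `names_spanning_loci` is a hypothetical helper and the sample lines are invented.

```python
import re
from collections import defaultdict

# Matches GTF-style attribute pairs such as: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def names_spanning_loci(lines, tag="name"):
    """Return the values of `tag` seen on more than one (seqid, strand) pair."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        attrs = dict(ATTR_RE.findall(fields[8]))
        if tag in attrs:
            seen[attrs[tag]].add((fields[0], fields[6]))
    return {value: sorted(loci) for value, loci in seen.items() if len(loci) > 1}


# Invented sample: "abc" is reused on two chromosomes, "xyz" is not.
SAMPLE = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tname "abc"; transcriptId "1001";',
    'chr2\tsrc\texon\t500\t600\t.\t-\t.\tname "abc"; transcriptId "2001";',
    'chr1\tsrc\texon\t900\t950\t.\t+\t.\tname "xyz"; transcriptId "1003";',
]

suspicious = names_spanning_loci(SAMPLE)
```

An empty result does not prove the attribute is safe: two unrelated transcripts far apart on the same chromosome and strand would not be caught by this check.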
From a user...
Here is an example of my GTF:
Here is the GFF output:
Everything is collected under the same gene and transcript umbrella...
Could you help me to understand what is going on?