Closed sandbardev closed 5 years ago
If you want to generate a GTF from your GFF3 using the default "hierarchy" method, you don't need to specify the --method option. Unless you set --gene-id or --transcript-id, both values default to "ID".
Looking at your GFF3 file, your 4th line of the record does not contain the ID attribute:
1 Joint Genome Institute chromosome 1 80884392 . . . ID=chromosome:1;Alias=CM000760.3,chr1,NC_012870.2
###
1 ena gene 1951 2616 . + . ID=gene:SORBI_3001G000100;biotype=protein_coding;description=hypothetical protein;gene_id=SORBI_3001G000100;logic_name=ena
1 ena mRNA 1951 2616 . + . ID=transcript:EER90453;Parent=gene:SORBI_3001G000100;biotype=protein_coding;transcript_id=EER90453
1 ena exon 1951 2454 . + . Parent=transcript:EER90453;Name=EER90453-1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=EER90453-1;rank=1
That is why you get the error stating that there is no ID field. I don't know what you are planning to do downstream with this file, but it might be better to move the exon entries into a separate file and convert them separately.
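To see which feature types are affected before converting, a quick check along these lines may help. This is only a sketch: `parse_attributes` and `count_missing_id` are hypothetical helper names, not part of cgat, and the code assumes a tab-separated GFF3 with `key=value` attributes in column 9.

```python
from collections import Counter


def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_missing_id(lines):
    """Count, per feature type, the records whose attributes lack an ID key."""
    missing = Counter()
    for line in lines:
        # Skip blank lines, comments and '###' separators.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        if "ID" not in parse_attributes(cols[8]):
            missing[cols[2]] += 1
    return missing


# Two records from this thread, rebuilt with tab separators (attributes shortened).
sample = [
    "\t".join(["1", "ena", "mRNA", "1951", "2616", ".", "+", ".",
               "ID=transcript:EER90453;Parent=gene:SORBI_3001G000100"]),
    "###",
    "\t".join(["1", "ena", "exon", "1951", "2454", ".", "+", ".",
               "Parent=transcript:EER90453;Name=EER90453-1;exon_id=EER90453-1"]),
]
print(count_missing_id(sample))
```

If only exon records show up in the count, the missing ID is confined to the exon lines, which matches the record shown above.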
The plan is to use the converted GTF output as input for gtf2gtf and filter for longest-transcript. We'll use the result to perform DE analyses. I tried separating out the exons, none of which have an 'ID' tag in the GFF3 file, and then setting --gene-id and --transcript-id to see if the program could use a different attribute for the conversion, but I kept getting the same KeyError:ID from the gff32gtf script.
I also used gffread to convert the GFF3 file into GTF, which initially yielded no errors. I then ran gtf2gtf on the result to filter for the longest transcript, but got a "Duplicate entry" error on an apparently ordinary line (151). I'll attach the GTF file. gffread seemed to work out which IDs to use as gene_id for exons from where those IDs appeared earlier in the GFF3 file.
Can I do this conversion using cgat's script? I suspected the gffread conversion might have been compromised, which is why I'm trying to use gff32gtf. Alternatively, how do I tell cgat's script which attribute to use instead of 'ID' for the conversion?
Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-03-2019.chr_gffreadConverted.gtf.gz
To set the field, you need to set --method to set-field.
I ran:
zcat Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43.chr.gff3.gz | grep exon | gzip > exon.gff3.gz
cgat gff32gtf --method=set-field --transcript-id=Parent --gene-id=Parent -I exon.gff3.gz -S out.gtf
Does that help?
For the default hierarchy method you need both the ID and Parent fields.
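One caveat with the grep above: it keeps any line that contains "exon" anywhere, including in the attribute column. A stricter filter on the feature column (column 3) could be sketched as follows. The function names are hypothetical, and the sketch assumes a tab-separated, gzip-compressed GFF3.

```python
import gzip


def exon_lines(lines):
    """Yield only the records whose feature column (column 3) is exactly 'exon'."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == "exon":
            yield line


def write_exons(src_path, dst_path):
    """Copy the exon records from one gzipped GFF3 file into another."""
    with gzip.open(src_path, "rt") as fin, gzip.open(dst_path, "wt") as fout:
        fout.writelines(exon_lines(fin))


# Example call, using the file names from the commands above:
# write_exons("Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43.chr.gff3.gz", "exon.gff3.gz")
```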
Yes, it does help. However, both fields seem to come out with the same value. If I run
cgat gff32gtf --method=set-field --transcript-id=Parent --gene-id=Name -I sorgo_exons.gff3 -S sorgo_exons_cgatConverted--setfield.gtf
the output file will follow this structure:
1 ena exon 1951 2454 . + . gene_id "['transcript:EER90453']"; transcript_id "['transcript:EER90453']"; Parent "transcript:EER90453"; Name "EER90453-1"; constitutive 1; ensembl_end_phase 0; ensembl_phase 0; exon_id "EER90453-1"; rank 1;
1 ena exon 2473 2616 . + . gene_id "['transcript:EER90453']"; transcript_id "['transcript:EER90453']"; Parent "transcript:EER90453"; Name "EER90453-2"; constitutive 1; ensembl_end_phase 0; ensembl_phase 0; exon_id "EER90453-2"; rank 2;
1 ena exon 11180 11531 . - . gene_id "['transcript:EER93047']"; transcript_id "['transcript:EER93047']"; Parent "transcript:EER93047"; Name "EER93047-11"; constitutive 1; ensembl_end_phase 1; ensembl_phase 0; exon_id "EER93047-11"; rank 11;
It does solve the problem that this issue concerns, however. I'll close it now, thank you.
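For anyone who needs distinct gene_id and transcript_id values on the exon lines, one option is to look up each exon's gene through its parent transcript, since the mRNA records in this file carry both ID and Parent. The sketch below illustrates that idea only. It is not how gff32gtf or gffread work internally, `resolve_exon_ids` is a hypothetical name, and it assumes the full tab-separated GFF3 (genes, mRNAs and exons together) is passed in.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def resolve_exon_ids(lines):
    """Return a (gene_id, transcript_id) pair for every exon record.

    The gene is found by following exon Parent -> transcript ID -> transcript Parent.
    gene_id is None when the transcript has no recorded parent.
    """
    parent_of = {}      # feature ID -> its Parent (e.g. transcript -> gene)
    exon_parents = []   # one entry per exon/transcript pairing, in file order
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]
        if cols[2] == "exon" and "Parent" in attrs:
            # GFF3 allows several comma-separated parents per feature.
            exon_parents.extend(attrs["Parent"].split(","))
    # Resolve after reading everything, so record order does not matter.
    return [(parent_of.get(t), t) for t in exon_parents]


# Records from this thread, rebuilt with tab separators (attributes shortened).
sample = [
    "\t".join(["1", "ena", "gene", "1951", "2616", ".", "+", ".",
               "ID=gene:SORBI_3001G000100;biotype=protein_coding"]),
    "\t".join(["1", "ena", "mRNA", "1951", "2616", ".", "+", ".",
               "ID=transcript:EER90453;Parent=gene:SORBI_3001G000100"]),
    "\t".join(["1", "ena", "exon", "1951", "2454", ".", "+", ".",
               "Parent=transcript:EER90453;Name=EER90453-1"]),
]
print(resolve_exon_ids(sample))
```

The pairs could then be written into the gene_id and transcript_id attributes of the exon lines in place of the two identical values shown above.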
I am trying to convert two gff3 files to gtf using CGAT's gff32gtf utility: ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/zea_mays (25798 KB) ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/sorghum_bicolor/ (7098 KB)
input:
output:
The error message doesn't tell me much. Should I have any added criteria in order for this to work properly?