cgat-developers / cgat-apps

KeyError: 'ID' on gff32gtf utility #37

Closed sandbardev closed 5 years ago

sandbardev commented 5 years ago

I am trying to convert two GFF3 files to GTF using CGAT's gff32gtf utility:

ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/zea_mays (25798 KB)
ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/sorghum_bicolor/ (7098 KB)

input:

cgat gff32gtf -I Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr.gff3 -S Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr_cgatConverted.gtf

output:

2019-04-30 10:21:11,000 INFO output generated by gff32gtf -I Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr.gff3 -S Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr_cgatConverted.gtf \
job started at Tue Apr 30 10:21:11 2019 on XPS-MFL -- 0f207b8b-7463-4ff8-bf55-dc86ba574ad8 \
pid: 5319, system: Linux 4.15.0-47-generic #50 16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019 x86_64
2019-04-30 10:21:11,000 INFO by_chrom : False \
discard : True \
gene_field_or_pattern : ID \
gene_type : gene \
log_config_filename : None \
loglevel : 1 \
method : hierarchy \
missing_gene : True \
parent : Parent \
random_seed : None \
read_twice : False \
short_help : None \
stderr : <_io.TextIOWrapper name='' mode='w' encoding='UTF-8'> \
stdin : <_io.TextIOWrapper name='Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr.gff3' mode='r' encoding='utf-8'> \
stdlog : <_io.TextIOWrapper name='' mode='w' encoding='UTF-8'> \
stdout : <_io.TextIOWrapper name='Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-13-2019.chr_cgatConverted.gtf' mode='w' encoding='utf-8'> \
timeit_file : None \
timeit_header : None \
timeit_name : all \
tracing : None \
transcript_field_or_pattern : ID \
transcript_type : mRNA
Traceback (most recent call last):
  File "/home/mfl/miniconda3/envs/cgat-apps[v0.5.3]/bin/cgat", line 11, in <module>
    sys.exit(main())
  File "/home/mfl/miniconda3/envs/cgat-apps[v0.5.3]/lib/python3.6/site-packages/cgat/cgat.py", line 132, in main
    module.main(sys.argv)
  File "/home/mfl/miniconda3/envs/cgat-apps[v0.5.3]/lib/python3.6/site-packages/cgat/tools/gff32gtf.py", line 352, in main
    convert_hierarchy(chunk, second_gff_chunk, options)
  File "/home/mfl/miniconda3/envs/cgat-apps[v0.5.3]/lib/python3.6/site-packages/cgat/tools/gff32gtf.py", line 193, in convert_hierarchy
    options.gene_field_or_pattern, gff['ID']),
  File "/home/mfl/miniconda3/envs/cgat-apps[v0.5.3]/lib/python3.6/site-packages/cgat/GTF.py", line 1043, in __getitem__
    return self.attributes[key]
KeyError: 'ID'
(cgat-apps[v0.5.3]) mfl@XPS-MFL:~/Documents/genome_info/sorghum_bicolor$

The error message doesn't tell me much. Do I need to pass any additional options for this to work properly?

Acribbs commented 5 years ago

If you want to generate a GTF from your GFF3 using the default "hierarchy" method, you don't need to specify the --method option. Unless you change --gene-id or --transcript-id, both default to "ID".

Looking at your GFF3 file, the fourth line of the record (the exon entry) does not contain the ID attribute:

1       Joint Genome Institute  chromosome      1       80884392        .       .       .       ID=chromosome:1;Alias=CM000760.3,chr1,NC_012870.2
###
1       ena     gene    1951    2616    .       +       .       ID=gene:SORBI_3001G000100;biotype=protein_coding;description=hypothetical protein;gene_id=SORBI_3001G000100;logic_name=ena
1       ena     mRNA    1951    2616    .       +       .       ID=transcript:EER90453;Parent=gene:SORBI_3001G000100;biotype=protein_coding;transcript_id=EER90453
1       ena     exon    1951    2454    .       +       .       Parent=transcript:EER90453;Name=EER90453-1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=EER90453-1;rank=1

That is why you get an error stating that there is no ID field. I don't know what you plan to do downstream with this file, but it might be better to move the exon entries into a separate file and convert them separately.
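To see which feature types would trigger the KeyError, it can help to count the records that lack an ID attribute. The following is a minimal sketch, not part of cgat: the helper names `parse_attributes` and `features_missing_attribute` are hypothetical, and the sample lines are shortened versions of the records quoted above.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def features_missing_attribute(lines, attribute="ID"):
    """Count, per feature type, the records that lack the given attribute."""
    missing = Counter()
    for line in lines:
        # Skip blank lines, comments and the '###' record separators.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        if attribute not in parse_attributes(columns[8]):
            missing[columns[2]] += 1
    return missing


# Shortened versions of the records shown in this thread.
sample = [
    "1\tena\tgene\t1951\t2616\t.\t+\t.\tID=gene:SORBI_3001G000100;biotype=protein_coding",
    "###",
    "1\tena\tmRNA\t1951\t2616\t.\t+\t.\tID=transcript:EER90453;Parent=gene:SORBI_3001G000100",
    "1\tena\texon\t1951\t2454\t.\t+\t.\tParent=transcript:EER90453;Name=EER90453-1",
]

print(dict(features_missing_attribute(sample)))
```

On a real file, the same function could be fed an open file handle instead of the sample list.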

sandbardev commented 5 years ago

The plan is to use the converted GTF output as input for gtf2gtf and filter for longest-transcript. We'll use the result to perform DE analyses. I tried separating out the exons, none of which have an 'ID' tag in the GFF3 file, and then setting --gene-id and --transcript-id to see whether the program could use a different attribute for the conversion, but I kept getting the same KeyError: 'ID' from the gff32gtf script.

I also used gffread to convert the GFF3 file to GTF, which initially produced no errors, and then ran gtf2gtf on the result to filter for the longest transcript, but got a "Duplicate entry" error on an apparently ordinary line (151). I'll attach the GTF file. gffread seemed to know which IDs to use as gene_id for exons, based on where those IDs appeared earlier in the GFF3 file.

Can I do this conversion using cgat's script? I suspected the gffread conversion might be faulty, which is why I'm trying gff32gtf. Alternatively, how do I tell cgat's script which attribute to use instead of 'ID' for the conversion?

Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43_15-03-2019.chr_gffreadConverted.gtf.gz

Acribbs commented 5 years ago

To choose the field yourself, you need to set the method to set-field.

I ran:

zcat Sorghum_bicolor.Sorghum_bicolor_NCBIv3.43.chr.gff3.gz | grep exon | gzip > exon.gff3.gz
cgat gff32gtf --method=set-field --transcript-id=Parent --gene-id=Parent -I exon.gff3.gz -S out.gtf

Does that help?

The default hierarchy method requires both the ID and Parent fields.
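For illustration, here is one way a Parent chain can be walked so that exons without an ID still receive a transcript and gene identifier. This is a rough sketch under stated assumptions, not how gff32gtf is implemented: `resolve_leaf_ids` is a hypothetical helper, and it assumes each feature has a single Parent and that parents appear in the same input.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def resolve_leaf_ids(lines):
    """Return (feature, gene_id, transcript_id) for features that have a
    Parent but no ID, using the ID/Parent links of the other records."""
    parent_of = {}  # ID of a feature -> its Parent (e.g. transcript -> gene)
    leaves = []     # (feature type, Parent) for records without an ID
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        if "ID" in attrs:
            if "Parent" in attrs:
                parent_of[attrs["ID"]] = attrs["Parent"]
        elif "Parent" in attrs:
            leaves.append((columns[2], attrs["Parent"]))
    # gene_id is None when the transcript's own parent is not in the input.
    return [(feature, parent_of.get(transcript), transcript)
            for feature, transcript in leaves]


# Shortened versions of the records shown in this thread.
sample = [
    "1\tena\tgene\t1951\t2616\t.\t+\t.\tID=gene:SORBI_3001G000100",
    "1\tena\tmRNA\t1951\t2616\t.\t+\t.\tID=transcript:EER90453;Parent=gene:SORBI_3001G000100",
    "1\tena\texon\t1951\t2454\t.\t+\t.\tParent=transcript:EER90453;Name=EER90453-1",
]

print(resolve_leaf_ids(sample))
```

When the exons are split into their own file, the mRNA records are no longer available, so the gene lookup in this sketch would return None for every exon.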

sandbardev commented 5 years ago

Yes, it does help. Both fields seem to output the same value, however. If I run

cgat gff32gtf --method=set-field --transcript-id=Parent --gene-id=Name -I sorgo_exons.gff3 -S sorgo_exons_cgatConverted--setfield.gtf 

the output file will follow this structure:

1   ena exon    1951    2454    .   +   .   gene_id "['transcript:EER90453']"; transcript_id "['transcript:EER90453']"; Parent "transcript:EER90453"; Name "EER90453-1"; constitutive 1; ensembl_end_phase 0; ensembl_phase 0; exon_id "EER90453-1"; rank 1;
1   ena exon    2473    2616    .   +   .   gene_id "['transcript:EER90453']"; transcript_id "['transcript:EER90453']"; Parent "transcript:EER90453"; Name "EER90453-2"; constitutive 1; ensembl_end_phase 0; ensembl_phase 0; exon_id "EER90453-2"; rank 2;
1   ena exon    11180   11531   .   -   .   gene_id "['transcript:EER93047']"; transcript_id "['transcript:EER93047']"; Parent "transcript:EER93047"; Name "EER93047-11"; constitutive 1; ensembl_end_phase 1; ensembl_phase 0; exon_id "EER93047-11"; rank 11;

It does solve the problem that this issue concerns, however. I'll close it now, thank you.
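The gene_id and transcript_id values in the output above are wrapped as "['transcript:EER90453']", which looks like the string form of a one-element Python list. That reading is an assumption based only on the output shown here. If downstream tools need plain values, a small post-processing step could strip the wrapper. A minimal sketch, with `unwrap_list_values` as a hypothetical helper:

```python
import re

# Matches a quoted value of the form "['something']" and captures the inner text.
LIST_REPR = re.compile(r'''"\[\s*'([^']*)'\s*\]"''')


def unwrap_list_values(gtf_line):
    """Replace "['value']" with "value" in every attribute of a GTF line."""
    return LIST_REPR.sub(r'"\1"', gtf_line)


# Shortened version of the first output line shown in this thread.
line = ("1\tena\texon\t1951\t2454\t.\t+\t.\t"
        "gene_id \"['transcript:EER90453']\"; "
        "transcript_id \"['transcript:EER90453']\"; "
        "Parent \"transcript:EER90453\";")

print(unwrap_list_values(line))
```

This only tidies the formatting. It does not change the fact that gene_id and transcript_id carry the same transcript identifier.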