NAL-i5K / GFF3toolkit

Python programs for processing GFF3 files

SeqID does not end with a number. #126

Closed tbazilegith closed 2 years ago

tbazilegith commented 2 years ago

Hello, I ran gff3_sort using the command below and got the error that follows:

gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3

ERROR [SeqID] SeqID does not end with a number.

I went ahead and added the -r flag:

gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3 -r

But I got this:

Traceback (most recent call last):
  File "/apps/gff3toolkit/2.0.3/bin/gff3_sort", line 8, in <module>
    sys.exit(script_main())
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 437, in script_main
    main(args.gff_file, output=args.output_gff, isoform_sort=args.isoform_sort, sorting_order=sorting_order, logger=logger_stderr, reference=args.reference)
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 223, in main
    sequence_regions[sequence_region['seqid']] = (sequence_region['start'], sequence_region['end'])
KeyError: 'end'

It seems to me that the above "Line 6" must be skipped in the file annot.gff

Any thoughts on that? Thanks, TJ

mpoelchau commented 2 years ago

@tbazilegith this error looks similar to the one reported in #125. Can you post some examples of the sequence directive lines? Do they all have a number as the end coordinate?
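For anyone wanting to pull those lines out of a large file quickly, here is a minimal Python sketch. The function name `sequence_region_directives` and the sample lines are made up for illustration; they are not part of GFF3toolkit.

```python
def sequence_region_directives(lines):
    """Yield (line_number, text) for each ##sequence-region directive."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("##sequence-region"):
            yield number, line.rstrip("\n")


# Illustrative input; in practice pass an open file handle instead,
# e.g. `with open("annot.gff") as handle: ...`
sample = [
    "##gff-version 3\n",
    "##sequence-region 1 3396752\n",
    "1\tLocal\tregion\t1\t3396752\t.\t+\t.\tID=1:1..3396752\n",
]

for number, text in sequence_region_directives(sample):
    print(number, text)
```

Each printed line can then be checked by eye for a seqid followed by a numeric start and end.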

tbazilegith commented 2 years ago

Hello MPoelchau, Here is what I have

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region 1 3396752
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1386
1 Local region 1 3396752 . + . ID=1:1..3396752;Dbxref=taxon:1386;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=replaceme
1 . pseudogene 1 144 . - . ID=gene-tmp_000001;Name=tmp_000001;gbkey=Gene;gene_biotype=pseudogene;locus_tag=tmp_000001;pseudo=true

Thanks, TJ

tbazilegith commented 2 years ago

Hello MPoelchau, Here is the full header

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region 1 3396752
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1386
1 Local region 1 3396752 . + . ID=1:1..3396752;Dbxref=taxon:1386;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=replaceme
1 . pseudogene 1 144 . - . ID=gene-tmp_000001;Name=tmp_000001;gbkey=Gene;gene_biotype=pseudogene;locus_tag=tmp_000001;pseudo=true

Thanks, TJ

mpoelchau commented 2 years ago

@tbazilegith it looks like the sequence-region directive is missing a '1' (representing either the chromosome or the start coordinate). The format is "##sequence-region seqid start end", so the line should instead be "##sequence-region 1 1 3396752".
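A rough way to check a directive against that three-field format is sketched below. The helper `check_sequence_region` is hypothetical and is not how gff3_sort itself parses the line; it only counts whitespace-separated fields and tests that start and end are integers.

```python
def check_sequence_region(line):
    """Return None if the directive looks well formed, else a short problem description."""
    fields = line.split()
    if not fields or fields[0] != "##sequence-region":
        return "not a ##sequence-region directive"
    if len(fields) != 4:
        # The directive name plus seqid, start and end makes four tokens.
        return f"expected seqid, start and end, found {len(fields) - 1} field(s)"
    _, seqid, start, end = fields
    if not (start.isdigit() and end.isdigit()):
        return "start and end must be integers"
    if int(start) > int(end):
        return "start is greater than end"
    return None


print(check_sequence_region("##sequence-region 1 3396752"))
print(check_sequence_region("##sequence-region 1 1 3396752"))
```

The first call reports the missing field for the directive posted above, and the second returns None for the corrected line. A directive with only two values is also consistent with the KeyError: 'end' in the traceback, although that link is an inference from this thread rather than something confirmed against the gff3_sort source.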

mpoelchau commented 2 years ago

@tbazilegith just following up, did fixing the sequence region directive work for you?

mpoelchau commented 2 years ago

I'll close this issue but feel free to re-open if that didn't help.