daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

Handle a bug in NCBI gtf #172

Closed (esrice closed 2 years ago)

esrice commented 3 years ago

NCBI GTF files sometimes contain an empty transcript_id field, which causes an IndexError when creating a db: gffutils sees that a transcript_id attribute is present and then tries to use it despite it being blank. This commit adds a second check that the attribute is not blank before accessing it, which avoids the error.

Example of bad line in NCBI gtf:

NC_049222.1     Gnomon  gene    209085  282880  .       -       .       gene_id "ENPP1_3"; transcript_id ""; db_xref "GeneID:100856150"; db_xref "VGNC:VGNC:40374"; gbkey "Gene"; gene "ENPP1"; gene_biotype "protein_coding";

Stacktrace of error in gffutils caused by this line:

  File "/storage/hpc/group/warrenlab/users/esrbhb/mambaforge/envs/bio/lib/python3.9/site-packages/gffutils/create.py", line 1292, in create_db
    c.create()
  File "/storage/hpc/group/warrenlab/users/esrbhb/mambaforge/envs/bio/lib/python3.9/site-packages/gffutils/create.py", line 507, in create
    self._populate_from_lines(self.iterator)
  File "/storage/hpc/group/warrenlab/users/esrbhb/mambaforge/envs/bio/lib/python3.9/site-packages/gffutils/create.py", line 788, in _populate_from_lines
    parent = f.attributes[self.transcript_key][0]
IndexError: list index out of range
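
For illustration, here is a minimal sketch of the kind of guard described above. It is not the actual diff from this PR and does not call gffutils. It assumes, based on the IndexError in the traceback, that the blank transcript_id ends up as an empty list in the feature's attributes. The plain dict and the helper name first_value_or_none are hypothetical stand-ins.

```python
# Stand-in for the parsed attributes of the bad line. ASSUMPTION: the blank
# transcript_id "" is stored as an empty list, as the IndexError suggests.
attributes = {
    "gene_id": ["ENPP1_3"],
    "transcript_id": [],
}
transcript_key = "transcript_id"


def first_value_or_none(attrs, key):
    """Hypothetical helper: return the first value for key, or None if the
    key is missing or its value list is empty."""
    values = attrs.get(key)
    if values:  # False for both a missing key (None) and an empty list
        return values[0]
    return None


# Checking only for presence of the key reproduces the failure.
try:
    if transcript_key in attributes:
        parent = attributes[transcript_key][0]
except IndexError:
    parent = None

# Checking presence and non-emptiness avoids it.
guarded_parent = first_value_or_none(attributes, transcript_key)
```

With the guard, a feature whose transcript_id is blank is treated the same as one with no transcript_id at all, instead of aborting database creation.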