daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

Handling of multi-line features with same ID #206

Closed andrewkennard closed 1 year ago

andrewkennard commented 1 year ago

I am excited to use gffutils for updating and combining multiple annotation sources! Thank you very much for the work to put together this vital tool. One thing I am confused about is the implementation of the ID attribute for multi-line features like a CDS. I understand that gffutils treats each row in a GFF file as a separate Feature, and that each Feature therefore requires a unique ID in the database. But this seems to contradict the GFF3 specification https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md which states that the ID attribute must be used for entries (like a CDS) that occur over multiple lines, and furthermore, that the same ID must be used for each of those lines. This blog post also goes into the details: https://biowize.wordpress.com/2013/05/08/gff3-101-multi-line-features-and-multiple-parents/

Is this a limitation of the design of gffutils, or something that could be resolved in the same database framework? I noticed in the gffutils documentation that discontinuous CDS features are treated as separate entities linked by Name field but not an ID field. Is this standard practice or a workaround for gffutils?
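To make the situation concrete, here is a small sketch using only the standard library (not gffutils). The sample records, the `SAMPLE_GFF3` name, and the `shared_ids` helper are made up for illustration; they show a CDS split over two lines with one shared ID, as the GFF3 specification describes, and count how many IDs per feature type are reused across lines.

```python
from collections import Counter

# Hypothetical GFF3 snippet: one CDS written as two lines sharing ID=cds1.
SAMPLE_GFF3 = """\
##gff-version 3
chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1
chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\texample\tCDS\t150\t300\t.\t+\t0\tID=cds1;Parent=mrna1
chr1\texample\tCDS\t500\t800\t.\t+\t2\tID=cds1;Parent=mrna1
"""


def shared_ids(gff_text):
    """Return {featuretype: number of IDs that appear on more than one line}."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        featuretype, attributes = fields[2], fields[8]
        for pair in attributes.split(";"):
            key, _, value = pair.partition("=")
            if key == "ID":
                counts[(featuretype, value)] += 1

    repeated = Counter()
    for (featuretype, _feature_id), n in counts.items():
        if n > 1:
            repeated[featuretype] += 1
    return dict(repeated)


print(shared_ids(SAMPLE_GFF3))
```

Running the same kind of count over a real annotation file is one way to find out which feature types would collide if the ID attribute were used directly as the primary key.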

andrewkennard commented 1 year ago

I noticed that #202 covers this issue. I thought I was handling this with the merge_strategy='create_unique' option when creating the database, but when I updated one database with entries from another database containing multi-line features, I encountered a UNIQUE integrity error.

I think this is probably solved by me studying the options more carefully, in particular defining a more sophisticated id_spec or adding an attribute to help create a unique primary key.

daler commented 1 year ago

If you end up having trouble coming up with a way of handling it, you can always post an example and I can try to get something working. Agreed though, the solution likely lies in id_spec.
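For readers landing here, a minimal sketch of the id_spec idea, not a confirmed fix for any particular file. It assumes that id_spec accepts a callable that receives each feature and returns the primary key, and that returning None falls back to an autoincremented key, which is how I understand the `create_db` documentation. The `unique_id` and `build_db` names, the key format, and the choice of feature types in `MULTILINE_TYPES` are all hypothetical.

```python
# Feature types assumed (not confirmed in this thread) to reuse one ID
# across several lines; adjust this set to match your annotation source.
MULTILINE_TYPES = {"CDS", "five_prime_UTR", "three_prime_UTR"}


def unique_id(feature):
    """Build a primary key that stays unique for multi-line features.

    Only the database key changes; the ID attribute stored with the
    feature is left untouched.
    """
    ids = feature.attributes["ID"] if "ID" in feature.attributes else []
    if not ids:
        return None  # no ID attribute: let gffutils pick a key
    if feature.featuretype in MULTILINE_TYPES:
        # Append the coordinates so each segment gets its own key.
        return f"{ids[0]}:{feature.start}-{feature.end}"
    return ids[0]


def build_db(gff_path, db_path):
    """Create the database with the custom key function (sketch only)."""
    import gffutils  # imported here so unique_id can be tried on its own

    return gffutils.create_db(
        gff_path,
        dbfn=db_path,
        id_spec=unique_id,
        merge_strategy="error",  # surface any collisions that remain
        force=True,
        keep_order=True,
    )
```

Because the shared ID stays in the attributes, the segments of one CDS can still be grouped by that value after the database is built. Keeping merge_strategy="error" means any feature type missing from the set shows up as an explicit failure rather than being silently merged.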

yangyxt commented 1 year ago

I also ran into a similar error when trying to parse a GENCODE gene annotation GFF3 file into a SQLite database with gffutils. The error comes from the lines recording CDS, 5' UTR, and 3' UTR regions, since they share the same ID across different genomic intervals.

I already tried setting merge_strategy to merge and to create_unique, but both returned an error like this:

    INFO:2023-06-15 18:09:11,191:create_gffutils_db:334:The output database file is /paedyl01/disk1/yangyxt/public_data/gene_annotation/gencode.v43lift37.annotation.db
    Traceback (most recent call last):
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 622, in _populate_from_lines
        self._insert(f, c)
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
        cursor.execute(constants._INSERT, feature.astuple())
    sqlite3.IntegrityError: UNIQUE constraint failed: features.id

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "", line 1, in
      File "/paedyl01/disk1/yangyxt/ngs_scripts/python_utils.py", line 344, in create_gffutils_db
        **kwargs)
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 1401, in create_db
        c.create()
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 543, in create
        self._populate_from_lines(self.iterator)
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 656, in _populate_from_lines
        self._insert(f, c)
      File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
        cursor.execute(constants._INSERT, feature.astuple())
    sqlite3.IntegrityError: UNIQUE constraint failed: features.id

Here is my command (I'm using gffutils 0.11.1): `db = gffutils.create_db(gff_file, dbfn=db_file, force=True, keep_order=True, sort_attribute_values=True, merge_strategy = "merge",`

Here is the GFF3 file so you can reproduce the error: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh37_mapping/gencode.v43lift37.annotation.gff3.gz

Please take a look @daler