daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
284 stars 78 forks source link

Import of introns fails #181

Closed Yu-jinKim closed 2 years ago

Yu-jinKim commented 2 years ago

Hi,

I have been trying to pull out exons of a GFF but i'm getting an error that i don't understand.

Here is a head -14 of that GFF which is enough to get the error:

#OriSeqID=Chr1  Accession=GWHACDB00000001.1
GWHACDB00000001.1   .   gene    9738    14906   1   +   .   ID=Hc.01G000010;Accession=GWHGACDB000001.1;Augustus_transcriptSupport_percentage=100;Augustus_intronSupport=4/4;Source=augustus000002;Integrity=complete;;transl_table=1
GWHACDB00000001.1   .   mRNA    9738    14906   1   +   .   ID=Hc.01G000010.t1;Accession=GWHTACDB000001.1;Parent=Hc.01G000010;Parent_Accession=GWHGACDB000001.1;IntronSupport=4/4;Integrity=complete;;transl_table=1
GWHACDB00000001.1   .   exon    9738    11686   .   +   .   ID=Hc.01G000010.t1.exon1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   CDS 9738    11686   1   +   0   ID=Hc.01G000010.t1.CDS1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   exon    11897   12671   .   +   .   ID=Hc.01G000010.t1.exon2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   CDS 11897   12671   1   +   1   ID=Hc.01G000010.t1.CDS2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   exon    13090   13368   .   +   .   ID=Hc.01G000010.t1.exon3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   CDS 13090   13368   1   +   0   ID=Hc.01G000010.t1.CDS3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   exon    13650   14396   .   +   .   ID=Hc.01G000010.t1.exon4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   CDS 13650   14396   1   +   0   ID=Hc.01G000010.t1.CDS4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   exon    14601   14906   .   +   .   ID=Hc.01G000010.t1.exon5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   CDS 14601   14708   1   +   0   ID=Hc.01G000010.t1.CDS5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1   .   three_prime_UTR 14709   14906   .   +   .   ID=Hc.01G000010.t1.3UTR1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1

Using this:

import gffutils

try:
    db = gffutils.create_db("test.gff", "gff.sqlite", verbose=True)
except Exception as e:
    db = gffutils.interface.FeatureDB("gff.sqlite")

db.update(db.create_introns(), verbose=True)

I get this

2022-03-11 14:24:50,576 - INFO - Populating features
Traceback (most recent call last):t-order relations: 0 features
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 589, in _populate_from_lines
    self._insert(f, c)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 530, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kimy/daz/parse_gff.py", line 11, in <module>
    db.update(db.create_introns(), verbose=True)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/interface.py", line 938, in update
    db._populate_from_lines(data)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 591, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 226, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID Hc.01G000010.t1.exon1

Looking at the introns, they look fine:

import gffutils

try:
    db = gffutils.create_db("test.gff", "gff.sqlite", verbose=True)
except Exception as e:
    db = gffutils.interface.FeatureDB("gff.sqlite")

for intron in db.create_introns():
    print(intron)
GWHACDB00000001.1       gffutils_derived        intron  11687   11896   .       +       .       ID=Hc.01G000010.t1.exon1,Hc.01G000010.t1.exon2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  12672   13089   .       +       .       ID=Hc.01G000010.t1.exon2,Hc.01G000010.t1.exon3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  13369   13649   .       +       .       ID=Hc.01G000010.t1.exon3,Hc.01G000010.t1.exon4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  14397   14600   .       +       .       ID=Hc.01G000010.t1.exon4,Hc.01G000010.t1.exon5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1

Doesn't the import support {exon1_id},{exon2_id} IDs? I have looked at the source code but i'm not sure why it fails.

daler commented 2 years ago

Hmm. For all other attributes, a comma indicates multiple values for that attribute. I don't think it makes sense to have multiple values for the primary key, so this should raise an exception with a clearer description of the problem.

https://github.com/daler/gffutils/pull/193 does this.

In your case you can make it work by providing an id spec that will convert your list of IDs to a string, like this:

def intron_id(f):
    return ','.join(f['ID'])

db.update(introns, id_spec='intron': [intron_id]})

This test verifies it works with the GFF you provided.

daler commented 2 years ago

Now addressed in v0.11.