daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
287 stars 78 forks source link

merge_strategy='create_unique' not always creating unique id's #214

Open novigit opened 1 year ago

novigit commented 1 year ago

The software liftoff compares two assemblies of very closely related genomes, one of which has gene annotations, and the other does not. It then tries to find which regions in the "bare" genomes align well with the "annotated" genome, and then uses the alignment and the gene coordinates of the "annotated" genome to predict the gene coordinates of the equivalent gene in the "bare" genome.

When a template gene is found once in the "annotated" genome but multiple times in the "bare" genome, the attribute ID of the template gene is retained for the first copy in the "bare" genome, but for the second copy a _1 is appended to the attribute ID.

See for example an excerpt from a resultant liftoff gff3 file (toy.edit.gff3):

Seq_18  Liftoff gene    653259  653653  .       +       .       ID=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653259  653334  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653370  653458  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653489  653543  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653577  653653  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1

Seq_28  Liftoff gene    494693  495087  .       -       .       ID=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494693  494769  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494803  494857  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494888  494976  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     495012  495087  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene

The QZVMAR_2781_gene of the original genome has been found twice in the target genome, and the gene and CDS features of the second copy (here on top of the file) have been appended with a _1

As you can imagine, since the merge_strategy='create_unique' uses a similar strategy, this leads to perhaps some unexpected behavior. Applying (trials.py)

db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')

to this GFF3 file leads to the following error:

Traceback (most recent call last):
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 622, in _populate_from_lines
    self._insert(f, c)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./trials.py", line 5, in <module>
    db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 1401, in create_db
    c.create()
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 543, in create
    self._populate_from_lines(self.iterator)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 656, in _populate_from_lines
    self._insert(f, c)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

My guess of what is going on here is that once the script gets to the second CDS of the second gene (the 9th line), it will try to append ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54 with a _1. However, its result, ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1 is already an ID of a CDS of the first gene! Hence, the unique requirement fails and the script crashes.

I've come up with following workaround:

def id_func(x):
    new_id = ''
    if x.featuretype == 'gene':
        new_id = x.attributes['ID'][0]
    elif x.featuretype == 'CDS':
        new_id = '-'.join( [x.attributes['ID'][0], x.seqid, str(x.start), str(x.end)] )
    return new_id

db = gffutils.create_db('toy.edit.gff3', id_spec=id_func, dbfn=':memory:')

which basically creates a custom ID for each CDS feature, using the ID, the start and end coordinates to make sure it is unique.

In any case, I just wanted to point out this unfortunate coincidence between Liftoff and gffutils which leads to errors.

Perhaps it would be possible to update the create_unique function to take into account such cases?

daler commented 1 year ago

Thanks for the detailed description!

Part of the challenge with gffutils is covering all the corner cases simultaneously...your solution will work for this GFF, but I think it may fail in cases where there are identical IDs and positions (but perhaps the feature type or other attributes differ).

Another solution would be to keep trying to increment until you hit something that doesn't exist yet. In that case I think I'd want to have all sorts of warnings because it would be easy to have duplicate features (basically any time you add anything you'd get new entries).

Honestly I think the best solution is to to use the id_spec exactly as you have.

But maybe we could make it easier to discover, like an id_spec module where you could do:

from gffutils import idspecs
db = gffutils.create_db('toy.edit.gff3', id_spec=idpsecs.liftoff, dbfn=':memory:')

I'm not sure what other id_specs would be useful, though I'd imagine digging through past issues could give some clues. Actually, now that I think about it some more, maybe instead of just id_specs it should be entire sets of tool-specific kwargs, so something like:

from gffutils import tool_compatibility
db = gffutils.create_db('toy.edit.gff3', dbfn=':memory:', **tool_compatibility.liftoff)

Such a module would be reasonably straightforward to maintain. The tricky part would be identifying the tools to support and tracking down example files to use for testing.

Thoughts? Or other ideas?

novigit commented 1 year ago

Part of the challenge with gffutils is covering all the corner cases simultaneously...your solution will work for this GFF, but I think it may fail in cases where there are identical IDs and positions (but perhaps the feature type or other attributes differ).

Good point.

A set of tool specific functions / arguments sounds like a good idea to me. Even with the GFF3 standardization, lots of tools still seem to have their own quircky behaviours. Like you said it may be impossible to have a universal function that works for all of them.