daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
288 stars 78 forks source link

Insert features at a specific level without destroying structure? #84

Closed olgabot closed 2 years ago

olgabot commented 8 years ago

okay lots of questions from me ...

I'd like to add another layer of a splicing_event feature to an existing gffutils database, that is between the transcript and exon level. The reason for this is because I want to annotate alternative splicing events with their protein sequences and such that is hard to obtain when you have the events on their own, but is much easier when the events are integrated with the transcripts, exons, and CDSs.

Here is a drawing of what I mean.

img_20160904_131333

The splicing_event feature type is a first-level child of the transcript and a first-level parent of the exon. So then would exons become second-level children of the transcripts? But what would happen to exons that aren't in a splicing event? Would there need to be addition of a "phantom" splicing event to make sure there's still something that exists on that level?

I tried doing this using an existing database and got this error: sqlite3.IntegrityError: UNIQUE constraint failed: relations.parent, relations.child, relations.level, which I take it is because I'm disrupting the level 1 relationship of the exons to the transcripts by inserting this extra event.

What would be the "right" way to do this?

Traceback (most recent call last):
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/pdb.py", line 1661, in main
    pdb._runscript(mainpyfile)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/pdb.py", line 1542, in _runscript
    self.run(statement)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/bdb.py", line 431, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 3, in <module>
    import argparse
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 655, in main
    cl = CommandLine(sys.argv[1:])
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 197, in __init__
    self.args.func()
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 201, in index
    index.execute()
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 498, in execute
    self.make_events_by_traversing_graph(event_maker, db)
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 412, in make_events_by_traversing_graph
    events_of_type, db, splice_abbrev.lower(), event_db_filename)
  File "/Users/olga/workspace-git/outrigger/outrigger/index/events.py", line 166, in event_df_to_gff
    db.add_relation(transcript_id, event, level=1)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/interface.py", line 902, in add_relation
    VALUES (?, ?, ?)''', (parent.id, child.id, level))
sqlite3.IntegrityError: UNIQUE constraint failed: relations.parent, relations.child, relations.level
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/interface.py(902)add_relation()
-> VALUES (?, ?, ?)''', (parent.id, child.id, level))
daler commented 8 years ago

Hmm. Currently I'm using levels 1 and 2 in the relations table. The relations table where level = 1 is what defines the DAG (direction encoded by which is "parent" and which is "child"). The entries where level = 2 are essentially a shortcut to avoid traversing the DAG when doing the domain-specific task of getting exons of a gene. But those 1s and 2s are nominal rather than ordinal values -- things would work the same if I had called them "a" and "b". So you might be able to get away with adding a level 3 to indicate splicing events if it was useful to sort of step outside the hierarchy.

Or another angle: your graph shows CDSs as children of exons but in most GFF/GTF files I've seen, CDSs are siblings of exons and both are children of transcripts. So there is already support for e.g. multiple level-1 children. I think the last 2 rows of this table are all you have to add:

parent child level
gene transcript 1
transcript exon 1
transcript cds 1
transcript splice 1
splice exon 1

I think this should work at least in principle. Can you provide a small example that gives you the error you're seeing?

daler commented 2 years ago

Closing, but please reopen if this continues to be a need and gffutils doesn't support it.