daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

Graph database? #35

Open olgabot opened 10 years ago

olgabot commented 10 years ago

This is not an issue, more of a question. It takes some serious SQL wrangling to get parent-child or grandparent-child information about gene-transcript-exon relationships. Have you thought about using a graph database for gffutils? One drawback: there doesn't seem to be a SQLite-style, file-based equivalent of Node.js or TitanDB, so you would have to open up a separate port.
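For readers following along, here is a minimal sketch of the kind of SQL this question refers to. It assumes a simplified, hypothetical schema (a `features` table plus a `relations` table with one row per ancestor/descendant pair and a `level` column); the table layout, IDs, and helper names are illustrative and are not taken from the gffutils source.

```python
import sqlite3

# Hypothetical, simplified schema: level=1 marks a direct child,
# level=2 marks a grandchild. All IDs below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT);
CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER);
INSERT INTO features VALUES
  ('gene1', 'gene'), ('tx1', 'mRNA'), ('tx2', 'mRNA'),
  ('exon1', 'exon'), ('exon2', 'exon');
INSERT INTO relations VALUES
  ('gene1', 'tx1', 1), ('gene1', 'tx2', 1),
  ('tx1', 'exon1', 1), ('tx1', 'exon2', 1), ('tx2', 'exon2', 1),
  ('gene1', 'exon1', 2), ('gene1', 'exon2', 2);
""")


def descendants(conn, parent_id, level, featuretype):
    """Return IDs of descendants of parent_id at the given level and type."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.id
        FROM relations r
        JOIN features f ON f.id = r.child
        WHERE r.parent = ? AND r.level = ? AND f.featuretype = ?
        ORDER BY f.id
        """,
        (parent_id, level, featuretype),
    )
    return [row[0] for row in rows]


def ancestors(conn, child_id, level):
    """Return IDs of ancestors of child_id at the given level."""
    rows = conn.execute(
        "SELECT DISTINCT parent FROM relations "
        "WHERE child = ? AND level = ? ORDER BY parent",
        (child_id, level),
    )
    return [row[0] for row in rows]


gene_exons = descendants(conn, "gene1", 2, "exon")   # grandchildren of the gene
exon_transcripts = ancestors(conn, "exon2", 1)       # parents of an exon
exon_genes = ancestors(conn, "exon2", 2)             # grandparents of an exon
```

Each hop in the hierarchy costs a join or a `level` filter, which is the awkwardness a graph database would aim to hide.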

yarden commented 10 years ago

It's a pain partly because of the GFF specification. GFFs only encode trees, so full graph support is not needed, but the format is bad at supporting them. To somewhat get around this, I clean up ("sanitize") all my GFFs to have a field that runs through each gene hierarchy (see gff_sanitize and core code here: http://pythonhosted.org/gffutils/autodocs/gffutils.helpers.sanitize_gff_db.html). This makes the files grep-able and easy to query with SQL. It makes the GFF more GTF-like. This only makes sense for canonical gene -> mRNA -> exons hierarchies.

Also, there's some support for iterating over parent-child pairs for canonical hierarchies that might make what you're trying to do easier: iter_by_parent_childs in http://pythonhosted.org/gffutils/autodocs/gffutils.FeatureDB.html
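The sanitizing idea described above (one field that runs through each gene hierarchy) can be sketched roughly as follows. This is not the `sanitize_gff_db` implementation; it is a hypothetical stand-alone function that assumes GFF3-style `ID=`/`Parent=` attributes, a single parent per feature, and a made-up attribute name `gene_id`.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column such as 'ID=tx1;Parent=gene1' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def add_gene_id(gff_lines):
    """Append gene_id=<root feature ID> to every feature line (hypothetical helper)."""
    records = [
        line.rstrip("\n").split("\t")
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    ]

    # Map each feature ID to its (first) parent ID.
    parent_of = {}
    for fields in records:
        attrs = parse_attributes(fields[8])
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"].split(",")[0]

    def root(feature_id):
        """Walk up the Parent chain until a feature with no parent is reached."""
        seen = set()
        while feature_id in parent_of and feature_id not in seen:
            seen.add(feature_id)  # guards against accidental cycles
            feature_id = parent_of[feature_id]
        return feature_id

    out = []
    for fields in records:
        attrs = parse_attributes(fields[8])
        # Features without an ID (common for exons) start from their parent.
        start = attrs.get("ID") or attrs.get("Parent", "").split(",")[0]
        if start:
            fields = fields[:8] + [fields[8].rstrip(";") + ";gene_id=" + root(start)]
        out.append("\t".join(fields))
    return out


gff = [
    "chr1\t.\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\t.\texon\t1\t200\t.\t+\t.\tParent=tx1",
]
sanitized = add_gene_id(gff)
```

After this step, every line of the hierarchy carries the same `gene_id` value, so a plain `grep gene_id=gene1` or a single-column SQL filter pulls out the whole gene.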

daler commented 10 years ago

I hadn't heard of graph databases until you brought them up. After reading up on them a little, I'm pretty sure they would provide a substantial performance boost. But I wasn't able to find a file-based implementation either, Python or otherwise. Currently for me, managing a separate graph database and server is too much overhead compared to the almost transparent method of using a file-based database.

As Yarden alluded to, yes, the SQL can be awkward. But ideally, as many manipulations as possible would be hidden from the end user. In previous gffutils iterations, I had tried sqlalchemy to make the SQL a bit more straightforward, but I didn't consider the performance hit of the ORM overhead worth it. I had also tried loading the GFF into a graph structure (I think I had used networkx) and saving a pickle of it for persistence. But loading time turned out to be unacceptable, and the memory usage was another downside. So I went back to using good ol' hand-written queries with sqlite for performance.

Anyway, if I hit upon a use-case that's not already implemented, then I'll typically add a method to FeatureDB. If you have a specific task that's currently awkward/annoying to do in SQL, I'd be happy to add it as a method so others could benefit.

And if you ever find a file-based graph db, please let me know!

olgabot commented 9 years ago

Apparently there's a graph db implemented in Python as a layer over SQLite: https://github.com/eugene-eeo/graphlite

daler commented 9 years ago

Thanks, nice find.

So to use this in gffutils it would take some playing around to figure out if 2 databases are needed or if graphlite can work with an existing db (my hunch is the latter based on the docs). Then any of the logic that touches the current relations table would be ported to use graphlite. Then we'd need benchmarks to figure out if there are performance gains that make the additional complexity worth it.

Have you run across cases where gffutils currently doesn't work well or that you think would benefit from a graph db?
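As a possible baseline for such benchmarks, plain SQLite can already walk a hierarchy stored as direct edges only, using a recursive common table expression. The sketch below makes no claim about how graphlite works internally; the `edges` table and the `all_descendants` helper are hypothetical.

```python
import sqlite3

# Hypothetical edge list: only direct parent -> child links are stored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges (parent TEXT, child TEXT);
INSERT INTO edges VALUES
  ('gene1', 'tx1'), ('gene1', 'tx2'),
  ('tx1', 'exon1'), ('tx1', 'exon2'), ('tx2', 'exon2');
""")


def all_descendants(conn, root_id):
    """Return (id, depth) for every feature reachable from root_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(id, depth) AS (
            SELECT child, 1 FROM edges WHERE parent = ?
            UNION
            SELECT e.child, w.depth + 1
            FROM edges e JOIN walk w ON e.parent = w.id
        )
        SELECT id, MIN(depth) FROM walk
        GROUP BY id
        ORDER BY MIN(depth), id
        """,
        (root_id,),
    )
    return rows.fetchall()


tree = all_descendants(conn, "gene1")
```

The trade-off is that depth is computed at query time instead of being stored, so timing this against a precomputed relations table would be part of any comparison.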

olgabot commented 9 years ago

For me specifically, I operate mostly on exons so getting an exon from a particular location and all its transcripts and CDSs is a lot of what I do. I'll try to come up with a particular example that you can benchmark against. I'm annotating splicing events for my paper right now so this is great timing :)
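A rough sketch of that exon-centric lookup in plain SQL, which could serve as a starting point for a benchmark. The schema, the column names (`start`, `stop`), the coordinates, and the helper functions are all hypothetical and simplified; they are not the gffutils API.

```python
import sqlite3

# Hypothetical, simplified tables with made-up coordinates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (
  id TEXT PRIMARY KEY, seqid TEXT, featuretype TEXT,
  start INTEGER, stop INTEGER
);
CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER);
INSERT INTO features VALUES
  ('gene1', 'chr1', 'gene', 1, 1000),
  ('tx1', 'chr1', 'mRNA', 1, 1000),
  ('tx2', 'chr1', 'mRNA', 250, 1000),
  ('exon1', 'chr1', 'exon', 100, 200),
  ('exon2', 'chr1', 'exon', 300, 400),
  ('cds1', 'chr1', 'CDS', 150, 200),
  ('cds2', 'chr1', 'CDS', 300, 400),
  ('cds3', 'chr1', 'CDS', 300, 350);
INSERT INTO relations VALUES
  ('gene1', 'tx1', 1), ('gene1', 'tx2', 1),
  ('tx1', 'exon1', 1), ('tx1', 'exon2', 1), ('tx2', 'exon2', 1),
  ('tx1', 'cds1', 1), ('tx1', 'cds2', 1), ('tx2', 'cds3', 1);
""")


def exons_at(conn, seqid, start, stop):
    """Exons overlapping the closed interval [start, stop] on seqid."""
    rows = conn.execute(
        "SELECT id FROM features "
        "WHERE featuretype = 'exon' AND seqid = ? AND start <= ? AND stop >= ? "
        "ORDER BY id",
        (seqid, stop, start),
    )
    return [row[0] for row in rows]


def transcripts_of(conn, exon_id):
    """Direct parents (transcripts) of an exon."""
    rows = conn.execute(
        "SELECT p.id FROM relations r JOIN features p ON p.id = r.parent "
        "WHERE r.child = ? AND r.level = 1 ORDER BY p.id",
        (exon_id,),
    )
    return [row[0] for row in rows]


def cds_overlapping(conn, exon_id):
    """CDS features that share a transcript with the exon and overlap it."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.id
        FROM features e
        JOIN relations re ON re.child = e.id AND re.level = 1
        JOIN relations rc ON rc.parent = re.parent AND rc.level = 1
        JOIN features c ON c.id = rc.child AND c.featuretype = 'CDS'
        WHERE e.id = ? AND c.start <= e.stop AND c.stop >= e.start
        ORDER BY c.id
        """,
        (exon_id,),
    )
    return [row[0] for row in rows]


hits = exons_at(conn, "chr1", 320, 330)
parents = transcripts_of(conn, hits[0])
coding = cds_overlapping(conn, hits[0])
```

The CDS lookup needs two passes through `relations` (exon up to transcript, transcript back down to CDS), which is the sort of sibling query where a graph-style traversal might read more naturally.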
