TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

Tile_DB & Neo4j? #1820

Open olszewskip opened 4 years ago

olszewskip commented 4 years ago

Hi! Sorry if this is not the right place to ask this, or if my question is not specific enough: I'm looking for a solution to work with genomic data, primarily VCFs, and TileDB struck me as a really cool option, thanks to its emphasis on exploiting sparsity and modelling data as multidimensional matrices. But I would also like to somehow integrate the data encoded by a VCF with inherently graph- or tree-like data, e.g. biological ontologies or protein-interaction data. An example of the latter is https://het.io/about/, which is a Neo4j database. Apparently there are even applications where it makes sense to import the whole VCF into Neo4j as well: https://github.com/phenopolis/pheno4j. More generally, I think it makes sense to equate a matrix that is sparse in its first two dimensions with a graph (the sparse matrix being the adjacency matrix of the graph). My questions are:

stavrospapadopoulos commented 4 years ago

Hi @olszewskip, thanks for reaching out!

Regarding VCF data, have you checked https://github.com/TileDB-Inc/TileDB-VCF? We are quite actively developing it.

Regarding representing graph data as sparse adjacency matrices, this is of great interest to me (and was the original motivation behind TileDB). Both exporting a TileDB adjacency matrix to Neo4j and implementing graph algorithms (via sparse linear algebra) are very interesting. We have been having such discussions internally for some time, and we'll add some initial implementations to our roadmap. Of course, we always welcome contributions and are open to feature design discussions. You could also add feature requests here: https://feedback.tiledb.com/

Thanks!
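For readers following the thread, here is a minimal sketch of the "sparse adjacency matrix as a graph" idea mentioned above. It is plain Python and does not use the TileDB or Neo4j APIs; the toy edge list and the helper names (`to_coo`, `one_hop`, `bfs_levels`) are hypothetical and chosen only for illustration. It stores only the non-empty cells of the matrix and runs a breadth-first search as repeated boolean vector-matrix products, which is the sparse linear algebra view of graph traversal.

```python
# Hypothetical toy edge list (source, target). Each pair stands in for one
# non-empty cell (row, col) of a sparse 2D adjacency matrix.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]


def to_coo(edge_list):
    """Keep only the non-zero cells, coordinate-style: {(row, col): value}."""
    return {(src, dst): 1 for src, dst in edge_list}


def one_hop(coo, frontier):
    """Boolean vector x sparse matrix product.

    Returns every column index reachable from a row index in `frontier`.
    """
    return {dst for (src, dst), val in coo.items() if val and src in frontier}


def bfs_levels(coo, start):
    """Breadth-first search expressed as repeated sparse products."""
    levels = {start: 0}
    frontier = {start}
    depth = 0
    while frontier:
        depth += 1
        # Drop nodes that were already reached at a shallower depth.
        frontier = one_hop(coo, frontier) - set(levels)
        for node in frontier:
            levels[node] = depth
    return levels


adjacency = to_coo(edges)
levels = bfs_levels(adjacency, 0)
print(levels)
```

The same `(row, col)` pairs are also an edge list, which is the form a graph database import would typically start from.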

olszewskip commented 4 years ago

Awesome! Thank you for the references (I had read some of the documentation before; the discussion in https://docs.tiledb.com/genomics/storing-variants-as-arrays is particularly nice), and for the sanity check. Glad to know the TileDB team has this direction in mind. Being able to integrate heterogeneous sparse data and programmatically construct and run efficient queries on it from Python (or Julia?) would be just mind-blowing :)