bystrogenomics / bystro

Natural Language Search and Analysis of High Dimensional Genomic Data
Apache License 2.0

Investigate use of named databases #23

Open akotlar opened 6 years ago

akotlar commented 6 years ago

In this version, every track would get a separate named database, as opposed to a key in the serialized data structure.

The advantage is a substantially simpler insertion model: each track can be written on its own, which will allow us to update the database modularly.

The disadvantages may be read performance and size. Each named database will need its own header; the exact size still needs to be investigated, but it may be 16 bytes. We will also need to deserialize N times for N tracks, although each deserialization will be simpler.
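To make the tradeoff concrete, here is a minimal sketch of the two layouts. It is not the Bystro implementation: plain Python dicts stand in for the databases, `json` stands in for whatever serializer the real code uses, and all function names and example values are hypothetical.

```python
import json

# Layout A (current): one database, one serialized record per position,
# with each track stored under a key inside that record.
def put_track_single(db, pos, track, value):
    # Inserting one track requires read, deserialize, modify, reserialize.
    record = json.loads(db[pos]) if pos in db else {}
    record[track] = value
    db[pos] = json.dumps(record)

def get_all_single(db, pos):
    # One lookup and one deserialization returns every track.
    return json.loads(db[pos]) if pos in db else {}

# Layout B (proposed): one named database per track.
def put_track_named(dbs, pos, track, value):
    # Inserting one track touches only that track's database.
    dbs.setdefault(track, {})[pos] = json.dumps(value)

def get_all_named(dbs, pos):
    # N lookups and N (smaller) deserializations for N tracks.
    return {t: json.loads(db[pos]) for t, db in dbs.items() if pos in db}

if __name__ == "__main__":
    single, named = {}, {}
    # Made-up values; the track names are taken from this issue.
    for track, value in [("refSeq.gene", ["GENE1"]), ("nearest.refSeq", ["GENE2"])]:
        put_track_single(single, 100, track, value)
        put_track_named(named, 100, track, value)
    print(get_all_single(single, 100) == get_all_named(named, 100))
```

Both layouts return the same annotation for a position; they differ in where the work happens. Layout A pays a read-modify-write on every insert, while layout B pays one lookup and one deserialization per track on every read.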

If annotation performance or database size is substantially impacted, or this change causes significantly higher CPU usage during annotation, the tradeoff will likely not be worth it. Currently, on the master branch, build times are 1 day with 3 additional whole-genome tracks (refSeq.gene, nearest.refSeq, nearestTss.refSeq), which cumulatively take ~7 hours. We re-run builds no more than once per month.

akotlar commented 6 years ago

This may be important for supporting modular databases, and databases that are customized per user.

akotlar commented 4 years ago

Addressed here, worth pulling in: https://github.com/akotlar/GenPro/blob/genpro-lmdb/lib/GenPro/DBManager.pm