databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml - our genomic machine learning python package.

Implement IGD #28

Open nleroy917 opened 3 months ago

nleroy917 commented 3 months ago

We need to re-implement IGD in this crate. This is being done by @donaldcampbelljr in #9.

Original code here: https://github.com/databio/IGD

nleroy917 commented 3 months ago

For python bindings, we could do an OOP approach:

from gtars.igd import Igd

igd = Igd.create_from_files(
    source_files="path/to/files",
    output_folder="path/to/output",
    database_name="mydb"
)

# way later
igd = Igd.load_db("path/to/database")
igd.search(...)

donaldcampbelljr commented 2 months ago

IGD create and search now work in PR #9 with some caveats.

An IGD database can be created from a folder of BED files. A search can then be performed against it using a single BED file as the query.
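
To make the create/search semantics concrete, here is a toy, in-memory sketch of the idea in Python. It is not the gtars or IGD implementation: it does a naive scan instead of using an indexed on-disk database, and the names `parse_bed`, `build_toy_db`, and `search_toy_db` are hypothetical, made up for this illustration. It assumes standard BED coordinates (0-based, half-open) and reports, per database file, how many (query, region) pairs overlap.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end) from BED-formatted lines, skipping headers."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def build_toy_db(files):
    """Map each chromosome to a list of (start, end, file_name) records.

    `files` is a dict of {file_name: iterable of BED lines}. This stands in
    for "a folder of BED files"; it is an assumption made to keep the
    sketch self-contained.
    """
    db = defaultdict(list)
    for name, lines in files.items():
        for chrom, start, end in parse_bed(lines):
            db[chrom].append((start, end, name))
    return db


def search_toy_db(db, query_lines):
    """Count, per database file, the (query, region) pairs that overlap."""
    hits = defaultdict(int)
    for chrom, q_start, q_end in parse_bed(query_lines):
        for start, end, name in db.get(chrom, []):
            # BED intervals are 0-based and half-open: [start, end)
            if start < q_end and q_start < end:
                hits[name] += 1
    return dict(hits)


if __name__ == "__main__":
    files = {
        "a.bed": ["chr1\t100\t200", "chr1\t500\t600"],
        "b.bed": ["chr1\t150\t250", "chr2\t0\t50"],
    }
    db = build_toy_db(files)
    print(search_toy_db(db, ["chr1\t180\t520"]))
```

The naive scan is linear in the database size for every query region, which is exactly the cost an indexed database is meant to avoid. The sketch is only useful as a reference for what a search should return when comparing outputs.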

Performance-wise, database creation time appears to be similar for the C and Rust versions: about 2.1 seconds for 80 files (~280,000 regions).

There are some discrepancies between the C and Rust versions that should be investigated in the future, such as:

[image]

donaldcampbelljr commented 1 month ago

I just merged the PR that has been in progress since the beginning of the year. However, IGD still needs some work.