databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml - our genomic machine learning python package.

Python bindings memory leak and performance issues #21

Closed nleroy917 closed 1 month ago

nleroy917 commented 1 month ago

The problem

With the latest release (v0.0.11), I think I introduced a memory leak and severe performance degradation in the python bindings. When tokenizing single-cell datasets (AnnData objects)... I notice two things: 1) memory use starts to explode, and 2) it is very, very slow.

The source of slowness

Through investigation and experimentation, I think that I have narrowed it down to the creation of TokenizedRegionSet structs inside the python bindings. Here is the code I call when tokenizing:

pub fn __call__(&self, regions: &Bound<'_, PyAny>) -> Result<PyTokenizedRegionSet> {
    // attempt to map the list to a vector of regions
    let rs = extract_regions_from_py_any(regions)?;

    // tokenize the RegionSet
    let tokenized = self.tokenizer.tokenize_region_set(&rs);

    Ok(tokenized.into())
}

Without getting too much into it... self.tokenizer.tokenize_region_set(&rs) will return a TokenizedRegionSet. This is a struct that exists in the core genimtools crate. It needs to be "python-ified" such that we can return it to Python (i.e. it must be turned into a PyTokenizedRegionSet). To facilitate this, I implemented the From trait for PyTokenizedRegionSet like so:

impl From<TokenizedRegionSet<'_>> for PyTokenizedRegionSet {
    fn from(value: TokenizedRegionSet) -> Self {
        PyTokenizedRegionSet {
            ids: value.ids,
            universe: value.universe.to_owned().into(),
            curr: 0,
        }
    }
}

And now we can call .into() on a TokenizedRegionSet and it will be converted into a PyTokenizedRegionSet before returning.

What's interesting about this is that the slowest part here is the tokenized.into() call. So the actual tokenization is very fast, but converting to the correct type is slow.

extract_regions_from_py_any: 425 µs
tokenize_region_set: 604 µs
tokenized.into(): 404 ms // <---- roughly 700x slower than the actual tokenization
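For context, timings like these can be collected by dropping ad-hoc std::time::Instant measurements around each step of the __call__ shown above. This is a sketch I'm adding for illustration, not the actual profiling code used to get the numbers:

// Same method as earlier, instrumented with timers (illustration only).
pub fn __call__(&self, regions: &Bound<'_, PyAny>) -> Result<PyTokenizedRegionSet> {
    use std::time::Instant;

    let t = Instant::now();
    let rs = extract_regions_from_py_any(regions)?;
    eprintln!("extract_regions_from_py_any: {:?}", t.elapsed());

    let t = Instant::now();
    let tokenized = self.tokenizer.tokenize_region_set(&rs);
    eprintln!("tokenize_region_set: {:?}", t.elapsed());

    let t = Instant::now();
    let result = tokenized.into();
    eprintln!("tokenized.into(): {:?}", t.elapsed());

    Ok(result)
}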

The reason for the slowness

You'll notice in my above From implementation that I call .into() on the TokenizedRegionSet's universe. This is because I need to convert the core Universe struct into yet another "python-ified" PyUniverse... I'll skip the details, but about 300 of those 400 milliseconds are spent doing that conversion. Nearly half a second per cell, multiplied across a hundred thousand cells, works out to over eleven hours of conversion overhead alone. To get around Rust's borrow checker, I am just cloning the Universe for each PyTokenizedRegionSet that we spit out -- so inefficient.
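I haven't traced that conversion line by line, but shape-wise the cost looks like this: the From<Universe> for PyUniverse impl has to rebuild an owned, "python-ified" copy of everything the universe holds, and it does so once per tokenized cell. A rough sketch of the kind of impl I mean (the regions field and the PyRegion type are illustrative guesses, not the real structs):

// Illustrative only -- the real Universe/PyUniverse have their own fields.
impl From<Universe> for PyUniverse {
    fn from(value: Universe) -> Self {
        PyUniverse {
            // O(universe size) work, repeated for every single cell we tokenize
            regions: value.regions.into_iter().map(PyRegion::from).collect(),
        }
    }
}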

This is not a problem in the core crate

In the core crate... luckily, I was smart enough to know that this was a very bad idea. Look at the definition of the TokenizedRegionSet struct:

pub struct TokenizedRegionSet<'a> {
    pub ids: Vec<u32>,
    pub universe: &'a Universe,
}

We just hold a reference to the Universe, so one doesn't need to clone anything. Using lifetimes, I denote that the TokenizedRegionSet is valid only as long as the Universe is valid. OK, so let's do that in the bindings too... the issue is

Lifetimes are forbidden in pyo3

This code won't compile:

#[pyclass(name = "TokenizedRegionSet")]
#[derive(Clone, Debug)]
pub struct PyTokenizedRegionSet<'a> {
    pub ids: Vec<u32>,
    pub universe: &'a PyUniverse,
    curr: usize,
}

It even links to some convenient documentation:

Rust lifetimes are used by the Rust compiler to reason about a program's memory safety. They are a compile-time only concept; there is no way to access Rust lifetimes at runtime from a dynamic language like Python.

As soon as Rust data is exposed to Python, there is no guarantee that the Rust compiler can make on how long the data will live. Python is a reference-counted language and those references can be held for an arbitrarily long time which is untraceable by the Rust compiler. The only possible way to express this correctly is to require that any #[pyclass] does not borrow data for any lifetime shorter than the 'static lifetime, i.e. the #[pyclass] cannot have any lifetime parameters.

When you need to share ownership of data between Python and Rust, instead of using borrowed references with lifetimes consider using reference-counted smart pointers such as Arc or Py.
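To make that suggestion concrete in plain Rust terms, here is a generic illustration with stand-in types (not the bindings' actual code): a reference-counted pointer replaces the compile-time borrow with a runtime count, so every holder keeps the shared data alive without copying it.

use std::sync::Arc;

struct Universe {
    regions: Vec<String>,
}

struct TokenizedRegionSet {
    ids: Vec<u32>,
    universe: Arc<Universe>, // shared ownership, no lifetime parameter needed
}

fn main() {
    let universe = Arc::new(Universe {
        regions: vec!["chr1:0-100".to_string()],
    });

    // Cloning the Arc only bumps a reference count; the Universe itself is never copied.
    let a = TokenizedRegionSet { ids: vec![1], universe: Arc::clone(&universe) };
    let b = TokenizedRegionSet { ids: vec![2], universe: Arc::clone(&universe) };

    assert_eq!(Arc::strong_count(&universe), 3);
    println!("{} and {} share {} regions", a.ids[0], b.ids[0], a.universe.regions.len());
}

In pyo3, Py<PyUniverse> plays the same role, with the reference count managed by the Python interpreter rather than by Arc.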

The solution

It seems that the solution here is to just use the Py struct to wrap the Universe in some sort of Python-specific shared reference: https://docs.rs/pyo3/0.21.2/pyo3/struct.Py.html

However, I am still trying to understand that documentation and implement it. If I don't get to it beforehand, I plan to bring all this to Rust Club for help.

Intermediate solutions

For now, one can interact directly with tokenize and encode to get the form they want (regions or ids respectively) as these do not require cloning the universe.

nleroy917 commented 1 month ago

The solution was to lift the universe up to the PyTreeTokenizer, like so:

#[pyclass(name = "TreeTokenizer")]
pub struct PyTreeTokenizer {
    pub tokenizer: TreeTokenizer,
    pub universe: Py<PyUniverse>,
}

Then, when we actually convert a TokenizedRegionSet to a PyTokenizedRegionSet, we just clone the reference to the Py-wrapped universe that sits on the tokenizer with clone_ref:

pub fn __call__(&self, regions: &Bound<'_, PyAny>) -> Result<PyTokenizedRegionSet> {
    // attempt to map the list to a vector of regions
    let rs = extract_regions_from_py_any(regions)?;

    // tokenize the RegionSet
    let tokenized = self.tokenizer.tokenize_region_set(&rs);

    Python::with_gil(|py| {
        let py_tokenized_region_set = PyTokenizedRegionSet {
            ids: tokenized.ids,
            curr: 0,
            universe: self.universe.clone_ref(py),
        };

        Ok(py_tokenized_region_set)
    })
}
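For completeness, the Py<PyUniverse> on the tokenizer has to be created somewhere; a minimal sketch of a constructor that would do this (the TreeTokenizer::try_from call and the .universe field access here are illustrative assumptions, not code from this issue):

#[pymethods]
impl PyTreeTokenizer {
    #[new]
    pub fn new(path: String) -> Result<Self> {
        // Hypothetical construction path; the real code may build the tokenizer differently.
        let tokenizer = TreeTokenizer::try_from(std::path::Path::new(&path))?;

        // Convert the core Universe to a PyUniverse exactly once, then hand ownership
        // to Python via Py::new so later calls can clone_ref it cheaply.
        let universe = tokenizer.universe.to_owned(); // assumes the tokenizer exposes its universe
        let universe = Python::with_gil(|py| Py::new(py, PyUniverse::from(universe)))?;

        Ok(PyTreeTokenizer { tokenizer, universe })
    }
}

This way the expensive Universe -> PyUniverse conversion happens once per tokenizer instead of once per cell.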