Closed: nleroy917 closed this 1 month ago
The solution was to lift the universe up to the `PyTreeTokenizer`, like so:
```rust
#[pyclass(name = "TreeTokenizer")]
pub struct PyTreeTokenizer {
    pub tokenizer: TreeTokenizer,
    pub universe: Py<PyUniverse>,
}
```
Then, when we actually convert a `TokenizedRegionSet` to a `PyTokenizedRegionSet`, we just clone the reference to the universe on the `Py`-wrapped struct that sits on the tokenizer with `clone_ref`:
```rust
pub fn __call__(&self, regions: &Bound<'_, PyAny>) -> Result<PyTokenizedRegionSet> {
    // attempt to map the list to a vector of regions
    let rs = extract_regions_from_py_any(regions)?;

    // tokenize the RegionSet
    let tokenized = self.tokenizer.tokenize_region_set(&rs);

    Python::with_gil(|py| {
        let py_tokenized_region_set = PyTokenizedRegionSet {
            ids: tokenized.ids,
            curr: 0,
            universe: self.universe.clone_ref(py),
        };

        Ok(py_tokenized_region_set)
    })
}
```
The problem
With the latest release (v0.0.11), I think I introduced a memory leak and severe performance degradation in the Python bindings. When tokenizing single-cell datasets (AnnData objects), I notice two things: 1) memory use starts to explode, and 2) it is very, very slow.

The source of slowness
Through investigation and experimentation, I think I have narrowed it down to the creation of `TokenizedRegionSet` structs inside the Python bindings. Here is the code I call when tokenizing:

Without getting too much into it... `self.tokenizer.tokenize_region_set(&rs)` will return a `TokenizedRegionSet`. This is a struct that exists in the core `genimtools` crate. It needs to be "python-ified" so that we can return it to Python (i.e. it must be turned into a `PyTokenizedRegionSet`). To facilitate this, I implemented the `From` trait for `PyTokenizedRegionSet` like so:

And now we can call `into` on a `TokenizedRegionSet` and it will be converted into the proper `PyTokenizedRegionSet` type before returning.

What's interesting is that the slowest part here is the `tokenized.into()` call. So the actual tokenization is very fast, but converting to the correct type is slow.

The reason for the slowness
You'll notice in my above `From` implementation that I call `.into()` on the `TokenizedRegionSet`'s universe. This is because I need to convert the core `Universe` struct into yet another "python-ified" `PyUniverse`... I'll skip the details, but about 300 of those 400 milliseconds are used to do this. Nearly half a second for each cell, multiplied across a hundred thousand cells, is very, very slow. To get around Rust's borrow checker, I am just cloning the `Universe` for each `PyTokenizedRegionSet` that we spit out -- so inefficient.

This is not a problem in the core crate
In the core crate... luckily, I was smart enough to know that this was a very bad idea. Look at the definition of the `TokenizedRegionSet` struct:

We just hold a reference to the `Universe`, so one doesn't need to clone anything. Using lifetimes, I denote that the `TokenizedRegionSet` is valid as long as the `Universe` is valid. Ok, so let's do that in the bindings too... the issue is:

Lifetimes are forbidden in pyo3
This code won't compile:
It even links to some convenient documentation:
The solution
It seems that the solution here is to just use the `Py` struct to wrap the `Universe` in some sort of Python-specific shared reference: https://docs.rs/pyo3/0.21.2/pyo3/struct.Py.html

However, I am still trying to understand that documentation and implement it. If I don't get to it beforehand, I plan to bring all this to Rust Club for help.
Intermediate solutions
For now, one can interact directly with `tokenize` and `encode` to get the form they want (regions or ids, respectively), as these do not require cloning the universe.