Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.
Implement a `return_tensors="pt"` function for the Tokenizers #12
I'm trying to optimize some things for more efficient processing of BED files through our tokenizers and models in actual production environments (like bedbase).

One bottleneck I encounter is creating tensors from lists of integers. I explain this in more detail in a PR over in geniml but, briefly, the current tokenizers are only capable of returning lists of integers for tokenized BED files. It could be more efficient to emit a `Tensor` directly. I think that this is possible using some combination of the following Rust crates:

- `tch`
- `tch-ext`
- `pyo3-tch`

With this, users can get a `torch.Tensor` object back directly and there is no need to convert between types, potentially saving time. Additionally, we could offer options for returning `np.array` objects with `rust-numpy`.
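A minimal sketch of what such an option could look like from the Python side, assuming a `return_tensors` argument that accepts `"pt"` or `"np"` as in the title. `convert_ids` is a hypothetical helper, not an existing function in this repo or in geniml. It does the list-to-tensor conversion in Python, which is the step a Rust-side implementation would aim to make unnecessary.

```python
from typing import List, Optional


def convert_ids(ids: List[int], return_tensors: Optional[str] = None):
    """Hypothetical helper: return token ids in the requested container.

    None -> plain list of ints (what the tokenizers return today)
    "np" -> numpy.ndarray of int64
    "pt" -> torch.Tensor of dtype torch.long
    """
    if return_tensors is None:
        return ids
    if return_tensors == "np":
        import numpy as np  # imported lazily so NumPy stays optional

        return np.asarray(ids, dtype=np.int64)
    if return_tensors == "pt":
        import torch  # imported lazily so PyTorch stays optional

        # This copy from a Python list is the conversion cost described above.
        return torch.tensor(ids, dtype=torch.long)
    raise ValueError(f"unsupported return_tensors value: {return_tensors!r}")
```

With a Rust implementation, the `"pt"` and `"np"` branches would presumably hand back an object built on the Rust side instead of converting a Python list after the fact.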