databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Implement a `return_tensors="pt"` function for the Tokenizers #12

Open nleroy917 opened 3 months ago

nleroy917 commented 3 months ago

I'm trying to optimize how BED files are processed through our tokenizers and models in actual production environments (like bedbase).

One bottleneck I encounter is creating tensors from lists of integers. I explain this in more detail in a PR over in geniml but, briefly, the current tokenizers can only return lists of integers for tokenized BED files. It could be more efficient to emit a Tensor directly. I think this is possible using some combination of the following Rust crates:

With this, the tokenizer could return a torch.Tensor object directly, so users would not need to convert between types, potentially saving time. Additionally, we could offer options for returning np.ndarray objects with rust-numpy.
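To make the idea concrete, here is a rough Python-side sketch of what a `return_tensors` switch could look like. The names `pack_token_ids` and `convert_tokens` are hypothetical and are not part of gtars or geniml; the real implementation would presumably live in Rust. The sketch assumes the token IDs can be packed into one contiguous int64 buffer, which `np.frombuffer` and `torch.frombuffer` can then wrap without converting each integer individually.

```python
from array import array


def pack_token_ids(token_ids):
    """Pack token IDs into one contiguous buffer of signed 64-bit integers."""
    return array("q", token_ids)


def convert_tokens(token_ids, return_tensors=None):
    """Hypothetical helper: return token IDs in the requested container.

    return_tensors=None -> plain list of ints (the current behaviour)
    return_tensors="np" -> numpy.ndarray sharing memory with the packed buffer
    return_tensors="pt" -> torch.Tensor sharing memory with the packed buffer
    """
    if return_tensors is None:
        return list(token_ids)

    buf = pack_token_ids(token_ids)
    if return_tensors == "np":
        import numpy as np  # imported lazily so NumPy stays optional
        return np.frombuffer(buf, dtype=np.int64)
    if return_tensors == "pt":
        import torch  # imported lazily so PyTorch stays optional
        return torch.frombuffer(buf, dtype=torch.int64)
    raise ValueError(f"unsupported return_tensors value: {return_tensors!r}")
```

With something like this, `convert_tokens(ids, return_tensors="pt")` would hand back a tensor without an intermediate list-to-tensor conversion. Edge cases such as empty inputs are not handled here.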

nleroy917 commented 2 months ago

I've actually implemented a to_numpy() function, but to_tensor() might be a little complicated...