Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.
It would be cool to implement a soft tokenizer so we can use it in some of our actual models. The soft tokenizer considers the overlap between the query and the universe (vocab). Using this information, we can randomly sample universe regions (with replacement), treating the normalized overlap scores as a probability distribution.
From the meeting, it was noted that smaller regions would show up more often than large regions, since their overlap percentage would always be larger (being smaller, they are easier to cover completely).
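To make the size bias concrete, here is a small sketch. It assumes the overlap score is the number of overlapping base pairs divided by the universe region's own length, which is only one possible definition; the function names and coordinates are made up for illustration.

```python
def overlap_bp(a, b):
    """Base pairs shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_fraction(query, region):
    """Overlap normalized by the universe region's length (assumed scoring)."""
    return overlap_bp(query, region) / (region[1] - region[0])


query = (100, 300)
small_region = (150, 200)  # 50 bp, fully covered by the query
large_region = (0, 1000)   # 1000 bp, only 200 bp covered by the query

small_score = overlap_fraction(query, small_region)  # 50 / 50
large_score = overlap_fraction(query, large_region)  # 200 / 1000
```

Under this scoring the small region gets the higher score even though the large region shares four times as many base pairs with the query, which is the effect raised in the meeting.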
Here is a Rust crate that will let you sample from distributions: https://docs.rs/rand_distr/latest/rand_distr/
I would use it similarly in Python:
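A minimal sketch of the idea, using only the standard library's `random.choices` for weighted sampling with replacement. The function `soft_tokenize`, the toy universe, and the choice of raw base-pair overlap as the weight are all assumptions for illustration, not the geniml API.

```python
import random


def overlap_bp(a, b):
    """Base pairs shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def soft_tokenize(query, universe, k, rng=None):
    """Sample k universe indices with replacement, weighted by overlap.

    Hypothetical helper: weights are raw base-pair overlaps, which
    random.choices normalizes into a probability distribution.
    """
    rng = rng or random.Random()
    weights = [overlap_bp(query, region) for region in universe]
    if sum(weights) == 0:
        # The query overlaps nothing in the universe, so there is nothing to sample.
        return []
    return rng.choices(range(len(universe)), weights=weights, k=k)


universe = [(0, 100), (100, 200), (500, 600)]  # toy vocab
query = (50, 180)  # overlaps region 0 by 50 bp and region 1 by 80 bp

tokens = soft_tokenize(query, universe, k=5, rng=random.Random(0))
```

Weighting by raw base pairs rather than by fraction of the region covered is one way to soften the small-region bias noted above, though which scoring works best for the models is still an open question.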