databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Implement a soft tokenizer #7


nleroy917 commented 11 months ago

It would be cool to implement a soft tokenizer so we can use it in some of our actual models. The soft tokenizer considers the overlap between the query and the universe (vocab): using the overlap scores as a probability distribution, we can randomly sample tokens with replacement.

Here is a Rust crate that will let you sample from distributions: https://docs.rs/rand_distr/latest/rand_distr/
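For example, a weighted draw over candidate universe regions could look something like the sketch below. This is a minimal sketch, assuming overlap scores have already been computed for each candidate region; `sample_tokens` is a hypothetical helper, not an existing gtars function (`WeightedIndex` comes from the `rand` crate, which `rand_distr` builds on):

```rust
use rand::distributions::{Distribution, WeightedIndex};
use rand::thread_rng;

/// Sample `n` token indices (with replacement) from a set of candidate
/// universe regions, weighted by their overlap scores with the query.
fn sample_tokens(overlap_scores: &[f64], n: usize) -> Vec<usize> {
    // WeightedIndex normalizes the weights into a probability distribution.
    let dist = WeightedIndex::new(overlap_scores)
        .expect("weights must be non-empty and non-negative");
    let mut rng = thread_rng();
    (0..n).map(|_| dist.sample(&mut rng)).collect()
}

fn main() {
    // Hypothetical overlap scores for three candidate universe regions.
    let scores = [0.9, 0.5, 0.1];
    let tokens = sample_tokens(&scores, 5);
    println!("{:?}", tokens); // e.g. [0, 0, 1, 0, 2]
}
```

`WeightedIndex` normalizes the weights internally, so raw overlap scores can be passed in directly; if the same distribution gets sampled many times, `rand_distr`'s `WeightedAliasIndex` trades a more expensive setup for O(1) draws.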

I would use it from Python something like this:

```python
import torch

# Import paths are assumed here; the actual module layout may differ.
from gtars.tokenizers import SoftTokenizer
from gtars.models import RegionSet

tokenizer = SoftTokenizer("path/to/universe.bed")
rs = RegionSet("path/to/file.bed")

# Soft-tokenize: sample universe regions with probability proportional
# to their overlap with each query region.
tokens = tokenizer.tokenize(rs)

# Convert the token IDs to a tensor and run them through a model.
x = torch.tensor(tokens.to_ids())

out = model(x)  # `model` is assumed to be defined elsewhere

print(out)
```
nleroy917 commented 11 months ago

From the meeting, it was noted that smaller universe regions would show up more often than large ones: for the same absolute overlap, a smaller region's overlap percentage is always higher (the overlap is divided by a smaller length), so its sampling probability is inflated. See the sketch below for a concrete illustration.
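To make that bias concrete, here is a toy illustration. It assumes the score is the number of overlapping bases divided by the universe region's own length; the actual scoring used here may differ:

```rust
/// Hypothetical overlap score: overlapping bases divided by the
/// universe region's own length.
fn overlap_fraction(overlap_bp: u32, region_len: u32) -> f64 {
    overlap_bp as f64 / region_len as f64
}

fn main() {
    // A query overlapping exactly 100 bp of each universe region:
    let small = overlap_fraction(100, 100);    // 1.0  (100 bp region, fully covered)
    let large = overlap_fraction(100, 10_000); // 0.01 (10 kb region, barely covered)
    println!("small: {small}, large: {large}");
    // Used as sampling weights, these scores make the small region
    // 100x more likely to be drawn despite identical absolute overlap.
}
```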