databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Implement a soft tokenizer #7


nleroy917 commented 11 months ago

It would be cool to implement a soft tokenizer so we can use it in some of our actual models. The soft tokenizer considers the overlap between the query and the universe (vocab): using the overlap scores as a probability distribution, we can randomly sample tokens with replacement.

Here is a Rust crate that will let you sample from distributions: https://docs.rs/rand_distr/latest/rand_distr/
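For example, a weighted draw over candidate universe regions could look something like the sketch below. This is a minimal sketch, assuming overlap scores have already been computed for each candidate region; `sample_tokens` is a hypothetical helper, not an existing gtars function (`WeightedIndex` comes from the `rand` crate, which `rand_distr` builds on):

```rust
use rand::distributions::{Distribution, WeightedIndex};
use rand::thread_rng;

/// Sample `n` token indices (with replacement) from a set of candidate
/// universe regions, weighted by their overlap scores with the query.
fn sample_tokens(overlap_scores: &[f64], n: usize) -> Vec<usize> {
    // WeightedIndex normalizes the weights into a probability distribution.
    let dist = WeightedIndex::new(overlap_scores)
        .expect("weights must be non-empty and non-negative");
    let mut rng = thread_rng();
    (0..n).map(|_| dist.sample(&mut rng)).collect()
}

fn main() {
    // Hypothetical overlap scores for three candidate universe regions.
    let scores = [0.9, 0.5, 0.1];
    let tokens = sample_tokens(&scores, 5);
    println!("{:?}", tokens); // e.g. [0, 0, 1, 0, 2]
}
```

`WeightedIndex` normalizes the weights internally, so raw overlap scores can be passed in directly; if the same distribution gets sampled many times, `rand_distr`'s `WeightedAliasIndex` trades a more expensive setup for O(1) draws.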

I would use it from Python something like this:

```python
import torch

# Import paths are assumed here; the actual module layout may differ.
from gtars.tokenizers import SoftTokenizer
from gtars.models import RegionSet

tokenizer = SoftTokenizer("path/to/universe.bed")
rs = RegionSet("path/to/file.bed")

# Soft-tokenize: sample universe regions with probability proportional
# to their overlap with each query region.
tokens = tokenizer.tokenize(rs)

# Convert the token IDs to a tensor and run them through a model.
x = torch.tensor(tokens.to_ids())

out = model(x)  # `model` is assumed to be defined elsewhere

print(out)
```
nleroy917 commented 11 months ago

From the meeting, it was noted that smaller universe regions would show up more often than large ones: for the same absolute overlap, a smaller region's overlap percentage is always higher (the overlap is divided by a smaller length), so its sampling probability is inflated. See the sketch below for a concrete illustration.
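To make that bias concrete, here is a toy illustration. It assumes the score is the number of overlapping bases divided by the universe region's own length; the actual scoring used here may differ:

```rust
/// Hypothetical overlap score: overlapping bases divided by the
/// universe region's own length.
fn overlap_fraction(overlap_bp: u32, region_len: u32) -> f64 {
    overlap_bp as f64 / region_len as f64
}

fn main() {
    // A query overlapping exactly 100 bp of each universe region:
    let small = overlap_fraction(100, 100);    // 1.0  (100 bp region, fully covered)
    let large = overlap_fraction(100, 10_000); // 0.01 (10 kb region, barely covered)
    println!("small: {small}, large: {large}");
    // Used as sampling weights, these scores make the small region
    // 100x more likely to be drawn despite identical absolute overlap.
}
```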