databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Release `v0.0.13` -- Add fragment file tokenizer #24

nleroy917 closed 1 month ago

nleroy917 commented 1 month ago

This PR adds a new FragmentTokenizer that writes .gtok files directly for the barcoded cells in a fragments.tsv.gz file. It can also filter cells, if you have prior knowledge of which cells are high-quality and which are low-quality. This also drastically speeds up the tokenization process.
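For anyone unsure what tokenizing a fragments file means here, below is a rough pure-Python sketch of the idea. It is not the Rust implementation in this PR. It assumes the fragments file has tab-separated columns (chrom, start, end, barcode, ...), that a token is the line index of a universe region, and that a fragment maps to every universe region it overlaps. The names `load_universe` and `tokenize_fragments_sketch` are made up for illustration, and the sketch returns a dict instead of writing .gtok files.

```python
from collections import defaultdict


def load_universe(lines):
    """Group universe regions by chromosome; the token id is the line index (assumption)."""
    universe = defaultdict(list)  # chrom -> [(start, end, token_id), ...]
    for token_id, line in enumerate(lines):
        chrom, start, end = line.split()[:3]
        universe[chrom].append((int(start), int(end), token_id))
    return universe


def tokenize_fragments_sketch(fragment_lines, universe, cell_filter=None):
    """Collect, per cell barcode, the token ids of universe regions each fragment overlaps."""
    keep = set(cell_filter) if cell_filter is not None else None
    tokens_by_cell = defaultdict(list)
    for line in fragment_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        chrom, start, end, barcode = line.split()[:4]
        if keep is not None and barcode not in keep:
            continue  # drop cells that are not in the filter
        start, end = int(start), int(end)
        # linear scan for clarity; a real implementation would use an interval index
        for r_start, r_end, token_id in universe.get(chrom, []):
            if r_start < end and start < r_end:  # half-open interval overlap
                tokens_by_cell[barcode].append(token_id)
    return dict(tokens_by_cell)


# toy data, invented for this example
universe = load_universe(["chr1\t100\t200", "chr1\t300\t400", "chr2\t50\t150"])
fragments = [
    "chr1\t120\t180\tAAAC\t1",
    "chr1\t350\t420\tAAAC\t2",
    "chr2\t60\t70\tGGGT\t1",
    "chr1\t190\t310\tCCCA\t1",
]

tokens = tokenize_fragments_sketch(fragments, universe, cell_filter=["AAAC", "CCCA"])
print(tokens)  # {'AAAC': [0, 1], 'CCCA': [0, 1]}
```

To try this on a real file, the lines could come from `gzip.open(path, "rt")` instead of the toy list. Filtering before the overlap lookup is the point: fragments from unwanted cells are skipped without doing any interval work.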

I've also added a very simple Python implementation in the bindings:

from genimtools.tokenizers import FragmentTokenizer
from genimtools.utils import read_tokens_from_gtok

# build the tokenizer from a universe BED file
t = FragmentTokenizer("path/to/universe.bed")

# one barcode per line, e.g. ["AATGGTCGTAGA", ... ,"CTAGTGCATGATAC"]
with open("cell_filter.txt", "r") as f:
    cell_filter = f.read().splitlines()

# write .gtok files for the barcodes that pass the filter into out_path
t.tokenize_fragments(
    "path/to/fragments.tsv.gz",
    out_path="gtokens",
    filter=cell_filter,
)

read_tokens_from_gtok("gtokens/AATGGTCGTAGA.gtok") # [42, 101, 999]

nleroy917 commented 1 month ago

This also fixes a critical bug that broke tokenization by producing an incorrect token index count.