databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Implement an `AnnData` tokenizer #14

Open nleroy917 opened 2 months ago

nleroy917 commented 2 months ago

While the TreeTokenizer wrapper in geniml (ITTokenizer) is nice because it is abstract and can tokenize both BED files and AnnData objects, I think it makes more sense to create a separate AnnData tokenizer. That way, we might not need a wrapper in geniml at all and can use the tokenizers directly, with a clean separation of concerns.

It can still use an interval tree internally, but it will explicitly look for AnnData objects instead of BED files.
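
To make the idea concrete, here is a rough Python sketch of what such a tokenizer could look like. Everything in it is hypothetical: the class name `AnnDataTokenizer`, its methods, and the assumption that the AnnData object stores region coordinates in `adata.var` under `chr`, `start`, and `end` columns are illustrations, not the gtars or geniml API. It also uses a sorted list with binary search as a stand-in for a real interval tree, which only works if the universe regions do not overlap within a chromosome.

```python
from bisect import bisect_right
from collections import defaultdict
from types import SimpleNamespace


class AnnDataTokenizer:
    """Hypothetical sketch; not the actual gtars or geniml API."""

    def __init__(self, universe):
        # universe: iterable of (chrom, start, end) half-open regions,
        # assumed non-overlapping within each chromosome.
        # A region's token id is its position in the universe.
        per_chrom = defaultdict(list)
        for token_id, (chrom, start, end) in enumerate(universe):
            per_chrom[chrom].append((start, end, token_id))

        # Sorted parallel lists stand in for an interval tree.
        self._index = {}
        for chrom, regions in per_chrom.items():
            regions.sort()
            self._index[chrom] = (
                [r[0] for r in regions],  # starts
                [r[1] for r in regions],  # ends
                [r[2] for r in regions],  # token ids
            )

    def _overlaps(self, chrom, start, end):
        """Return token ids of universe regions overlapping [start, end)."""
        if chrom not in self._index:
            return []
        starts, ends, ids = self._index[chrom]
        # First universe region whose end lies strictly past the query start.
        i = bisect_right(ends, start)
        hits = []
        while i < len(starts) and starts[i] < end:
            hits.append(ids[i])
            i += 1
        return hits

    def tokenize(self, adata):
        """Map every feature (row of adata.var) to a list of token ids."""
        var = adata.var  # assumed to expose 'chr', 'start', 'end' columns
        return [
            self._overlaps(c, int(s), int(e))
            for c, s, e in zip(var["chr"], var["start"], var["end"])
        ]


# Usage with a stand-in object; a real AnnData whose `var` DataFrame has
# the same three columns would be indexed the same way.
universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 50, 150)]
adata = SimpleNamespace(
    var={
        "chr": ["chr1", "chr1", "chr3"],
        "start": [90, 100, 0],
        "end": [210, 200, 10],
    }
)
tokens = AnnDataTokenizer(universe).tokenize(adata)
print(tokens)  # [[0, 1], [], []]
```

This only maps features to universe tokens. Turning that into per-cell token lists (for example, keeping the tokens of features with nonzero counts in `adata.X`) would be a separate step and is left out here.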