Open gtzheng opened 5 months ago
Hey @gtzheng, thanks for opening this up. I guess I have two thoughts:
One region in the query can end up overlapping two universe regions, so you tokenize a single region and get back two tokens.
One thing I like to do as a sanity check is to use bedtools overlap to make sure the tokenizer is hitting things properly.
Let me know if you have any other thoughts!
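To make the inflation concrete, here is a minimal pure-Python sketch. It does not use the tokenizer's own API; the `overlapping` helper and the toy coordinates are hypothetical, and intervals are assumed to be half-open as in BED files.

```python
def overlapping(query, universe):
    """Return every universe region that overlaps `query` (half-open intervals)."""
    q_chrom, q_start, q_end = query
    return [
        (chrom, start, end)
        for chrom, start, end in universe
        if chrom == q_chrom and start < q_end and q_start < end
    ]


# Toy data (made up for illustration): the first query straddles two universe regions.
universe = [("chr1", 100, 200), ("chr1", 200, 300), ("chr1", 500, 600)]
queries = [("chr1", 150, 250), ("chr1", 510, 520)]

# Collecting all overlaps yields one token per hit, not one token per query.
tokens = [hit for q in queries for hit in overlapping(q, universe)]
print(len(queries), len(tokens))  # 2 3
```

Two input regions produce three tokens here, which is the same effect as the inflated counts reported in this issue.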
I see. Thanks for the reply! It does make sense to see inflated numbers if all overlapped regions are collected. Would it also be possible to return only the most-overlapped region for each input, so the output has the same number of regions as the input?
Yeah, one option is just to return the first region it overlaps. I like the idea of returning the single one it overlaps with the most, however.
In terms of signal and biological meaning, however, I'm not sure which is better...
Perhaps make it an option for users to choose.
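A rough sketch of the "most overlapped" option, assuming half-open BED-style intervals. The helper names `overlap_bp` and `best_match` are hypothetical and are not part of the tokenizer; ties are resolved in favor of the first universe region, which is an arbitrary choice.

```python
def overlap_bp(a, b):
    """Number of base pairs shared by two (chrom, start, end) regions."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def best_match(query, universe):
    """Return the universe region with the largest overlap, or None if nothing overlaps."""
    best = max(universe, key=lambda u: overlap_bp(query, u), default=None)
    if best is None or overlap_bp(query, best) == 0:
        return None
    return best


# Toy data (made up for illustration).
universe = [("chr1", 100, 200), ("chr1", 200, 300), ("chr1", 500, 600)]
query = ("chr1", 150, 230)  # 50 bp in the first region, 30 bp in the second

print(best_match(query, universe))  # ('chr1', 100, 200)
```

With this rule each input region maps to at most one token, so the output never has more regions than the input.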
Would it be possible to get how much two regions overlap? If so, we can use that information for soft tokenization.
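One way the overlap amounts could feed soft tokenization is to turn base-pair overlaps into normalized weights. This is only a sketch of the idea under the assumption of half-open intervals; `overlap_bp` and `soft_tokens` are hypothetical names, not functions the tokenizer provides.

```python
def overlap_bp(a, b):
    """Number of base pairs shared by two (chrom, start, end) regions."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def soft_tokens(query, universe):
    """Map each overlapping universe region to its share of the total overlap."""
    hits = {u: overlap_bp(query, u) for u in universe}
    hits = {u: bp for u, bp in hits.items() if bp > 0}
    total = sum(hits.values())
    return {u: bp / total for u, bp in hits.items()}


# Toy data (made up for illustration).
universe = [("chr1", 100, 200), ("chr1", 200, 300), ("chr1", 500, 600)]
query = ("chr1", 150, 230)  # 50 bp and 30 bp of overlap

weights = soft_tokens(query, universe)
print(weights)  # {('chr1', 100, 200): 0.625, ('chr1', 200, 300): 0.375}
```

The weights for one query sum to 1, so a downstream model could take a weighted combination of the matching tokens instead of counting each one fully.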
Here is the code I used for tokenization:
The number of regions after tokenization seems inflated.