databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml - our genomic machine learning python package.
2 stars 1 forks source link

Inconsistent region numbers #10

Open gtzheng opened 5 months ago

gtzheng commented 5 months ago

Here is the code I used for tokenization:

from geniml.tokenization import ITTokenizer # or any other tokenizer
from geniml.io import RegionSet
import numpy as np

bed_file = '/project/shefflab/data/encode/ENCFF001SMI.bed.gz'
universe_file = '/project/shefflab/data/screen/cCREs.bed'

rs = RegionSet(bed_file)
tokenizer = ITTokenizer(universe_file)

print("number of regions:", len(rs))

all_tokens = tokenizer.tokenize(rs)
print("number of regions after tokenization:", len(all_tokens))

sel_indexes = np.random.permutation(len(rs))[0:200]
sel_rs = RegionSet([rs[i] for i in sel_indexes])
print("number of selected regions:", len(sel_rs))
tokens = tokenizer.tokenize(sel_rs)
print("number of selected regions after tokenization:", len(tokens))

The numbers of regions after tokenization seem inflated.

number of regions: 1045378
number of regions after tokenization: 1055185
number of selected regions: 200
number of selected regions after tokenization: 201
nleroy917 commented 5 months ago

Hey @gtzheng Thanks for opening this up. I guess I have two thoughts:

  1. There definitely could be a bug or quirk and I'd love to investigate the tokenizers more to ensure they are working properly,
  2. One reason you might experience an inflated amount of regions is because when tokenizing... I collect all regions it overlaps... if that makes sense. Here is a graphic: image

One region in the query ends up overlapping with two universe regions so you will tokenize a single region and get back two tokens.

One thing I like to do as a sanity check is to use bedtools overlap to make sure the tokenizer is hitting things properly.

Let me know if you have any other thoughts!

gtzheng commented 5 months ago

I see. Thanks for the rely! It does make sense to see inflated numbers if all overlapped regions are collected. Would it also be possible to return the most overlapped regions, so the output has the same number of regions as the input?

nleroy917 commented 5 months ago

Yeah, one option is just to return the first region it overlaps. I like the idea of returning the single one. It overlaps with the most, however.

In terms of signal, biological meaning however - I'm not sure which is better...

gtzheng commented 5 months ago

Perhaps make it an option for users to choose.

Would it be possible to get how much two regions overlap? If so, we can use that information for soft tokenization.