Best way to visualized "bias-corrected" insertion counts?

maxdudek commented 5 months ago

Hi!

I would like to make a visualization of my insertions counts at a certain region like this:

However, I want the number of insertions to be corrected based on the Tn5 bias calculated by the CNN. What is the best way to do this using PRINT? My first thought was to divide the insertion count at each bp by the bias at that bp, but I wanted to make sure that makes sense. So for my first peak/region, with length 1000, it would look something like this:

PEAK_NUM <- 1

# Load insertion count data for peak #1
chunk_1 <- readRDS("data/HepG2/chunkedCountTensor/chunk_1.rds")
peak_1_df <- chunk_1[[PEAK_NUM]]

# Convert data frame into vector of insertion counts
raw_insertions <- rep(0, 1000)
raw_insertions[peak_1_df$position] <- peak_1_df$count

# Correct for bias?
bias <- regionBias(project)[PEAK_NUM,]
corrected_insertions <- raw_insertions / bias

EDIT: my second, probably better thought, is to calculate the "expected insertions" using the bias and subtract that from the actual insertions. Does this make sense? Not sure if I'm interpreting the bias correctly

expected_insertions <- bias/sum(bias) * sum(raw_insertions)

corrected_insertions <- raw_insertions - expected_insertions

HYsxe commented 5 months ago

Hi Max,

This is a very thoughtful question and we have debated about this in our lab too! My current take is that I would avoid dividing by bias since sometimes you can have very low bias values like 0.01. When we call footprints, values like 0.01 and 0.005 will give you similar footprint results, but if you use them as denominator for visualizing counts it can scale your values by several folds and the visualization will be very confusing. This is the reason why if you look at our code we never divide any count by bias, but always compare insertions to an expected distribution. Hence I think your second idea is better. One thing to note is that when we trained our bias model the bias is in fact relative bias within a local window (+/-50 bp) so you might want to match this when you calculate the expectation.

Yan

maxdudek commented 5 months ago

Thank you! I'll go with the second one.

HYsxe / PRINT

Best way to visualized "bias-corrected" insertion counts? #19