Accounting for batch effects

3DGenomes / binless

Resolution-independent normalization of Hi-C data

GNU Lesser General Public License v3.0


Accounting for batch effects #14

Open bontus opened 5 years ago

bontus commented 5 years ago

Hi, I was wondering whether there is a way to include batches in the binless analyses. I have 4 conditions: two have 3 replicates (control and treatment) and two have only 2 replicates (control + inhibitor and treatment + inhibitor). We are interested in detecting differences that are induced by treatment and depend on the inhibitor, but we have already noticed that one of our replicate batches clusters separately (globally, the same changes are still visible though). Any advice is greatly appreciated! Best regards

yannickspill commented 5 years ago

It depends on what exactly you call batch effects. Binless already accounts for some of them by default. However, could you explain how you see the batch effect, i.e. does it affect the diagonal decay, the biases, etc.?

bontus commented 5 years ago

The decay values are indeed different (i.e. smaller in batch 3 than in batches 1 and 2), and I mainly noticed the differences in downstream calculations when looking at TAD borders and compartment strength. However, I realize that my question was somewhat arbitrary, as I am mostly interested in accounting for batch effects during the difference test implemented in binless. Basically, my question could be translated to: can detect_binless_differences() use pairing information (akin to a paired t-test)? read_and_prepare() does provide the replicate parameter, but I did not see any other function make use of it. Best

yannickspill commented 5 years ago

In general, detect_binless_differences pairs the samples, so acts like a paired t-test, albeit more complicated because it takes into account the neighborhood of each pixel. In that sense, batch effects are already accounted for.
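To make the paired t-test analogy above concrete, here is a minimal Python sketch with made-up numbers. It is not what binless computes internally (binless also uses the neighborhood of each pixel). It only shows why pairing can absorb a batch-wide shift: the hypothetical third batch is lower in both conditions, which inflates the variance of an unpaired comparison but cancels out in the per-batch differences.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def welch_t(a, b):
    """Welch t statistic for two independent samples (pairing ignored)."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical signal values, one per batch; batch 3 is shifted down
# in both conditions (illustrative numbers, not real Hi-C data).
control = [10.0, 10.5, 6.0]
treatment = [11.0, 11.6, 6.9]

t_paired = paired_t(treatment, control)
t_welch = welch_t(treatment, control)
print(f"paired t = {t_paired:.2f}, unpaired (Welch) t = {t_welch:.2f}")
```

With these numbers the paired statistic is much larger than the unpaired one, because the batch shift affects both members of each pair equally.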

The replicate parameter in read_and_prepare serves essentially to have a different name for each sample. If you want to model a different decay, you could adapt the condition or enzyme fields of read_and_prepare, and then play with the different.decays argument of merge_cs_norm_datasets

Also, in difference detection, did you group your datasets before, or did you call differences in each dataset individually?

bontus commented 5 years ago

> Also, in difference detection, did you group your datasets before, or did you call differences in each dataset individually?

I grouped them after normalization and before calling detect_binless_interactions().

> The replicate parameter in read_and_prepare serves essentially to have a different name for each sample. If you want to model a different decay, you could adapt the condition or enzyme fields of read_and_prepare, and then play with the different.decays argument of merge_cs_norm_datasets

Alright, I will give that a try.

> In general, detect_binless_differences pairs the samples, so acts like a paired t-test, albeit more complicated because it takes into account the neighborhood of each pixel. In that sense, batch effects are already accounted for.

That's great to hear, but I am still wondering which information is used to pair the samples if it is not explicitly provided by the user?

yannickspill commented 5 years ago

> I am still wondering which information is used to pair the samples if it is not explicitly provided by the user?

For difference detection, data is grouped by square bins of size base.res, and compared two by two, taking into account neighbour information. That is done automatically, and does not require user input. A stricter pairing, in the sense of a patient before and after treatment, would not make sense anyway in this context.
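As a rough illustration of the grouping described above, the following Python sketch assigns contacts to square bins of side `base_res` and compares two datasets bin by bin. It is a simplified assumption of the idea, not binless code: the function names and contact tuples are hypothetical, and the neighbour information and statistical model that binless uses are left out.

```python
from collections import defaultdict


def group_by_square_bins(contacts, base_res):
    """Sum (pos1, pos2, count) contacts into square bins of side base_res."""
    bins = defaultdict(int)
    for pos1, pos2, count in contacts:
        # Integer division maps each genomic position to its bin index.
        bins[(pos1 // base_res, pos2 // base_res)] += count
    return dict(bins)


def bin_differences(bins_a, bins_b):
    """Per-bin count difference (b minus a) over all bins seen in either dataset."""
    keys = sorted(set(bins_a) | set(bins_b))
    return {k: bins_b.get(k, 0) - bins_a.get(k, 0) for k in keys}


# Hypothetical contacts: (position 1, position 2, read count).
base_res = 5000
contacts_a = [(1000, 7000, 2), (4000, 9000, 1), (12000, 18000, 3)]
contacts_b = [(2000, 6000, 5), (11000, 16000, 1)]

bins_a = group_by_square_bins(contacts_a, base_res)
bins_b = group_by_square_bins(contacts_b, base_res)
diff = bin_differences(bins_a, bins_b)
print(diff)
```

Because the bin index depends only on genomic position and `base_res`, the same bin lines up across datasets without any pairing information from the user.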