Struggling to interpret `hic_compare()` results table..

serine commented 4 years ago

Hi there,

for some reason I'm struggling to interpret the hic_compare() results table. If we just look at your example of these two cell lines

data("HMEC.chr22")
data("NHEK.chr22")

what we have are two different cell lines, and we are trying to figure out which regions of the chromosome differ between them, right? I'm used to working with expression data, where interpretation is "straightforward": a gene has gone up or down relative to another sample (baseline). Looking at hic.table, however, I'm not sure what the interpretation should be. Below I'm showing three "differential regions" from the example data set, and I don't know whether I should interpret each row as region1 (start1:end1) being "very" different, in terms of number of contacts, from region2 (start2:end2). Taking the second row below, is region1 (19000000:19500000) in the HMEC cell line very different from region2 (48000000:48500000) in the NHEK cell line? But those are two regions 29 Mb apart, in different cell lines, and I'm not sure why you would expect them to be the same. I guess the results table I was expecting is a single column of regions that are different in the NHEK cell line relative to the baseline (HMEC cell line). I'm sure I'm just not getting Hi-C data yet and would appreciate some help here, thanks

 hic.table %>% as_tibble() %>% filter(p.adj < 0.05) %>% arrange(start1, end1) %>% select(chr1, start1, end1, start2, end2) %>% head(n = 3)
# A tibble: 3 x 5
  chr1    start1     end1   start2     end2
  <chr>    <int>    <int>    <int>    <int>
1 chr22 16500000 17000000 17000000 17500000
2 chr22 19000000 19500000 48000000 48500000
3 chr22 19500000 20000000 48000000 48500000

mdozmorov commented 4 years ago

This is a perfect question for the BioC2020 workshop, https://github.com/mdozmorov/HiCcompareWorkshop. I hope you can join; the data format and hic.table will be discussed there.

As immediate advice: if you look at all the columns (don't use select), the table will make more sense. See the workshop intro about data formats, https://mdozmorov.github.io/HiCcompareWorkshop/articles/hic_tutorial.html, and the HiCcompare vignette itself, https://www.bioconductor.org/packages/release/bioc/vignettes/HiCcompare/inst/doc/HiCcompare-vignette.html
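To make the row layout concrete, here is a minimal base R sketch of how two sparse contact lists line up into one row per bin pair. It is not HiCcompare's own code: the counts are made up, and the names `sample1`, `sample2` and `pairs` are hypothetical. D is taken as the distance off the diagonal in bins and M as the log2 ratio of the two interaction frequencies, which is how those columns are used in this thread.

```r
# Toy sparse upper-triangular contact lists for two samples (made-up counts,
# 500 kb bins). Each row is one PAIR of bins, i.e. one cell of the contact matrix.
resolution <- 500000
sample1 <- data.frame(region1 = c(0, 0, 500000),
                      region2 = c(0, 500000, 500000),
                      IF      = c(100, 20, 80))
sample2 <- data.frame(region1 = c(0, 0, 500000),
                      region2 = c(0, 500000, 500000),
                      IF      = c(110, 80, 75))

# Join on the bin pair so the same matrix cell from both samples shares a row.
# The suffixes turn the two IF columns into IF1 and IF2.
pairs <- merge(sample1, sample2, by = c("region1", "region2"),
               suffixes = c("1", "2"))

# D: how many bins the pair sits off the diagonal.
pairs$D <- (pairs$region2 - pairs$region1) / resolution
# M: log2 fold change of the contact frequency for that pair, sample2 vs sample1.
pairs$M <- log2(pairs$IF2 / pairs$IF1)

print(pairs)
```

Read this way, a row does not compare region1 in one cell line with region2 in the other. It compares the contact frequency between region1 and region2 in sample 1 with the contact frequency between the same two regions in sample 2.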

serine commented 4 years ago

@mdozmorov thanks for the pointer. I think I need more time to understand this and perhaps work through your multiHiCcompare example here

There is something magical about the D value (distance off the diagonal) and the differences between IFs (M value)... I think I understand that hic.table reports two regions that have a high M with respect to D (right?) as "interesting". I just don't get why one region in one sample should have an effect on a different region in another sample. I think hic.table is telling me that those two regions in those two samples are "interesting" and that I should look at each of them independently. Can I just look at either region1 or region2 and ignore the other?

By the way, yes I'll be joining the conference and quite looking forward to the whole event and your workshop. Your workshop will be around 1 in the morning in my time zone, not too bad actually :D

Basically I'm working towards exactly those few things that you mention in your workshop: overlap with genes and promoters, and enrichment testing. This is why I'm asking which regions to use for gene annotation (I will work through your examples).

thanks

serine commented 4 years ago

@mdozmorov I'm not sure why, but it was pretty hard to figure this out

> hic.table %>% filter(p.adj < 0.05) %>% arrange(start1, end1) %>% slice(15)
    chr1   start1     end1  chr2   start2     end2  IF1      IF2 D        M  adj.IF1  adj.IF2    adj.M        mc       A        Z     p.value      p.adj
1: chr22 48000000 48500000 chr22 48500000 49000000 2183 7019.283 1 1.685012 2269.657 6751.283 1.572687 0.1123242 4510.47 3.285699 0.001017295 0.03359926

> HMEC.chr22 %>% filter(region1 == 48000000, region2 == 48500000)
   region1  region2   IF
1 48000000 48500000 2183

> NHEK.chr22 %>% filter(region1 == 48000000, region2 == 48500000)
   region1  region2   IF
1 48000000 48500000 8094

I think I finally get hic.table: this region, 48000000-48500000, is differential between the two cell lines.

I'm still a little confused as to why end2 == 49000000; I'd have expected end2 == 48500000. But I guess this isn't as important, thanks
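The columns of the row shown above can be checked by hand, which also accounts for end2. The sketch below copies the values from that row into a hypothetical list called `hit` and recomputes D, M, adj.M and A in base R. The formulas are inferred from the numbers in the row, not taken from the package source. The filter `region1 == 48000000, region2 == 48500000` selects bins by their start coordinate, so the second bin runs from 48500000 to 48500000 plus the 500 kb bin size, which is 49000000.

```r
# Values copied from the hic.table row above (chr22).
hit <- list(start1 = 48000000, end1 = 48500000,
            start2 = 48500000, end2 = 49000000,
            IF1 = 2183, IF2 = 7019.283,
            adj.IF1 = 2269.657, adj.IF2 = 6751.283)

# Bin size implied by the first region.
resolution <- hit$end1 - hit$start1

# The second bin starts at region2 and is one bin wide, hence end2 == 49000000.
end2_check <- hit$start2 + resolution

# D: distance between the two bins, in units of bins (adjacent bins give 1).
D <- (hit$start2 - hit$start1) / resolution

# M: log2 ratio of the interaction frequencies, before and after adjustment.
M     <- log2(hit$IF2 / hit$IF1)
adj.M <- log2(hit$adj.IF2 / hit$adj.IF1)

# A: average of the two adjusted interaction frequencies.
A <- (hit$adj.IF1 + hit$adj.IF2) / 2

print(c(end2 = end2_check, D = D, M = M, adj.M = adj.M, A = A))
```

So the row describes the contact between two adjacent bins, 48.0-48.5 Mb and 48.5-49.0 Mb, and that contact is what differs between the two cell lines. IF2 in the table (7019.283) is lower than the raw NHEK count (8094); this is presumably due to scaling applied when the table is built, but that is an assumption, as the thread does not say.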

dozmorovlab / HiCcompare

Struggling to interpret `hic_compare()` results table.. #17