4dn-dcic / hic2cool

Lightweight converter between hic and cool contact matrices.
MIT License

how to output normalization counts? #22

Closed zhixuqiu closed 5 years ago

zhixuqiu commented 5 years ago

Hi, I converted my Hi-C matrix from hic format to cool format, but the counts in the new cool file are the original read counts, not the normalized values. Could you tell me how to output normalized counts? My command is:

hic2cool ourhicmat.hic ourhic10kb -r 10000
cooler dump -t pixels --header --join -r 1:1-100000 -r2 1 ourhic10kb.cool | head
chrom1  start1  end1    chrom2  start2  end2    count
1   0   10000   1   0   10000   194
1   0   10000   1   10000   20000   127
1   0   10000   1   20000   30000   211
1   0   10000   1   30000   40000   124
1   0   10000   1   40000   50000   38
1   0   10000   1   50000   60000   33
1   0   10000   1   60000   70000   13
1   0   10000   1   70000   80000   21
1   0   10000   1   80000   90000   16

DoaneAS commented 5 years ago

I'm also wondering if hic2cool applies matrix balancing normalization using cooler. I generated a .multi.cool file from a .hic file, and it includes all the normalizations that were present in the .hic file, plus a normalization called "default" which was not in the .hic file.

sergpolly commented 5 years ago

Cooler stores raw bin counts and applies balancing weights "on the fly" as needed. Balancing weights are stored as part of the bins table. Check the cooler schema for reference: https://cooler.readthedocs.io/en/latest/schema.html#data-collection

For example, balancing is automatically applied when viewing an .mcool file in HiGlass, or when fetching snippets of the matrix using the cooler Python API: https://cooler.readthedocs.io/en/latest/api.html#cooler.Cooler.matrix

Thus the hic2cool behaviour is what you'd expect. Also beware that .hic files usually carry a bunch of different balancing weights, so you'd get all of them in your cooler. By default, a default one is used. To use some other one, look for the specific option in the cooler CLI commands, or pass something like balance="weightcol_name" when fetching snippets of a matrix in your custom Python scripts.
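To make the "on the fly" idea concrete, here is a minimal pure-Python sketch. It is not cooler's actual code: the toy `bins` and `pixels` tables, all the numbers, and the helper name `apply_weights` are made up for illustration, and the `KR` column is simply treated as multiplicative here. Only the idea (raw counts stored once, weights looked up per bin by column name) comes from the explanation above.

```python
# Toy stand-ins for a cooler's tables (all values are made up).
# bins: one entry per genomic bin, one column per stored weight vector.
bins = {
    "weight": [0.5, 0.25, float("nan")],  # NaN marks a bin with no usable weight
    "KR": [2.0, 4.0, 1.0],                # treated as multiplicative in this toy
}
# pixels: raw counts stored as (bin1_id, bin2_id, count).
pixels = [(0, 0, 194), (0, 1, 127), (1, 2, 38)]


def apply_weights(pixels, bins, weight_name="weight"):
    """Return (bin1, bin2, raw count, balanced value) rows.

    The raw counts are left untouched; the balanced value is computed on
    demand as count * w[bin1] * w[bin2] from the chosen weight column.
    """
    w = bins[weight_name]
    return [(i, j, c, c * w[i] * w[j]) for i, j, c in pixels]


default_rows = apply_weights(pixels, bins)    # uses the "weight" column
kr_rows = apply_weights(pixels, bins, "KR")   # selects another column by name

# With the real library, the corresponding call would look roughly like
# clr.matrix(balance="KR").fetch(region), where clr is a cooler.Cooler object.
```

Any pixel touching a bin whose weight is NaN comes out as NaN, which is why masked bins show up as empty stripes when a balanced view is requested.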

Hope this helps

carlvitzthum commented 5 years ago

Thank you @sergpolly for the explanation.

@DoaneAS, could you send me the command you used to interrogate your hic2cool .multi.cool file? I don't see a normalization named "default" in the bins table of my output files, just the weights corresponding to the .hic normalization vectors (KR, VC, VC_SQRT...)

nvictus commented 5 years ago

Hi @DoaneAS. As @sergpolly pointed out, all normalization vectors are stored in the bin table, separately from the raw counts.

@carlvitzthum, I believe the "default" comes from HiGlass's transforms menu. HiGlass will default to applying the normalization vector called weight if it exists (listed as ICE in HiGlass), otherwise the "default" normalization should be "None", i.e. raw counts.

weight is the default name of the output from running cooler balance, which does standard matrix balancing. This is the same as KR, and the results should be about the same up to bin-level filtering and the value scale: the default normalization done by cooler is genome-wide and rescales the weights such that the marginals sum to 1. However, all of this can be customized.
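As a rough illustration of what "matrix balancing with marginals that sum to 1" means, here is a small pure-Python sketch of iterative correction on a dense toy matrix. It is an assumption-laden simplification, not the code behind cooler balance: it does no bin filtering, does not ignore any diagonals, assumes every bin has nonzero coverage, and the function names `marginals` and `balance_weights` are hypothetical.

```python
def marginals(matrix, w):
    """Row sums of the balanced matrix, where balanced[i][j] = matrix[i][j] * w[i] * w[j]."""
    n = len(matrix)
    return [sum(matrix[i][j] * w[i] * w[j] for j in range(n)) for i in range(n)]


def balance_weights(matrix, max_iter=200, tol=1e-10):
    """Find multiplicative weights that equalize the marginals of a symmetric matrix.

    Assumes a dense, symmetric matrix in which every row has a nonzero sum.
    """
    n = len(matrix)
    w = [1.0] * n
    for _ in range(max_iter):
        marg = marginals(matrix, w)
        mean = sum(marg) / n
        # Relative deviation of each marginal from the mean marginal.
        scale = [m / mean for m in marg]
        w = [wi / si for wi, si in zip(w, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    # Rescale the weights so that every marginal is (approximately) 1.
    mean = sum(marginals(matrix, w)) / n
    return [wi / mean ** 0.5 for wi in w]


# Made-up symmetric contact counts for three bins.
counts = [
    [10, 4, 1],
    [4, 8, 3],
    [1, 3, 6],
]
weights = balance_weights(counts)
balanced_marginals = marginals(counts, weights)
```

After convergence, every row of the balanced matrix sums to about 1, which is the value scale described above; a different convention would only change the final rescaling step.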

One notable difference is that cooler uses multiplicative weights, so balanced = count * weight1 * weight2. hic vectors are divisive biases, so balanced = count / bias1 / bias2. hic2cool currently inverts the hic biases into multiplicative weights. This is unfortunately inconsistent with HiGlass, which expects KR, VC, and VC_SQRT to be divisive, not multiplicative. So if KR, etc. look funny in HiGlass, that would be the issue.
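The two conventions can be compared with a short pure-Python sketch. The bias values and the helper names below are made up for illustration; the point is only that inverting a divisive bias gives an equivalent multiplicative weight, and that reading an inverted vector with the wrong convention gives very different numbers.

```python
import math


def invert_vector(values):
    """Turn divisive biases into multiplicative weights, or the reverse.

    Entries that are zero or NaN cannot be inverted and become NaN.
    """
    out = []
    for v in values:
        if v == 0 or math.isnan(v):
            out.append(float("nan"))
        else:
            out.append(1.0 / v)
    return out


def balance_divisive(count, bias1, bias2):
    """hic convention: divide the raw count by both biases."""
    return count / bias1 / bias2


def balance_multiplicative(count, weight1, weight2):
    """cooler convention: multiply the raw count by both weights."""
    return count * weight1 * weight2


kr_bias = [2.0, 4.0, 0.5]             # made-up divisive biases
kr_weight = invert_vector(kr_bias)    # the same vector as multiplicative weights

count = 128
divisive = balance_divisive(count, kr_bias[0], kr_bias[1])
multiplicative = balance_multiplicative(count, kr_weight[0], kr_weight[1])

# Reading inverted (multiplicative) values as if they were divisive biases:
mismatched = balance_divisive(count, kr_weight[0], kr_weight[1])
```

Here `divisive` and `multiplicative` agree, while `mismatched` is off by the square of each bias, which is the kind of distortion that would make KR look odd in a viewer expecting divisive values. Applying `invert_vector` a second time recovers the original biases.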

FYI, @carlvitzthum will be deprecating this inversion behavior in the next version of hic2cool and providing a way to re-extract the hic norms as divisive ones. In the meantime, you can try re-inverting those columns back manually in Python using h5py (I can send you a short script to do this), or just run cooler balance on each zoom level.

carlvitzthum commented 5 years ago

I have just released version 0.5.0, which deprecates the inversion of hic vectors in the output cooler files. Please run hic2cool update <cooler filepath> to get your files caught up. See the docs for more info.

Closing this issue. Please open a new one if something comes up when using the new version.