deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

How to remove inter-chromosomal interactions from matrix #704

Closed cgirardot closed 3 years ago

cgirardot commented 3 years ago

Dear all, this is not really an issue, but I cannot find a way to filter out the inter-chromosomal interactions from an h5/cool matrix. What I would like to do is run hicNormalize on a list of matrices, scaling to the smallest count, but taking only the intra-chromosomal interactions into account. Since hicNormalize does not have this option, I tried to find a way to set the inter-chromosomal interactions to 0 before running hicNormalize. I looked into hicAdjustMatrix but did not manage to do it.

Is there a way to do this?

On a more general note, do people run hicNormalize with all the interactions? I thought the rate of inter-chromosomal interactions was a kind of proxy for quality and varies between samples, so I was expecting such interactions to be removed before normalization (and balancing?). Any hint would be appreciated.

joachimwolff commented 3 years ago

> Dear all, this is not really an issue, but I cannot find a way to filter out the inter-chromosomal interactions from an h5/cool matrix. What I would like to do is run hicNormalize on a list of matrices, scaling to the smallest count, but taking only the intra-chromosomal interactions into account. Since hicNormalize does not have this option, I tried to find a way to set the inter-chromosomal interactions to 0 before running hicNormalize. I looked into hicAdjustMatrix but did not manage to do it.
>
> Is there a way to do this?

As far as I can see at the moment, this is a missing feature of HiCExplorer. I will add it to our to-do list. A workaround in the meantime could be to generate one matrix per chromosome, get the sum of its values via hicInfo, calculate the correction factor for each matrix from these sums, and run hicNormalize with the user-given correction factors.
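The arithmetic behind the workaround above is small once the per-sample sums are known. Here is a minimal sketch, assuming each matrix has been dumped as text with `cooler dump --join` (columns: chrom1, start1, end1, chrom2, start2, end2, count). The helper names `cis_sum` and `scale_factor` and the toy data are hypothetical, and how the resulting factor is handed to hicNormalize is deliberately left open:

```shell
#!/bin/sh
# Hypothetical helper: sum the counts of intra-chromosomal pixels read from stdin
# (7-column, tab-separated text in the layout printed by `cooler dump --join`).
cis_sum() {
    awk -F '\t' '$1 == $4 { s += $7 } END { printf "%.0f\n", s }'
}

# Hypothetical helper: factor that scales one total down to the smallest total.
scale_factor() {  # usage: scale_factor <smallest_sum> <this_sum>
    awk -v min="$1" -v sum="$2" 'BEGIN { printf "%.4f\n", min / sum }'
}

# Toy stand-ins for two dumped matrices (made-up numbers, not real data).
sample_a=$(printf 'chr2L\t0\t10000\tchr2L\t0\t10000\t30\nchr2L\t0\t10000\tchr3R\t0\t10000\t5\nchr3R\t0\t10000\tchr3R\t10000\t20000\t10\n')
sample_b=$(printf 'chr2L\t0\t10000\tchr2L\t0\t10000\t60\nchr2L\t0\t10000\tchr3R\t0\t10000\t40\nchr3R\t0\t10000\tchr3R\t10000\t20000\t20\n')

# Intra-chromosomal totals; the chr2L/chr3R rows are ignored.
sum_a=$(printf '%s\n' "$sample_a" | cis_sum)
sum_b=$(printf '%s\n' "$sample_b" | cis_sum)

# Smallest intra-chromosomal total across the samples.
smallest=$(printf '%s\n%s\n' "$sum_a" "$sum_b" | sort -n | head -n 1)

echo "sample_a: cis=$sum_a factor=$(scale_factor "$smallest" "$sum_a")"
echo "sample_b: cis=$sum_b factor=$(scale_factor "$smallest" "$sum_b")"
```

On real data, the toy variables would be replaced by something like `cooler dump --join sample.cool | cis_sum`, which avoids splitting the matrix per chromosome first.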

> On a more general note, do people run hicNormalize with all the interactions? I thought the rate of inter-chromosomal interactions was a kind of proxy for quality and varies between samples, so I was expecting such interactions to be removed before normalization (and balancing?). Any hint would be appreciated.

You cannot make a general statement here: which regions should be taken into account and which should be removed depends on the quality of your Hi-C data and also on the organism. You cannot simply remove the inter-chromosomal interactions by default. However, if you care about the quality of your Hi-C data, it is better to work on the data generation than on the algorithms. Use Arima Hi-C or similar methods with two or more restriction enzymes rather than older methods with only one.

That said, removing certain inter-chromosomal regions will not change the result of a normalization much. If the wet-lab protocol is the same, the noise ratio should be very similar between samples. But I am more of an algorithm guy than an analysis expert, so maybe I am wrong on this. I will run some tests and come back to you. Maybe @LeilyR or @lldelisle can comment on this.

lldelisle commented 3 years ago

Hi, I also had an issue with one sample, processed at the same time as the others, that had many more trans contacts than another one (biology...). I manually removed the trans interactions from the matrix, and it improved the results. You can remove the trans interactions from a cool matrix this way:

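# ${my_sizes} is the chromosome sizes file and ${bin} the bin size in kb.
# Keep only the rows whose two ends lie on the same chromosome (column 1 == column 4).
cooler dump --join "${input_mat}" | awk '$1==$4' | cooler load --format bg2 "${my_sizes}:${bin}000" - "${filtered_cool}"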
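To check what such a filter removes, it can help to total the cis and trans counts before and after filtering. This is a minimal sketch on made-up pixels in the `cooler dump --join` column layout; the helper name `cis_trans_summary` is hypothetical and not part of cooler or HiCExplorer:

```shell
#!/bin/sh
# Hypothetical helper: report the cis and trans totals of a 7-column,
# tab-separated pixel stream (chrom1 start1 end1 chrom2 start2 end2 count).
cis_trans_summary() {
    awk -F '\t' '
        $1 == $4 { cis += $7; next }   # both ends on the same chromosome
                 { trans += $7 }       # ends on different chromosomes
        END { printf "cis=%.0f trans=%.0f\n", cis, trans }'
}

# Toy pixels standing in for real `cooler dump --join` output.
pixels=$(printf 'chr2L\t0\t10000\tchr2L\t10000\t20000\t12\nchr2L\t0\t10000\tchrX\t0\t10000\t3\nchrX\t0\t10000\tchrX\t0\t10000\t7\n')

before=$(printf '%s\n' "$pixels" | cis_trans_summary)
# Same condition as in the one-liner above: keep rows whose chromosomes match.
after=$(printf '%s\n' "$pixels" | awk -F '\t' '$1 == $4' | cis_trans_summary)

echo "before: $before"
echo "after:  $after"
```

The same function can be fed from `cooler dump --join` on the input and on the filtered matrix; the cis total should be unchanged and the trans total should drop to 0.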

cgirardot commented 3 years ago

Thank you @joachimwolff and @lldelisle for your answers. It helps to know that removing the inter-chromosomal reads makes sense (at least when one does not expect this variation to be biological). I had also figured out the one-liner with cooler, but I was trying to stick to Galaxy and include this filtering in my workflow.

I definitely think adding this feature makes sense. Maybe 2 features would actually make sense:

For the moment, it seems easier to do this outside Galaxy. In case this is useful to other readers, this is how one can sum up the reads of interactions that fall in a list of captured regions:

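# Sum the counts of interactions whose first end (columns 1-3) or second end
# (columns 4-6) overlaps a captured region. Each branch prints its own sum into
# the pipe, and the final awk adds the two sums together.
cooler dump --join "$matrice.cool" | tee \
    >(bedtools intersect -a stdin -b "$regions.bed" -wa | awk '{ s += $7 } END { print s }') \
    >(cut -f 4,5,6,7 | bedtools intersect -a stdin -b "$regions.bed" -wa | awk '{ s += $4 } END { print s }') \
    > /dev/null \
    | awk '{ s += $1 } END { print s }'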
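The pipeline above adds the per-end sums, so an interaction with both ends inside captured regions contributes twice. If each interaction should count once, an awk-only variant is possible. This is a sketch under assumptions: the helper name `captured_sum` and the toy data are hypothetical, the regions are a plain 3-column BED file, and the linear scan over regions is only reasonable for a short list of captured regions:

```shell
#!/bin/sh
# Hypothetical helper: sum the counts of pixels with at least one end overlapping
# a region of a BED file; every pixel is counted once.
captured_sum() {  # usage: <pixel stream> | captured_sum regions.bed
    awk -F '\t' '
        # First input (the BED file): remember the regions (0-based, half-open).
        NR == FNR { n++; rchrom[n] = $1; rstart[n] = $2 + 0; rend[n] = $3 + 0; next }
        # Second input (stdin): test both ends of every pixel against the regions.
        {
            hit = 0
            for (i = 1; i <= n && !hit; i++) {
                if ($1 == rchrom[i] && $2 + 0 < rend[i] && $3 + 0 > rstart[i]) hit = 1
                if ($4 == rchrom[i] && $5 + 0 < rend[i] && $6 + 0 > rstart[i]) hit = 1
            }
            if (hit) s += $7
        }
        END { printf "%.0f\n", s }' "$1" -
}

# Toy captured region and pixels (made-up numbers, not real data).
regions=$(mktemp)
printf 'chr2L\t10000\t30000\n' > "$regions"
pixels=$(printf 'chr2L\t0\t10000\tchr2L\t10000\t20000\t4\nchr2L\t10000\t20000\tchr2L\t20000\t30000\t6\nchr2L\t0\t10000\tchr3R\t0\t10000\t9\nchr2L\t20000\t30000\tchr3R\t0\t10000\t2\n')

total=$(printf '%s\n' "$pixels" | captured_sum "$regions")
rm -f "$regions"
echo "reads with at least one end in a captured region: $total"
```

Whether double counting or single counting is the right basis for the normalization factors depends on the analysis; the sketch only makes the difference explicit.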

cgirardot commented 3 years ago

Also, to give you a bit of background: in one project, we do standard Hi-C over a developmental time course (fly), and there I am not so sure that the observed inter-chromosomal differences are purely technical. I will try excluding the inter-chromosomal interactions to see. In another project, we do capture Hi-C (on large regions) in different conditions, and there I want to consider the reads with one end falling in the captured regions to compute the normalization factors.