dozmorovlab / HiCcompare

Joint normalization of two Hi-C matrices, visualization and detection of differential chromatin interactions. See multiHiCcompare for the analysis of multiple Hi-C matrices
https://dozmorovlab.github.io/HiCcompare/
Other
18 stars 3 forks source link

High FPR #16

Closed csijcs closed 4 years ago

csijcs commented 4 years ago

Great package. I seem to be getting an extremely high false positive rate, and extremely low true positive rate when I run the filter_params function to determine filtering cutoffs. Its not clear to me why this is. I tried different resolutions and see the same effect. It's almost the opposite of what I would expect (or what I was hoping for). Do you have any idea what would cause this?

image

mdozmorov commented 4 years ago

It is highly unusual. Extremely sparse data, perhaps? Without seeing the MD plots it's hard to judge. Suggesting trying multiHiCcompare - it simplifies differential analysis. @jstansfield0, chip in if you have some ideas

csijcs commented 4 years ago

Thanks for the prompt reply. I don't think it's data sparsity, if anything maybe too much noise? The MD plot below is the same chromosome and resolution. I'll give mutiHiCcompare a try.

image

mdozmorov commented 4 years ago

The MD plot looks as expected, so are the significant differences. I don't understand why your A plot looks so weird. If you would like to share chr22 matrices for a couple of samples, we'll look into it. In a meantime, I recommend multiHiCcompare

csijcs commented 4 years ago

I previously uploaded the figure for chr1. I plotted chr22 and it looked pretty normal. Chr21 a bit less so. Then with chr20 I got some errors that I didn't notice before (but indeed to occur with the larger chrs):

Warning messages: 1: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : pseudoinverse used at 22 2: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : neighborhood radius 1 3: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : reciprocal condition number 0 4: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : There are other near singularities as well. 1

This repeats over many times with different values, and eventually becomes:

32: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : zero-width neighborhood. make span bigger

I can upload the chr20 hic.table if that works for you.

Also, I should have mentioned before, I'm using R v4.0.1. Perhaps something with the new version is not compatible?

mdozmorov commented 4 years ago

"make span bigger" should help. In hic_loess.R, set span parameter to 0.05, and play with this number.

As for the data, please, provide a fully reproducible piece of code and data that would create the strange A plot. You can send the data directly to mdozmorov at vcu . edu

yeyuan98 commented 4 years ago

Hi Prof. Dozmorov,

I also got high FPR at higher A filtering settings in the filter_params plot using raw interaction matrix at 50kb resolution. I tried iced interaction matrix and got similar filter_params plot. I didn't get any warnings/messages during the analysis and the plot is attached. I would try multiHiCcompare shortly, and meanwhile any insights on what's going on would be very helpful.

Below is the piece of code I used after I got a data list which contains two datasets named A and B respectively using cooler2bedpe. After read-in each dataset contains cis and trans interactions.

printANDhictab <- function(p1,p2){
  cat("Processing Chr pair: ",as.character(p1$chr1[1])," X ",as.character(p2$chr1[1]),"\n")
  create.hic.table(p1,p2,scale=FALSE)
}
hic.list <- mapply(printANDhictab,data$A$cis,data$B$cis,SIMPLIFY=FALSE)
hic.list <- total_sum(hic.list)
hic.list <- hic_loess(hic.list, parallel = TRUE)
filter_params(hic.list[[1]], numChanges = 1000, Plot = TRUE)

Thanks! filter_params_50kb filter_params_loess_50kb

mdozmorov commented 4 years ago

Thanks, @yeyuan98, for reporting, and sorry, @csijcs, for the delay replying. This behavior may be due to the data being good (that is, not sparse). If one introduces relatively small changes into data that have wide dynamic range already, these changes won't be detected and FPR will look wild. Try to increase the FC parameter, e.g., filter_params(mtx, FC = 5), and, potentially, increase numChanges. Given the data is good, the A-plot is actually not necessary. It is OK to use the default when calling hic_compare.

yeyuan98 commented 4 years ago

Thanks for the clear explanation! Just curious, could you share a bit detail on how the FPR is calculated in the package? Sorry if this has an obvious answer as I have no biostatistics background.

Also, could you provide a bit detail on how are the p-values adjusted? What I could understand is that, assuming M is approximately normal, we can get a p-value using inverse cumulative distribution function.

BTW, your manuscript on HiCcompare is really well-written and clear :)

mdozmorov commented 4 years ago

The False Positive Rate is calculated after we introduce numChanges true positives (TP). The data is typically noisy, and, when detecting differences, we will detect most of our TPs, but there will be false detections, false positives. These are used to calculate the FPR. So, when I recommended increasing the FC parameter, it will help the detection of TPs and make a reasonable A-plot.

When calculating p-values, M-values are transformed using Z-score transformation, individually at each distance. Z-scores are used to calculate p-values. The p-values are adjusted by the FDR method. By default, they are adjusted at each distance. I suggest ignoring differences at higher distances, the cutoff is debatable, but typically it is the number of bins totaling 2-5Mb length, more distant interactions are well beyond the typical TAD size.

Thank you for the good feedback on the paper. Please, ask more if needed.

csijcs commented 4 years ago

I am using the multiHiCcompare now, which is really very good. I have also used TADcompare which is nice. Are there any plans for integrating the TADcompare in the mutliHiCcompare framework? Right now I have to merge the replicates to call TADs and perform differential TAD analysis. Would be great if I could do it all in one go without merging replicates.

mdozmorov commented 4 years ago

Yes, making our tools better integrated with each other is a priority. The position for such projects is open for postdocs and freelance bioinformaticians http://www.vcujobs.com/postings/98534

mdozmorov commented 4 years ago

Closing as resolved. Please, reopen if still have questions.