Removing outliers question

eijynagai commented 7 years ago

Hello,

I started working on Hi-C data available from GEO and my question is how to overcome the extremely large values produced when I run "hicCorrectMatrix correct"? I found the explanation in the closed issue #113. But, even ranging more stringent filtering of bins (from +-2.5 to +-0.5), I still having the message: ERROR:iterative correction:*Error* matrix correction produced extremely large values. This is often caused by bins of low counts. Use a more stringent filtering of bins.

I followed the HiCExplorer tutorial and could get results for 10 kb bins, but now I want to compare with 40 kb and 100 kb bins and got stuck on this point. Do you have any suggestion?

Thank you in advance.

Eijy

joachimwolff commented 7 years ago

Dear Eijy,

as explained in issue #113 the cause for extreme large values are very low counts or zeros. To overcome this issue I recommend two solutions:

The number of counts per bin is for your data to low.
- Have a look at the diagnostic plot which you can create with hicCorrectMatrix diagnostic_plot --chromosomes chr1 chr2 --matrix matrix.h5 --plotName diagnostic_plot.png. Does it look like a gaussian distribution? Try to select threshold values according to this plot.
- Try to increase the values per bin.
- It would be perfect to have replicates. If this is the case, use hicSumMatrices --matrices one.h5 two.h5 three.h5 --outFileName replicatesMatrix.h5
- If you don't have replicates the only chance to increase the number of counts per bin is to decrease the resolution. Use hicMergeMatrixBins --matrix input.h5 --numBins 100 --outFileName mergedMatrix.h5 to merge matrix bins together and increase the count per bin. The output resolution depends on the input resolution and the parameter --numBins. If you had e.g. a matrix with 10 kb and you use numBins with 100, you will get a 1 Mb resolution. Alternatively it helps to rebuild the matrix itself with a lower resolution from scratch.
You can specify the areas to include for the correction with the --chromosomes parameter. Plot the matrix with hicPlotMatrix to see if a chromosome is having no or only low counts and is maybe causing this issue.

I hope I could help you. Don't hesitate to ask again for more explanation.

Best,

Joachim

P.S. We are updating the documentation for the next release. Maybe a look at the updated tutorial helps you too: https://github.com/joachimwolff/HiCExplorer/blob/develop/docs/content/mES-HiC_analysis.rst

eijynagai commented 7 years ago

Hello Joachim,

Thank you very much for your clear explanation. I followed your advice and checked the diagnostic plots. Some of them didn't look exactly like a gaussian distribution because they show some edges in one or two samples, I guess is because I merged the replicates from different restriction enzyme (Ncol + HindIII), is it ok?

The problem was all my samples had low or zero counts for Chromosome Y. Could you tell me why all of them have very low or zero interaction? Have you seen this in your data too? I downloaded my data from 4 different cells and distinct experiments.

Another thing now is that I'm getting zero-size files using findTAD option.

To get the scores, I used:

for res in 10 40 100; do
    if [ "$res" -eq 10 ]; then
        min_depth=30000  #30000
        max_depth=100000 #3000000
        step=10000   #10000
        echo $min_depth $max_depth $step
    elif [ "$res" -eq 40 ]; then
        min_depth=120000 
        max_depth=400000
        step=40000
        echo $min_depth $max_depth $step
    elif [ "$res" -eq 100 ]; then
        min_depth=300000
        max_depth=3000000
        step=100000
        echo $min_depth $max_depth $step
    fi

    mkdir -p TADs/TAD_${res}k
    for i in list; do echo "hicFindTADs TAD_score \
        -m hiCmatrix_${res}k/${i}_merged_corrected.res_${res}kb.h5 \
        --minDepth ${min_depth} --maxDepth ${max_depth} --step ${step} \
        --numberOfProcessors 10 \
        -o TADs/TAD_${res}k/${i}_TAD_${res}kb " > run_hicFindTADs_score_${i}_${res}kb.sh
        qsub -cwd -v PATH=$PATH -l s_vmem=100G,mem_req=100G -e log -o log \
        run_hicFindTADs_score_${i}_${res}kb.sh
    done
done

Can you help me with the parameters? Are they ok to run the tool using 10, 40, 100 kb of bins?

Thank you very much in advance.

Best regards,

Eijy

joachimwolff commented 7 years ago

Dear Eijy,

you should be very careful using different restriction enzymes, different restriction enzymes will lead to different interactions, different cut sites, different read lengths and most important: different biases. It is difficult to correct this and mixing the data generated by different restriction enzymes will probably lead to less accuracy and significant results. Lieberman-Aiden et al. showed this in Comprehensive mapping of long range interactions reveals folding principles of the human genome', 2009:

nihms-194459-f0001

In B, C and D they show an interaction matrix with HindIII and NcoI, all three with different results. For your special case right now I think it does not matter that much because I think you just play around with the data and the tools. You do not have to redo the computation, I know it is always time consuming. (If I am wrong, my apology for my 'play around'. In this case please do not use the mixed matrix!) For a future usage you should handle the data of different restriction enzymes as separated experiments, do all the analysis in parallel and in the end you can compare them. Unfortunately the choice of the restriction enzymes can lead to different results and it is always better to have something to compare to.

Concerning the missing data of the Y chromosome: I have seen that too, I think the most likely explanation is that the Y chromosome was not part of the Hi-C analysis in the wet lab. Why the researches did that I can not answer by certainty, I assume the Y chromosome was left out because it is sex dependent and by disregarding it makes the analysis easier.

I am not surprised that you did not found any TADs with hicFindTADs. Our software uses the Bonferroni correction which is very strict, obviously too strict. In the next release you will have more options, the false discovery rate and 'none' to not use any multiple hypothesis testing. You can try to increase the p-value, sometimes this helps. With a p-value of 1 you will get all possible TAD regions, they do not make that much sense, but it is nice to see how it works in general. Or you can ignore the '0' results and still use the output for the plot, only the black lines indicating the TAD regions are missing, the TADs are then the diamonds.

I hope I could help.

Best,

Joachim

eijynagai commented 7 years ago

Hi Joachim,

Thank you very much for clarifying.

you should be very careful using different restriction enzymes, different restriction enzymes will lead to different interactions, different cut sites, different read lengths and most important: different biases.

Yes, I understand about adding more bias to my analysis. So, since it's only one of the 4 cell samples that has the distinct restriction enzymes, I decided to make 3 analyses in paralle: Ncol, HindIII and Ncol+HindIII.

I have seen that too, I think the most likely explanation is that the Y chromosome was not part of the Hi-C analysis in the wet lab. Why the researches did that I can not answer by certainty, I assume the Y chromosome was left out because it is sex dependent and by disregarding it makes the analysis easier.

And about the Y chromosome and the strict correction also totally make sense. I will continue with the analysis playing with different parameters to compare with an previous one using a different tool. And looking forward for the next version too.

Thank you again for all the support.

Best regards,

Eijy

deeptools / HiCExplorer

Removing outliers question #117