Bedgraph output - Githubissues

Politropos commented 4 years ago

I have tried to use the bedgraph output from HICHiP-peaks to run in diff-FITHICHIP that runs EdgeR to populate, compare and find differentials in 1D bins for CHIp coverage. I think the bdg format is not compatible as I get no differentials . However, if I use bedgraphs inferred by FITHICHIP or HICHIpper peak finding utilities both of which use MACS2 on the same sample it works fine. If I visualize HICHIP-peaks and FITHICHIPpeaks bedgraph files on a browser I see similar pattern of enriched regions. Can you advise on how to get this to work with the HICHIP-peak bedgraph output Thank you

ChenfuShi commented 4 years ago

Hello, I am not really familiar with diff-fithichip, but I'm just having a look at the code right now. There's two things that come into my mind immediately looking at their code. 1) are you using the bam files for the other methods you are describing?

2) if you are using bedgraphs from other softwares, are you sure the differential peaks are actually where you expect them to be? even visually?

The reason is that if you look at their code here: https://github.com/ay-lab/FitHiChIP/blob/master/Imp_Scripts/DiffAnalysisHiChIP.r lines 777-795

if (tools::file_ext(ChIPAlignFileList[i]) == "gz") {
        # gzipped bedgraph format
        if (i == 1) {           
            system(paste("zcat", ChIPAlignFileList[i], "| bedtools coverage -a", TargetBinnedChrFile, "-b stdin -counts >", MergedChIPCovFile))
        } else {
            system(paste("zcat", ChIPAlignFileList[i], "| bedtools coverage -a", TargetBinnedChrFile, "-b stdin -counts | cut -f4 >", tempfile1))
            system(paste("paste", MergedChIPCovFile, tempfile1, ">", tempfile2))
            system(paste("mv", tempfile2, MergedChIPCovFile))           
        }
    } else {
        # either BAM file or plain bedgraph format
        if (i == 1) {           
            system(paste("bedtools coverage -a", TargetBinnedChrFile, "-b", ChIPAlignFileList[i], "-counts >", MergedChIPCovFile))
        } else {            
            system(paste("bedtools coverage -a", TargetBinnedChrFile, "-b", ChIPAlignFileList[i], "-counts | cut -f4 >", tempfile1))
            system(paste("paste", MergedChIPCovFile, tempfile1, ">", tempfile2))
            system(paste("mv", tempfile2, MergedChIPCovFile))
        }       
    }

You can see that they are using bedtools coverage both the BAM file input and the bedgraph input, which is not the right tool for this operation. When using bedtools coverage -counts it will just count the number of tags in the region overlapping that bin, which is ok for BAM files, but obviously bedgraphs will already have the pileup and then the value on the 4th column. If it works at all with other bedgraphs it's because I assume that their bins are not the same in the two files, which results in having different number of counts in the bins from the two samples. In HiChIP-Peaks the bins in the bedgraph are the same in all files, this means that you get all the same "coverage" using the bedtools coverage option, which is what I can see from their temporary files.

So if you want it to run that script with the ChIP signal from HiChIP-peaks (or any other bedgraph in that regard), you should edit their code to not use bedtools coverage. As a solution I would propose to either split the bedgraph into 1bp bins using bedtools genomecov, and then doing an average every binsize bp (5k or 10k), or you can use the right tool for this analysis which is bedops bedmap. That one is the specific tool to take a bedgraph and get the average signal over a bed feature. https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html#bedmap

Let me know if this satisfies your question! Best Chenfu

baishengjun commented 1 year ago

Hi, Does the genrated bedgraph file was normalized ?

Thanks, Bai

ChenfuShi commented 1 year ago

Hi, The bedgraph file is not normalized between samples.

baishengjun commented 1 year ago

Hi, Thanks for your quick reply, and I also have an another question. The diff_peak commond seems return the normalized data, but the deseq2 software and other differential analysis tools usually accept count matrix (not normalized).

Thanks, Bai

ChenfuShi commented 1 year ago

the counts are not normalised between samples, but rather they have been previously corrected. You still have to use deseq2 normalisation to corrected between samples. For DESeq2 you'll have to round the numbers to integers in R.

baishengjun commented 1 year ago

Thanks again.

ChenfuShi / HiChIP_peaks

Bedgraph output #15