broadinstitute / ABC-Enhancer-Gene-Prediction

Cell type specific enhancer-gene predictions using ABC model (Fulco, Nasser et al, Nature Genetics 2019)
MIT License

Question on use of H3K27ac HiChIP data #16

Closed DoaneAS closed 3 years ago

DoaneAS commented 4 years ago

Hi, Thanks so much for making this code available and usable!

In your NG manuscript, you look at some different options for using H3K27ac HiChIP, as well as including quantitative DHS signal. It appears that ABCsqrt(DHS x H3K27ac HiChIP) and ABC(DHS x H3K27ac HiChIP) performed best.

I have ATAC, H3K27ac ChIP and HiChIP in a lymphoma cell line and primary cells, and used your ABC model, substituting HiChIP in place of HiC. Would you recommend the above mentioned approaches, rather than simply substituting HiC with HiChIP as I did? Do you have plans to implement the above varieties of your ABC model in the published code?

Thanks a lot. If you think it's worth using the HiChIP varieties of your model, I may look into this further and would be happy to share that code.

Best, Ashley

jnasser3 commented 4 years ago

Hi Ashley,

Thanks for reaching out.

Do you have H3K27ac Hi-ChIP or Hi-ChIP for another factor?

My initial impression is that I wouldn't recommend loading a Hi-ChIP file using '--hic_type juicebox' (three column sparse matrix format). The code used to process this type of file is designed for Hi-C datasets. It makes various assumptions unique to Hi-C datasets (such as assuming that the matrix has constant row and column sums) and has special handling for diagonal elements. It's possible that this processing is okay for HiChIP datasets as well, but I’m not confident about this.
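One way to probe the "constant row and column sums" assumption on a given dataset is to compute the row sums of the sparse matrix and see how much they vary. The sketch below is only an illustration, not part of the ABC codebase: the function name `row_sum_cv` is hypothetical, and it assumes the three-column entries (bin_i, bin_j, value) store only the upper triangle of a symmetric matrix, so each off-diagonal value is counted for both bins.

```python
from collections import defaultdict
from statistics import mean, pstdev


def row_sum_cv(entries):
    """Coefficient of variation of row sums for a sparse symmetric matrix.

    entries: iterable of (bin_i, bin_j, value) tuples.
    Assumption: only the upper triangle is stored, so an off-diagonal
    entry contributes to the sums of both bin_i and bin_j.
    """
    sums = defaultdict(float)
    for i, j, value in entries:
        sums[i] += value
        if i != j:
            sums[j] += value
    values = list(sums.values())
    if not values:
        raise ValueError("no matrix entries supplied")
    # A value near 0 means the row sums are close to constant.
    return pstdev(values) / mean(values)


balanced = [(0, 0, 0.5), (0, 1, 0.5), (1, 1, 0.5)]
skewed = [(0, 0, 3.0), (0, 1, 1.0), (1, 1, 1.0)]
print(row_sum_cv(balanced), row_sum_cv(skewed))
```

A value far from zero on a HiChIP matrix would suggest that the Hi-C-specific processing deserves a closer look before trusting the contact values.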

I think our future plans for Hi-ChIP will depend on how good Hi-ChIP is as a predictor of enhancer-gene regulation. Based on the experimental data in our publication, it seems that having H3K27ac Hi-ChIP performs about as well as having H3K27ac ChIP-Seq and Hi-C separately. It will be interesting to see how this evolves as more experimental CRISPR data is generated.

DoaneAS commented 4 years ago

Hi, Thanks for this helpful information. The HiChIP I am using for this is H3K27ac. As far as the contact matrix is concerned, the HiChIP .hic files are processed in the same way as HiC (normalization factors, etc.), starting from the validPairs output from HiC-Pro. I think the assumptions your code makes about that shouldn't be a problem.

Since H3K27ac HiChIP is designed to detect interactions from any loci carrying the H3K27ac mark, I think theoretically it should work to measure the strength of contacts between DNA elements in enhancers and promoters. For the cells I'm working with, we recently generated HiC data, and also did some CRISPRi targeting of DNA elements in super-enhancers. I'm interested in comparing HiC and HiChIP for this and will let you know what we find out.

Best, Ashley

jnasser3 commented 4 years ago

Hi Ashley,

This sounds great. Definitely very interested to hear about how the various prediction methods compare to the CRISPRi data!

A couple of other comments concerning using an H3K27ac Hi-ChIP dataset with the ABC code. Plugging the Hi-ChIP dataset into the --HiCdir argument has two consequences:

(1) The ABC Score column will be computed as ~ sqrt(DHS x H3K27ac) x (H3K27ac Hi-ChIP), so this may be counting H3K27ac 'twice'. For our paper, the Hi-ChIP predictor we used was DHS x (H3K27ac Hi-ChIP). The code will not compute this, but it can be computed from the output files.

(2) The H3K27ac Hi-ChIP data will be stored under column names that look like 'hic_contact*'
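As a rough illustration of recomputing the DHS x (H3K27ac Hi-ChIP) predictor from the output files, the sketch below multiplies a DHS activity value by a contact value for each element-gene pair and normalizes by the per-gene total, which is how the ABC score is normalized in the paper. The column names `dhs_activity`, `hic_contact`, and `TargetGene` are assumptions made for this example (the thread only says the contact columns look like 'hic_contact*'), so they should be checked against the actual prediction file header.

```python
from collections import defaultdict


def dhs_times_contact_scores(rows,
                             activity_col="dhs_activity",   # assumed name
                             contact_col="hic_contact",     # assumed name
                             gene_col="TargetGene"):        # assumed name
    """Return one score per row: (DHS x contact) / per-gene sum of (DHS x contact).

    rows: list of dicts, one per element-gene pair, e.g. from csv.DictReader.
    """
    numerators = [float(r[activity_col]) * float(r[contact_col]) for r in rows]
    totals = defaultdict(float)
    for row, numerator in zip(rows, numerators):
        totals[row[gene_col]] += numerator
    scores = []
    for row, numerator in zip(rows, numerators):
        total = totals[row[gene_col]]
        # Guard against genes whose candidate elements all have zero signal.
        scores.append(numerator / total if total > 0 else 0.0)
    return scores


example = [
    {"dhs_activity": 2, "hic_contact": 0.5, "TargetGene": "G1"},
    {"dhs_activity": 1, "hic_contact": 1.0, "TargetGene": "G1"},
    {"dhs_activity": 4, "hic_contact": 0.25, "TargetGene": "G2"},
]
print(dhs_times_contact_scores(example))
```

For this to match the paper's normalization, the rows passed in would need to include every candidate element considered for a gene, not just the ones that pass a score threshold.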

Lastly, the newest release of the codebase (v0.2.1) supports the upload of Hi-C files in .bedpe format. This is an alternative to using 'juicebox three column format'.

Looking forward to seeing what the results look like, Joe

nevelsk90 commented 4 years ago

Dear Joe,

Thanks for making the tool. I am also trying to use H3K27 HiChip data with predict.py and I just can't make it work. predict.py works without the HiChip data, so the problem is definitely in the bedpe file. The error I get is in the attached file, predict.py_error.txt.

Can you please explain how exactly .bedpe files should be used, i.e. file formatting, naming, specific arguments to predict.py function etc.? Below are the first three lines from my .bedpe file:

chr1 3123052 3127143 chr1 3243788 3247885 . 1
chr1 3127659 3131044 chr1 3147167 3148478 . 1
chr1 3127659 3131044 chr1 3431861 3435042 . 3
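For anyone checking their own file against this layout, here is a minimal sketch of a parser for the eight-column records quoted above (two intervals, a name field, and a score). The function name `parse_bedpe_line` and the dictionary keys are hypothetical, and the thread does not spell out exactly which columns predict.py requires, so this only validates the shape shown in the example lines.

```python
def parse_bedpe_line(line):
    """Parse one whitespace-separated, eight-column bedpe record.

    Assumed layout (taken from the example lines in this thread):
    chr1 start1 end1 chr2 start2 end2 name score
    """
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    chr1, start1, end1, chr2, start2, end2, name, score = fields
    return {
        "chr1": chr1, "start1": int(start1), "end1": int(end1),
        "chr2": chr2, "start2": int(start2), "end2": int(end2),
        "name": name, "score": float(score),
    }


record = parse_bedpe_line("chr1 3123052 3127143 chr1 3243788 3247885 . 1")
print(record["chr1"], record["start2"], record["score"])
```

Running every line of a file through a check like this is a cheap way to rule out malformed rows before suspecting the tool itself.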

Thank you in advance, Tim

jnasser3 commented 4 years ago

Hi Tim,

There were a couple of inconsistencies in the codebase surrounding how bedpe data should be formatted. I've just pushed a fix for these.

I think the way you are running it is correct and should work now. The command and file format you are using look right. Let me know if you're still having issues with the latest release to master.

Best, Joe

nevelsk90 commented 4 years ago

Thanks for the fast reply, the program now works with bedpe files!

Just a small comment: the script will throw an error and stop execution if the bedpe file doesn't have data for some chromosome. For example, I didn't have data for chrM and got:

FileNotFoundError: [Errno 2] No such file or directory: '/home/tim_nevelsk/PROJECTS/PODOCYTE/HiCdata/chrM/chrM.bedpe.gz'

The problem is non-essential because one can run the analysis per chromosome, but it would be nice to fix this bug.

Thanks and regards, Tim

jnasser3 commented 4 years ago

Yeah, it will fail. In the meantime you can run by chromosome as you describe, or you can remove the chrM entries from Neighborhoods/{EnhancerList.txt,GeneList.txt} to skip making predictions on this chromosome.
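The second workaround could be scripted along the following lines. This is a sketch under assumptions, not code from the repository: the helper names are hypothetical, the contact file layout `<HiCdir>/<chrom>/<chrom>.bedpe.gz` is inferred from the error message above, and the list files are assumed to be tab-delimited with a header row whose chromosome column is named `chr`.

```python
import os


def chromosomes_with_contacts(hic_dir, chromosomes):
    """Return the chromosomes that have a contact file under hic_dir.

    Assumed layout, inferred from the FileNotFoundError in this thread:
    <hic_dir>/<chrom>/<chrom>.bedpe.gz
    """
    return [c for c in chromosomes
            if os.path.exists(os.path.join(hic_dir, c, c + ".bedpe.gz"))]


def drop_chromosomes(lines, skip, chrom_col="chr"):
    """Drop rows of a tab-delimited table whose chromosome is in `skip`.

    lines: list of text lines, the first being the header.
    chrom_col: assumed name of the chromosome column.
    """
    header = lines[0].rstrip("\n").split("\t")
    idx = header.index(chrom_col)
    kept = [lines[0]]
    for line in lines[1:]:
        if line.rstrip("\n").split("\t")[idx] not in skip:
            kept.append(line)
    return kept


table = ["chr\tstart\tend\n", "chr1\t100\t200\n", "chrM\t5\t50\n"]
print(drop_chromosomes(table, {"chrM"}))
```

In practice one would read EnhancerList.txt and GeneList.txt into lists of lines, filter both with the set of chromosomes missing from the contact directory, and write the results back after keeping a copy of the originals.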