broadinstitute / ABC-Enhancer-Gene-Prediction

Cell type specific enhancer-gene predictions using ABC model (Fulco, Nasser et al, Nature Genetics 2019)
MIT License
206 stars 63 forks source link

ChIP-seq input correction and questions #14

Closed josruirod closed 4 years ago

josruirod commented 5 years ago

Hi, Thank you very much for your work. Looks very interesting! So I was trying to use my own data when I have realized that in our experiment, we have an input ChIP-seq file to correct the H3K27ac one. In the function run.neighborhoods.py, for the parameter -H3K27ac, I guess the file is supposed to be already input-corrected? Or there is any other way of providing the input file and the software will perform correction?

Thanks!

jnasser3 commented 5 years ago

Hi, thanks for your interest. The codebase does not support input-correcting a ChIP-Seq bam file, you would need to do this on your own.

The --H3K27ac flag accepts both .bam and .bigwig inputs. So you could pass an input-corrected bigwig as the H3K27ac parameter

josruirod commented 5 years ago

Oh great thank you. I must have missed that.

I'm encountering another problem though. When I'm providing an input-corrected bigwig file to -H3K27ac, I'm getting the error: "NameError: name 'open_bigwig' is not defined"

Thanks for your support

EDIT: The line error in neighborhoods.py was line 500. I added the line "from pyBigWig import open as open_bigwig" before and it seems to run, but now the log shows the error: IndexError: index 0 is out of bounds for axis 0 with size 0

I have observed the "GeneList.TSS1kb.bed" file is empty. I'm using data from another organism... not sure if using other organism is supported and this might be the cause

jnasser3 commented 5 years ago

Epigenetic data from another organism is supported as long as the reference files are also from this organism.

Are the --genes, --whitelist, --blacklist, --chrom_sizes files all from your organism? The chromosome names in these files must also match those in the bigwig. The error you are seeing may be a chromosome name mismatch.

josruirod commented 5 years ago

Hi, so I sorted out the problem in my case. After the +-500 operation from the TSS, the "GeneList.TSS1kb.bed" had some negative numbers in the start coordinate column, because in my organisms there are few genes very close to the chromosome endings. Then, in neighborhoods.py, when sorting this file there was an error (not printed), the file became empty and the counting steps failed. Removing these conflictive genes from the file provided solved the issue and everything is working, thanks!

Just a couple of final questions to clarify:

Thanks for your time!

jnasser3 commented 5 years ago

Glad you figured this out. Thanks for pointing out the corner case with respect to gene TSS within 500bp of the chromosome boundaries - I will look to correct this in the future.

Re Hi-C: The model requires some estimate of the Contact frequency between putative enhancers and gene TSS. Hi-C is the best method we have for measuring this genome-wide - so this is what we currently use. Is there Hi-C available in your organism? (or some other experiment which measures physical proximity for the genes and enhancers you are interested in?)

We also show that (in human) Hi-C contact frequency can be very well approximated by a power-law function of genomic distance. Assuming this is also the case in your organism, you can just use the power-law relationship to estimate Contact instead of experimentally measured Hi-C. This would require passing the code the exponent of the powerlaw (--hic_gamma). The current code computes this model anyway, but I think it will fail if Hi-C is not provided. With some modifications to the code, it would be possible to compute a powerlaw only version.

Re expression: Yes, activity_base is a property of the enhancer only (is is the geometric mean of DHS and H3K27ac read counts in the enhancer region).

The expression_table argument is only used to select which genes the code will make predictions for (default to TPM > 1). Passing the expression_table is optional. See the 'Gene Expression' section of the README for more details

josruirod commented 4 years ago

Thanks for the support! Regarding the corner case of 500bp within chromosome boundaries, at least an error should be clearly reported, so the user knows which genes/TSS remove from the original provided list.

Regarding Hi-C, to our knowledge there is no Hi-C or measurement of physical proximity in our organism. Not sure the power-law relationship to estimate contact would work well in our organism, but we would definitely try the power-law only version of the function. Now indeed it seems that is failing without Hi-C.

jnasser3 commented 4 years ago

Hello,

Release v0.2.1 of the codebase addresses both these comments.

The code will now check if extending TSS 500bp exceeds chromosomal bounds (as specified by --chrom_sizes). In such cases the extension will be clipped to the chromosomal bounds. So it should not be necessary to remove any genes from the list.

The code has also been modified to run without Hi-C data. In such cases it will compute the powerlaw model using the exponent provided by --hic_gamma_reference. Since the ABC Score is not computed in this case, you will need to set the --score_column parameter accordingly (--score_column powerlaw.Score).

josruirod commented 4 years ago

Great news, we will test the power-law only version of the function.

Thanks for the support

josruirod commented 4 years ago

Hi there So at last we want to test the power-law only version of the function. Sorry it took so long. I don't find it on the Readme, and I'm not sure I fully understand your last comment. I can run the ABC pipeline until this step where the HIC was required.

Can you provide an example on how to run the predict.py without HI-C data to get a gene assignation for the regions?

Thank you very much for the support

jnasser3 commented 4 years ago

Hi,

I just updated the README with some info on this. Basically you should remove the --HiCdir, --hic_resolution and --scale_hic_using_powerlaw arguments from predict.py and include the --score_column powerlaw.Score and --hic_gamma_reference arguments in predict.py.

Example command:

python src/predict.py \
--enhancers example_chr22/ABC_output/Neighborhoods/EnhancerList.txt \
--genes example_chr22/ABC_output/Neighborhoods/GeneList.txt \
--threshold .02 \
--cellType K562 \
--outdir example_chr22/ABC_output/Predictions/ \
--make_all_putative \
--hic_gamma_reference .87 \
--score_column powerlaw.Score

--hic_gamma_reference will be the powerlaw exponent (.87 is the default value fit in K562 cells). When running in this mode, the ABC.Score column will be set to NaN. The powerlaw.Score column should be used.

Let me know if this works for you.

josruirod commented 4 years ago

Great, thank you very much. Will let you know when we test it. Any recommendation on the powerlaw exponet to use or how to determine it? Thanks

jnasser3 commented 4 years ago

Ideally the exponent should be derived from a Hi-C experiment in your organism. Is there any Hi-C in your organism (even from a different cell type or tissue from the one you are interested in?)

If not, then an exponent near ~1 is likely appropriate. Hi-C experiments in a range of organisms (human, mouse, yeast, drosophila, neurospora, refs 8-12 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6457944/) result in exponents in the range of .8 - 1.2.

josruirod commented 4 years ago

There is no Hi-C in our organism, but those refs are extremely helpful. Thank you very much. Will keep you posted if we need anything else. Regards

josruirod commented 4 years ago

Hi, sorry I didn't answer. I can confirm this approach without Hi-C data worked in our case. Thanks for the support