Slightly different results using example data

sarmapar commented 5 years ago

Hello, I am attempting to replicate the results using the chromosome 22 example to verify that the code is working properly before applying it to my own data. I am running the code on a linux-based cluster with slightly different versions of the necessary packages. I am using anaconda to load the necessary packages, and I am using the following versions:

Anaconda: 4.3.0
Python: 3.6.4
samtools: 1.9
bedtools: 2.29
Macs: 2.1.2
Java: 10.0.2
Juicer Tools: 1.11.09

Python packages:
pyranges: 0.0.63
numpy: 1.17.3
pandas: 0.25.3
scipy: 1.3.1
pyBigWig: 0.3.13

I am not getting any warnings or errors, and I am getting results very similar to the results in the example folder, but my re-calculated results are consistently lower than the example which results in less enhancer-gene pairs, since the cutoff is 0.2. I have included a scatterplot of the ABC scores from the enhancer-gene pairs that both the example and my results have in common, and the IGV visualization of these results below. The mean absolute difference between each ABC score pair was about 0.0026.

compABCscores

Here is a visual comparison, looking at the EnhancerPreductions.bedpe output from the example_chr22/ABC_output/Predictions included here and the EnhancerPreductions.bedpe output from running the code with the exact specifications as in the README file.

Since the results are well correlated, I don't think this is a big issue, but I am not sure why these differences are occurring. It could be some rounding differences on the cluster, or some differences in the versions of the packages I am using, or the fact that I am running the packages through anaconda. Any thoughts?

Best regards

jnasser3 commented 5 years ago

Hello, thanks for writing in.

It would be useful to understand why this is occurring. Could you check the following:

Are the Neighborhoods/EnhancerList.txt files the same in the example vs recalculated (specifically, do they have the same number of rows and is the activity_base column the same?)
Can you make the scatter plot using the 'hic_contact_pl_scaled_adj' column in Predictions/EnhancerPredictionsFull.txt?

These checks will let us know whether the discrepancy is due to peak calling, read counting or something Hi-C related. I suspect it could be related to 0 vs NaN in Hi-C matrices, but we should rule out everything else first.

sarmapar commented 5 years ago

The recalculated Neighborhoods/EnhancerList.txt has 230 more lines than the example, but the ones that are in common seem to have the same activity_base values.

The example and recalculated hic_contact_pl_scaled_adj values are identical, which can be seen in the following scatter plot.

HiC

I also just found the following warnings after the macs2 callpeak output,

ABC-Enhancer-Gene-Prediction-master/src/hic.py:136: RuntimeWarning: invalid value encountered in less
norms[norms < kr_cutoff] = np.nan

ABC-Enhancer-Gene-Prediction-master/src/hic.py:137: RuntimeWarning: invalid value encountered in greater_equal
norms[norms >= kr_cutoff] = 1

which seem to have to do with numpy trying to deal with NaN values.

I've attached the recalculated EnhancerList.txt if that is helpful. EnhancerListRecalculated.txt

jnasser3 commented 5 years ago

Thank you for this analysis and the file. The difference in number of rows of EnhancerList.txt and the equivalence of the Hi-C data suggest this issue is not related to Hi-C.

I realize now that the script I used to generate the results in example_chr22/ had a minor difference to the example commands in the README. I will fix this for the next release of the codebase. Making this change explains the vast majority of the discrepancy between the file in the repository and EnhancerListRecalculated.txt.

It's not exactly the same though. There are still a small set of (~20) enhancer regions with slightly different bounds. I'd still like to understand these discrepancies, will require some further digging - I will get back to you.

jnasser3 commented 4 years ago

I was unable to completely reproduce EnhancerListRecalculated.txt, even after using the version of MACS2 that you used.

I have however fixed the issue whereby the example data in example_chr22/ and the example commands on the README were slightly out of sync.

broadinstitute / ABC-Enhancer-Gene-Prediction

Slightly different results using example data #12