Some questions for positive and negative training dataset

XiaoTaoWang / NeoLoopFinder

A computation framework for genome-wide detection of enhancer-hijacking events from chromatin interaction data in re-arranged genomes

Xiaotao,

1.) In your Nature Communications Peakachu paper, Fig. 1 indicates that GM12878 is used as the training dataset, but Supplementary Data 1 lists GM12878, K562, H1ESC, and mESC as positive training sets. Could you confirm which is correct? As I understand it, GM12878 and K562 should differ to some degree when used as positive training data.

2.) For NeoLoopFinder, did you use the same cell line for the positive training set? The Methods section says that you " ...trained Peakachu models on both situ and dilution Hi-C map of the GM12878". Does this mean that only the GM12878 cell line was used, with in situ and dilution Hi-C maps at several resolutions?

3.) For the negative set, I understand that random selection with the same 11x11 window should be fine. However, if the negative training set is not diverse and balanced, I would expect general performance to suffer. Are the negative samples all from the same cell line (GM12878)? Does Peakachu balance the negative set across different resolutions?
Thank you in advance

Hi,

Regarding your first two questions, I want to clarify that in each training run, Peakachu is essentially learning the contact patterns that can be used to predict chromatin loops genome-wide. You therefore don't need to train a separate model for each cell line, because the contact patterns of chromatin loops should be similar across cell lines. As a proof of concept, Fig. 5 of our original Peakachu paper shows that loop predictions from models trained in different cell lines are comparable. In practice, we released pre-trained GM12878 models at different sequencing depths (https://github.com/tariks/peakachu#using-peakachu-as-a-standard-loop-caller), so that users can choose an appropriate model to predict loops in a Hi-C contact map from any cell line at a similar sequencing depth.

About your third question, NeoLoopFinder internally trained models using GM12878 Hi-C at different resolutions, and for each training run, the negative set was randomly selected from the same GM12878 Hi-C map at the same resolution. We did test different strategies for sampling negative sets, and the one described in the original Peakachu paper achieved the best performance.
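
For readers following the thread, here is a minimal sketch of what "randomly selecting negatives from the same map at the same resolution" can look like. It is not the Peakachu or NeoLoopFinder implementation: the function name `sample_negative_windows`, the uniform sampling of window centers, and the toy matrix are all assumptions made for illustration, and the paper's actual sampling strategy may differ (for example in how genomic distance is handled).

```python
import numpy as np


def sample_negative_windows(matrix, positives, n_samples, half_width=5, seed=0):
    """Hypothetical helper: draw random non-loop window centers from one
    contact matrix and return the flattened windows around them.

    half_width=5 gives the 11x11 window discussed in the question.
    """
    rng = np.random.default_rng(seed)
    n = matrix.shape[0]
    positive_set = {tuple(p) for p in positives}
    chosen = set()
    centers, windows = [], []
    attempts, max_attempts = 0, 100 * n_samples  # guard against endless loops

    while len(centers) < n_samples and attempts < max_attempts:
        attempts += 1
        # Keep the whole window inside the matrix.
        i = int(rng.integers(half_width, n - half_width))
        j = int(rng.integers(half_width, n - half_width))
        if i > j:
            i, j = j, i  # work in the upper triangle only
        if (i, j) in positive_set or (i, j) in chosen:
            continue  # skip known loops and duplicates
        chosen.add((i, j))
        window = matrix[i - half_width:i + half_width + 1,
                        j - half_width:j + half_width + 1]
        centers.append((i, j))
        windows.append(window.ravel())

    return centers, np.array(windows)


# Toy symmetric matrix standing in for one chromosome at one resolution.
toy_rng = np.random.default_rng(42)
upper = np.triu(toy_rng.random((50, 50)))
toy_matrix = upper + upper.T
toy_positives = [(10, 20), (15, 30)]  # made-up loop coordinates (bin indices)

neg_centers, neg_features = sample_negative_windows(
    toy_matrix, toy_positives, n_samples=20
)
```

Because both the positive coordinates and the sampled negatives index into the same matrix, the two classes automatically share the cell line and the resolution. Training at another resolution would mean repeating the sampling on the matrix binned at that resolution.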

Best, Xiaotao

Some questions for positive and negative training dataset #18