MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

Can we supplement our ATAC models with DHS? #10

Closed tacazares closed 3 years ago

tacazares commented 3 years ago

We want to test whether we can use DHS data to supplement our ATAC-seq data in our models.

We will use our 11 core TFs and train on a 50/50 split of ATAC-seq and DHS data. We will also test a DHS-only model.

Tasks

Updated task list based on https://github.com/MiraldiLab/maxATAC/issues/10#issuecomment-789143042

tacazares commented 3 years ago

To use the DHS data, we simply replaced the ATAC-seq signal path with the DHS data path; the column name is still ATAC_signal. I generated DHS data with different slop sizes around the 5' cut site: 0, 5, 10, 20, and 40. The current test set is 11 TFs, with 6 cell types for training and GM12878 for testing.
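The path swap and the slop step described above can be sketched roughly as follows. This is a minimal illustration, not the maxATAC code: only the ATAC_signal column name comes from this thread, while the `Cell_Type` column, the file paths, the function names, and the half-open interval convention for the slop are assumptions made for the example.

```python
def swap_signal_paths(rows, dhs_paths, signal_column="ATAC_signal"):
    """Return copies of metadata rows with the signal path replaced by a DHS path.

    rows: list of dicts, one per training entry (hypothetical layout).
    dhs_paths: dict mapping cell type -> DHS signal file path (hypothetical).
    Cell types missing from dhs_paths keep their ATAC-seq path.
    """
    swapped = []
    for row in rows:
        new_row = dict(row)  # copy so the original metadata is untouched
        cell_type = new_row["Cell_Type"]  # assumed column name
        if cell_type in dhs_paths:
            # The column keeps its ATAC_signal name even though it now points at DHS.
            new_row[signal_column] = dhs_paths[cell_type]
        swapped.append(new_row)
    return swapped


def slop_cut_site(position, slop, chrom_size):
    """Extend a 1 bp cut site [position, position + 1) by `slop` on each side.

    The interval is half-open and clipped to [0, chrom_size).
    """
    start = max(0, position - slop)
    end = min(chrom_size, position + 1 + slop)
    return start, end


# Illustrative inputs only; these paths do not come from the project.
rows = [
    {"Cell_Type": "A", "ATAC_signal": "atac/A.bw"},
    {"Cell_Type": "B", "ATAC_signal": "atac/B.bw"},
]
swapped = swap_signal_paths(rows, {"B": "dhs/B_slop20.bw"})
intervals = [slop_cut_site(100, s, 1000) for s in (0, 5, 10, 20, 40)]
```

Copying each row keeps the ATAC-only metadata available, so the same table can be reused to build the 0, 1, 2, ... DHS variants.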

tacazares commented 3 years ago

Based on our results using 11 TFs across 6 cell types, we can supplement our data with DHS data, but performance degrades as more DHS-based data is used.

Image: Screen Shot 2021-03-02 at 2 02 46 PM

We will continue to test whether DHS data helps in cases where we have only two ATAC-seq data sets but additional cell types with DHS. We will test this setup by:

tacazares commented 3 years ago

I wanted to revisit this question because in #24 we found that the normalization method we used might make it hard to transfer our models to other data types. The previous results showed no boost in performance and a steady decrease in AUPR as more DHS data was added to the model. I used the same setup again, and the results show that our models do not gain from having DHS data added to the training pool.

Old results: image

New results: 20210616_11TF_GM12878_200bp_chr1_boxplot_comparing_DHS

Adding DHS data did not hurt model performance, so it might be useful in cases where we want to evaluate on a held-out cell type. Based on the counts of ChIP-seq data, I do not think we will gain much from adding DHS data to training or using it for evaluation, other than being able to compare directly to the ENCODE DREAM challenge.
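For readers comparing the old and new results, AUPR can be approximated as average precision over the ranked predictions. The sketch below is a generic pure-Python version, not necessarily how maxATAC computes it; the function name and the 0/1 label encoding are assumptions, and tied scores are ranked in input order rather than grouped.

```python
def average_precision(labels, scores):
    """Approximate the area under the precision-recall curve as average precision.

    labels: iterable of 0/1 ground-truth values (1 = bound region).
    scores: iterable of predicted scores, higher = more likely bound.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("AUPR is undefined without positive examples")

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank
    return precision_sum / total_positives


# Toy example: positives ranked 1st and 3rd out of four predictions.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Because AUPR depends on the fraction of positive examples, values are only comparable between runs evaluated on the same regions and labels, as in the GM12878 comparison here.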

Adding DHS data did appear to affect our model selection process. This ELF1 example shows more noise in the validation AUPR from epoch to epoch, and the same trend was seen across most models.

20210614_ELF1_0DHS_AUPR_200BP_CHR2

20210614_ELF1_1DHS_AUPR_200BP_CHR2

20210614_ELF1_2DHS_AUPR_200BP_CHR2
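One way to put a number on the per-epoch noise seen in these curves is the mean absolute change in validation AUPR between consecutive epochs. The sketch below assumes that model selection picks the epoch with the highest validation AUPR; the function names and the example values are purely illustrative and are not measurements from these runs.

```python
from statistics import mean


def epoch_to_epoch_noise(val_aupr):
    """Mean absolute change in validation AUPR between consecutive epochs."""
    if len(val_aupr) < 2:
        raise ValueError("need at least two epochs")
    return mean(abs(b - a) for a, b in zip(val_aupr, val_aupr[1:]))


def best_epoch(val_aupr):
    """1-based index of the epoch with the highest validation AUPR (first on ties)."""
    return max(range(len(val_aupr)), key=lambda i: val_aupr[i]) + 1


# Made-up curves for illustration only.
smooth_curve = [0.30, 0.32, 0.34, 0.35]
noisy_curve = [0.30, 0.36, 0.28, 0.35]

smooth_noise = epoch_to_epoch_noise(smooth_curve)
noisy_noise = epoch_to_epoch_noise(noisy_curve)
smooth_best = best_epoch(smooth_curve)
noisy_best = best_epoch(noisy_curve)
```

With a noisy curve, the selected epoch can land on an isolated spike rather than a stable plateau, which is why the noise matters for model selection.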