Can we supplement our ATAC models with DHS?

MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks

Apache License 2.0


Can we supplement our ATAC models with DHS? #10

Closed tacazares closed 3 years ago

tacazares commented 3 years ago

We want to test whether we can use DHS data to supplement our ATAC-seq data in our models.

We will use our 11 core TFs and train on a 50/50 mix of ATAC-seq and DHS data. We will also test a DHS-only model.

Tasks

[x] Download all BAM for ENCODE data of interest
[x] Filter for quality
[x] Normalize DHS signal tracks
[x] Average signal tracks
[x] Min-max normalize
[x] Generate run meta files (use slop20 peaks)
[x] Generate ROI
[x] Train model
[x] Select best models
[x] Predict
[x] Benchmark and Analyze
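The averaging and min-max steps in the task list can be sketched as below. This is only an illustration on plain Python lists: the real tracks are genome-wide signal files, and the issue does not say which minimum and maximum the normalization uses, so the function names and the per-track min/max here are assumptions.

```python
from typing import List, Sequence


def average_tracks(tracks: Sequence[Sequence[float]]) -> List[float]:
    """Average several equal-length signal tracks position by position."""
    if not tracks:
        raise ValueError("need at least one track")
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("all tracks must have the same length")
    return [sum(values) / len(tracks) for values in zip(*tracks)]


def min_max_normalize(track: Sequence[float]) -> List[float]:
    """Rescale a track to the range [0, 1]; a flat track maps to all zeros."""
    lo, hi = min(track), max(track)
    if hi == lo:
        return [0.0 for _ in track]
    return [(v - lo) / (hi - lo) for v in track]


# Toy replicate tracks (made-up values, not real DHS signal).
replicates = [[0.0, 2.0, 4.0], [0.0, 4.0, 8.0]]
averaged = average_tracks(replicates)
normalized = min_max_normalize(averaged)
print(averaged, normalized)
```

Averaging before rescaling keeps the final track in [0, 1] regardless of how many replicates went into it.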

Updated task list based on https://github.com/MiraldiLab/maxATAC/issues/10#issuecomment-789143042

[x] Identify the TFs with 4 cell types available
[x] Assess the quality of models trained with 2 of the 4 cell types available.
[x] Add an additional cell type for a total of 2 ATAC + 1 DHS and assess AUPR
[x] Repeat the above step for the 4th sample.

tacazares commented 3 years ago

To bring in the DHS data, we simply replaced the ATAC-seq signal path with the DHS signal path; the column name is still ATAC_signal. I have generated DHS data with slop sizes of 0, 5, 10, 20, and 40 bp around the 5' cut site. The current test set is 11 TFs, with 6 cell types for training and GM12878 for testing.
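A rough sketch of that path swap on a run meta file is shown here. Only the `ATAC_signal` column name comes from this thread; the other column names, the cell type labels, and the file paths are hypothetical placeholders, and the real meta file layout may differ.

```python
import csv
import io
from typing import Dict, List


def swap_signal_paths(
    rows: List[Dict[str, str]], dhs_paths: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return copies of the meta rows with ATAC_signal pointing at a DHS
    track for every cell type listed in dhs_paths."""
    swapped = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        cell = new_row["Cell_Line"]  # hypothetical column name
        if cell in dhs_paths:
            new_row["ATAC_signal"] = dhs_paths[cell]
        swapped.append(new_row)
    return swapped


# Hypothetical two-row meta file; paths and cell type labels are placeholders.
meta_text = (
    "TF\tCell_Line\tATAC_signal\n"
    "ELF1\tcellA\t/data/atac/cellA.bw\n"
    "ELF1\tcellB\t/data/atac/cellB.bw\n"
)
rows = list(csv.DictReader(io.StringIO(meta_text), delimiter="\t"))
updated = swap_signal_paths(rows, {"cellB": "/data/dhs/cellB_slop20.bw"})
for row in updated:
    print(row["Cell_Line"], row["ATAC_signal"])
```

Because the column keeps its name, nothing downstream that reads `ATAC_signal` needs to change.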

tacazares commented 3 years ago

Based on our results using 11 TFs across 6 cell types, we can supplement our data with DHS data, but performance degrades as more DHS-based data is used.

Screen Shot 2021-03-02 at 2 02 46 PM

We will continue to test whether DHS data helps in cases where we have only two ATAC-seq datasets but additional cell types with DHS. We will test this setup by:

[x] Identify the TFs with 4 cell types available
[x] Assess the quality of models trained with 2 of the 4 cell types available.
[x] Add an additional cell type for a total of 2 ATAC + 1 DHS and assess AUPR
[x] Repeat the above step for the 4th sample.
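The stepwise design in the list above (2 ATAC, then 2 ATAC + 1 DHS, then 2 ATAC + 2 DHS) could be enumerated roughly like this. The function name and the `ct1` to `ct4` labels are placeholders, and the thread does not say how the cell types were assigned to ATAC or DHS, so treat the ordering as an assumption.

```python
from typing import Dict, List, Sequence


def incremental_plans(
    atac_cells: Sequence[str], dhs_cells: Sequence[str]
) -> List[Dict[str, List[str]]]:
    """Build one training plan per step: the fixed ATAC cell types plus
    0, 1, 2, ... of the DHS cell types, added in the order given."""
    plans = []
    for n_dhs in range(len(dhs_cells) + 1):
        plans.append({"atac": list(atac_cells), "dhs": list(dhs_cells[:n_dhs])})
    return plans


# Placeholder labels for a TF with 4 cell types available.
plans = incremental_plans(["ct1", "ct2"], ["ct3", "ct4"])
for plan in plans:
    print(plan)
```

Each plan would then be trained and scored by AUPR, so that the change from adding each DHS sample can be read off step by step.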

tacazares commented 3 years ago

I wanted to revisit this question because in #24 we found that the normalization method we used might make it hard to transfer our models to other data types. The previous results showed no boost in performance and a steady decrease in AUPR as more DHS data was added to training. I used the same setup this time, and the new results show that our models do not gain from having DHS data added to the training pool.

Old results

New results

20210616_11TF_GM12878_200bp_chr1_boxplot_comparing_DHS

Adding DHS data did not hurt model performance, so it might be useful for cases where we want to evaluate on a held-out cell type. Based on the counts of ChIP-seq data, I do not think we will gain much from adding DHS data to training or using it for evaluation, other than being able to compare directly to the ENCODE DREAM challenge.
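For reference, AUPR can be approximated with the step-wise average precision shown below. This is a generic sketch, not necessarily how the benchmarking code in this repo computes it (for example, it ignores tied scores and binning), and the labels and scores are made-up toy values.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise average precision: the mean of precision@k taken at the
    rank k of every true positive, with predictions sorted by score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            true_positives += 1
            total += true_positives / rank  # precision at this rank
    return total / n_pos


# Toy example: 1 = bound bin, 0 = unbound bin (hypothetical values).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

A perfect ranking gives 1.0, so differences between the ATAC-only and DHS-supplemented models show up directly in this number.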

Adding DHS data does appear to affect our model selection process, however. The ELF1 example below shows more noise in the validation AUPR from epoch to epoch. This trend was seen across most models.

20210614_ELF1_0DHS_AUPR_200BP_CHR2

20210614_ELF1_1DHS_AUPR_200BP_CHR2

20210614_ELF1_2DHS_AUPR_200BP_CHR2
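One way to put a number on that epoch-to-epoch noise, and to see how it can move the selected epoch, is sketched below. The two AUPR curves are invented for illustration and are not read from the plots above; picking the epoch with the highest validation AUPR is an assumption about the selection rule.

```python
from statistics import pstdev
from typing import Sequence


def best_epoch(val_aupr: Sequence[float]) -> int:
    """Return the 1-based epoch with the highest validation AUPR."""
    return max(range(len(val_aupr)), key=lambda i: val_aupr[i]) + 1


def epoch_to_epoch_noise(val_aupr: Sequence[float]) -> float:
    """Standard deviation of the change in validation AUPR between
    consecutive epochs; larger values mean a jumpier curve."""
    diffs = [b - a for a, b in zip(val_aupr, val_aupr[1:])]
    return pstdev(diffs)


# Invented curves: a smooth one and a noisy one.
smooth_curve = [0.30, 0.35, 0.40, 0.42, 0.43]
noisy_curve = [0.30, 0.45, 0.28, 0.44, 0.33]

print(best_epoch(smooth_curve), epoch_to_epoch_noise(smooth_curve))
print(best_epoch(noisy_curve), epoch_to_epoch_noise(noisy_curve))
```

With a noisy curve the maximum can land on an early spike, which is why noisier validation AUPR makes the choice of best model less stable.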