Closed by tacazares 3 years ago
To implement the DHS data, we simply replaced the ATAC-seq signal path with the DHS data path. The column name is still ATAC_signal. I have generated DHS data with different slop sizes around the 5′ cut site: 0, 5, 10, 20, and 40. The current test set is just 11 TFs, with 6 cell types for training and GM12878 for testing.
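To make the slop step concrete, here is a minimal sketch of how a window around the 5′ cut site could be computed for each slop size. The function name `slop_cut_site`, the 0-based half-open coordinate convention, the symmetric window, and the example read coordinates are all assumptions for illustration. They are not taken from the actual processing pipeline, which may well use a tool such as bedtools instead.

```python
# Hypothetical helper: the name, coordinate convention, and symmetric window
# are assumptions, not the pipeline's actual implementation.
def slop_cut_site(start, end, strand, slop, chrom_size):
    """Return a 0-based, half-open (left, right) window around the 5' cut site.

    `start`/`end` describe the aligned read as a 0-based, half-open interval.
    """
    # On the + strand the 5' end is the first base of the read;
    # on the - strand it is the last base of the read.
    cut = start if strand == "+" else end - 1
    # Extend by `slop` bases on each side and clip to the chromosome bounds.
    left = max(0, cut - slop)
    right = min(chrom_size, cut + slop + 1)
    return left, right


# The slop sizes listed above, applied to one made-up read.
SLOP_SIZES = [0, 5, 10, 20, 40]
windows = {s: slop_cut_site(100, 136, "+", s, chrom_size=1000) for s in SLOP_SIZES}
```

With a slop of 0 the window is the single cut-site base. Larger slops widen it symmetrically until it hits a chromosome boundary.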
Based on our results using 11 TFs across 6 cell types, we can supplement our data with DHS data. However, performance degrades as more DHS-based data is used.
We will continue to test whether DHS data helps in cases where we have only two ATAC-seq data sets but have additional cell types with DHS. We will test this setup by:
I wanted to test this question again because in #24 we found that the normalization method we used might make it hard to translate our models to other data types. The last results showed no boost in performance and a steady decrease in AUPR as more DHS data was added to the model. I used the same setup to show that our models do not gain from having DHS data added to the training pool.
Old results
New results
Adding DHS data did not hurt model performance, so it might be useful for cases where we want to evaluate on a held-out cell type. Based on the counts of ChIP-seq data, I do not think we will gain much from adding DHS data to training or using it for evaluation, other than being able to compare directly to the ENCODE DREAM challenge.
It does look like adding DHS data had an effect on our model selection process. This one example shows more noise in the validation AUPR at each epoch, a trend seen across most models.
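One way to put a number on the noise described above is the mean absolute change in validation AUPR between consecutive epochs. The sketch below assumes that metric and a hypothetical function name, and it is not how the plots in this issue were produced. The two AUPR lists are made-up toy values, not results from our models.

```python
from statistics import mean


# Hypothetical metric: mean absolute epoch-to-epoch change in validation AUPR.
def epoch_to_epoch_noise(val_aupr):
    """Return the mean absolute difference between consecutive epochs."""
    if len(val_aupr) < 2:
        raise ValueError("need validation AUPR for at least two epochs")
    return mean(abs(b - a) for a, b in zip(val_aupr, val_aupr[1:]))


# Toy values for illustration only (NOT real results).
smooth_curve = [0.10, 0.20, 0.30, 0.40]
noisy_curve = [0.10, 0.30, 0.20, 0.40]

smooth_noise = epoch_to_epoch_noise(smooth_curve)
noisy_noise = epoch_to_epoch_noise(noisy_curve)
```

A curve that climbs steadily scores lower than one that oscillates between the same endpoints, so the value can be compared between ATAC-only and DHS-supplemented training runs.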
We want to test whether we can use DHS data to supplement our ATAC-seq data in our models.
We will use our 11 core TFs and split the training data 50/50 between ATAC-seq and DHS. We will also test a DHS-only model.
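The 50/50 split and the DHS-only model can be sketched as a choice of signal path per training cell type, keeping the ATAC_signal column name as described earlier. This is a minimal sketch only. The function name, the `Cell_Type` and `source` columns, the placeholder cell type names, and the file paths are hypothetical and do not reflect the real meta file.

```python
import random


# Hypothetical helper: picks which cell types get their signal path swapped
# to DHS. Only the "ATAC_signal" column name comes from the issue text.
def assign_signal_paths(cell_types, atac_paths, dhs_paths, dhs_fraction=0.5, seed=0):
    """Return one row per cell type with the chosen signal path."""
    rng = random.Random(seed)
    shuffled = sorted(cell_types)
    rng.shuffle(shuffled)
    # Number of cell types that will use DHS instead of ATAC-seq.
    n_dhs = round(len(shuffled) * dhs_fraction)
    dhs_cells = set(shuffled[:n_dhs])
    rows = []
    for cell_type in sorted(cell_types):
        use_dhs = cell_type in dhs_cells
        rows.append({
            "Cell_Type": cell_type,
            # The column keeps its ATAC name even when it points at DHS data.
            "ATAC_signal": dhs_paths[cell_type] if use_dhs else atac_paths[cell_type],
            "source": "DHS" if use_dhs else "ATAC",
        })
    return rows


# Placeholder names and paths for six training cell types (assumptions).
cells = [f"CT{i}" for i in range(1, 7)]
atac = {c: f"atac/{c}.bw" for c in cells}
dhs = {c: f"dhs/{c}.bw" for c in cells}

mixed_rows = assign_signal_paths(cells, atac, dhs, dhs_fraction=0.5)
dhs_only_rows = assign_signal_paths(cells, atac, dhs, dhs_fraction=1.0)
```

Setting `dhs_fraction=1.0` gives the DHS-only configuration, and fixing `seed` keeps the assignment reproducible across runs.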
Tasks
Updated task list based on https://github.com/MiraldiLab/maxATAC/issues/10#issuecomment-789143042