MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

What is the fourth column in the bed file the variants function produces? #111

Closed songzeji closed 1 year ago

songzeji commented 2 years ago

Hi, I'm using the variants function to study the TF binding properties of some eQTLs. How should I interpret the score recorded in the fourth column of the bed file that the variants function produces?

tacazares commented 2 years ago

Hello @songzeji, the fourth column in that file should represent the maxATAC score for that interval. There should also be a .bw file produced that carries the same signal. These values should be between 0 and 1.

You should interpret each TF model independently using the information in the maxatac_data directory. This can also be found here: https://github.com/MiraldiLab/maxATAC_data. You can use the transcription factor model validation data to help choose a threshold that suits your study.

For example, if you want to use the IRF3 model, look in this directory: https://github.com/MiraldiLab/maxATAC_data/tree/main/models/IRF3

There should be 3 files:
1) A .h5 file. This is the model file.
2) A file with the ending validationPerformance_vs_thresholdCalibration.png. This is a figure of the validation curves and their performance at different cutoffs.
3) A file with the ending validationPerformance_vs_thresholdCalibration.tsv. This tsv file has the information from the figure in 2.

Using the .tsv file, you should choose whether you want to prioritize recall or precision. Say you want to choose a threshold that is roughly equivalent to 90% precision (column: Monotonic_Avg_Precision). This corresponds to a Standard_Thresh of ~0.96 according to the calibration file: https://github.com/MiraldiLab/maxATAC_data/blob/main/models/IRF3/IRF3_validationPerformance_vs_thresholdCalibration.tsv#L966
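As a rough illustration of that lookup, here is a minimal Python sketch that scans a calibration table for the lowest Standard_Thresh whose Monotonic_Avg_Precision reaches a target. The two column names come from the calibration file mentioned above. The function name pick_threshold, the assumption that the file is tab-delimited with a header row, and the toy numbers are all illustrative and are not taken from maxATAC itself.

```python
import csv
import io
import math


def pick_threshold(tsv_text, target_precision,
                   precision_col="Monotonic_Avg_Precision",
                   thresh_col="Standard_Thresh"):
    """Return the lowest threshold whose precision meets the target, or None.

    Hypothetical helper: assumes a tab-delimited table with a header row
    containing the two columns named above.
    """
    best = None
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            precision = float(row[precision_col])
            thresh = float(row[thresh_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if math.isnan(precision) or math.isnan(thresh):
            continue
        if precision >= target_precision and (best is None or thresh < best):
            best = thresh
    return best


# Toy calibration table with made-up numbers (NOT the real IRF3 values).
toy_tsv = (
    "Standard_Thresh\tMonotonic_Avg_Precision\n"
    "0.90\t0.70\n"
    "0.95\t0.85\n"
    "0.96\t0.90\n"
    "0.97\t0.93\n"
)

threshold = pick_threshold(toy_tsv, target_precision=0.90)
print(threshold)
```

To try it on a real calibration file, you could read the file's contents into a string and pass that in place of toy_tsv. Taking the lowest qualifying threshold keeps recall as high as possible for the chosen precision; a different rule may suit your study better.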

So for every row or interval in your output file that has a score above that threshold, you can label it as bound. There are many ways to interpret the output of these models, but our approach tries to tie the thresholds back to performance on the validation data set.
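A minimal Python sketch of that labeling step is below. It assumes the output is a tab-delimited bed file with chromosome, start, end, and the maxATAC score in the fourth column, and it uses the approximate 0.96 cutoff from the IRF3 example. The function name label_bound, the toy intervals, and the choice to treat a score equal to the cutoff as bound are illustrative assumptions, not part of maxATAC.

```python
def label_bound(bed_lines, threshold):
    """Yield (chrom, start, end, score, is_bound) for each bed line.

    Hypothetical helper: assumes tab-delimited lines with the score in
    the fourth column and no header row.
    """
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        score = float(fields[3])
        # Assumption: a score equal to the cutoff counts as bound.
        yield chrom, start, end, score, score >= threshold


# Toy intervals with made-up scores, standing in for the variants output.
toy_bed = [
    "chr1\t1000\t1032\t0.98",
    "chr1\t2000\t2032\t0.41",
    "chr2\t500\t532\t0.96",
]

threshold = 0.96  # approximate cutoff for ~90% precision in the IRF3 example
labeled = list(label_bound(toy_bed, threshold))
for chrom, start, end, score, is_bound in labeled:
    print(chrom, start, end, score, "bound" if is_bound else "unbound")
```

For a real file, you could pass an open file handle instead of toy_bed, since the function only iterates over lines.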

songzeji commented 2 years ago

Hi @tacazares,

Thank you so much for your clear and detailed explanation. It really helps!