GuanLab / Leopard


Prediction result only ranging from [0.4,0.5] #12

Open mingyanisa opened 2 years ago

mingyanisa commented 2 years ago

Hi! I tried training the Leopard model on only the DNA sequence of the reference genome, removing DNase-seq / delta DNase from the input features. However, the predictions only range over [0.4, 0.5] and do not capture any peaks, even though the AUPRC score is high. Has anyone else experienced this issue?

yang-dongxu commented 2 years ago

Hello, I have run into the same problem. How did you solve it?

yuanfangguan commented 2 years ago

We will see if the lead author gives a different comment, but let me give my perspective here.

When you use DNA sequence alone, that information is not supposed to tell whether a TF binds or not. Therefore, a good model should not give extremely large or small values, as there is not sufficient confidence.

I am surprised that the AUPRC is high, and I doubt that it is, as the baseline is so low due to the extremely limited number of positive examples. I think only the AUROC would be high in this case.

yang-dongxu commented 2 years ago

Sorry, I realized it's due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions on how to generate the bigwig for training to the README. Thank you :>

Hongyang449 commented 2 years ago

Hi, the model based on DNA only (without DNase-seq) will not be very informative: for the same TF, it cannot distinguish different binding profiles in different cell types. As Yuanfang mentioned, the model will be more "conservative" in its predictions, and the values could be around 0.5. The key information from DNase-seq that is needed for high-confidence predictions is missing. That is also why, for example, traditional motif-based models have many false-positive peaks.

The AUPRC/AUROC scores could be high even if the values only range over [0.4, 0.5]. This is because the AUPRC/AUROC scores are determined by the ranking of the predictions rather than their absolute values. For example, consider a simple task of predicting four positions, with predictions (A) = 0.1, 0.3, 0.9, 0.5 and predictions (B) = 0.48, 0.49, 0.51, 0.50. Both rank the four positions in the same order, so (A) and (B) have the same AUPRC/AUROC scores.

For this specific TF-binding task, the percentage of binding sites (the AUPRC baseline) is very low, so the AUPRC scores are typically low.
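To make the ranking argument concrete, here is a minimal sketch in plain Python. The two prediction vectors are the ones from the example above; the ground-truth `labels` are hypothetical and chosen only for illustration, and the two metric functions are simple textbook definitions (pairwise AUROC, average precision as the AUPRC estimate, no special handling of tied scores in the latter), not the evaluation code used by Leopard.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [0, 1, 1, 0]             # hypothetical ground truth, not from the thread
pred_a = [0.1, 0.3, 0.9, 0.5]     # predictions (A)
pred_b = [0.48, 0.49, 0.51, 0.50] # predictions (B)

print(auroc(labels, pred_a), auroc(labels, pred_b))
print(average_precision(labels, pred_a), average_precision(labels, pred_b))
```

With these assumed labels, both vectors give identical scores because only the order of the four positions enters either metric; squeezing the values into [0.48, 0.51] changes nothing.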

Hongyang449 commented 2 years ago

> Sorry, I realized it's due to the bigwig I used for training: I used the signal bigwig directly rather than the peak bigwig. It would be helpful to add instructions on how to generate the bigwig for training to the README. Thank you :>

To generate peak bigwig files, you usually need two steps: (1) call peaks using whatever software you prefer and (2) convert the peak files into bigwig format. Once I have the peak values, I convert them into bigwig using some in-house code. You can check out lines 54-72 in this code for reference. Thank you!
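For readers without access to the in-house code, step (2) could be sketched roughly as follows. This is not the script referenced above. It assumes peaks are already available as `(chrom, start, end)` tuples (for example parsed from a BED-like peak file), that a value of 1.0 inside peaks is what you want, and that the third-party `pyBigWig` package is installed for the final write. The function names `peaks_to_entries` and `write_peak_bigwig`, and the example coordinates, are hypothetical.

```python
from collections import defaultdict


def peaks_to_entries(peaks, chrom_sizes, value=1.0):
    """Merge overlapping peaks into sorted, non-overlapping
    (chrom, start, end, value) entries, clipped to chromosome bounds."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        if chrom not in chrom_sizes:
            continue  # skip contigs that will not be in the bigwig header
        start, end = max(0, start), min(end, chrom_sizes[chrom])
        if start < end:
            by_chrom[chrom].append((start, end))

    entries = []
    for chrom in chrom_sizes:  # dict order doubles as header order
        merged = []
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)  # extend previous interval
            else:
                merged.append([start, end])
        entries.extend((chrom, s, e, value) for s, e in merged)
    return entries


def write_peak_bigwig(entries, chrom_sizes, path):
    """Write the entries with pyBigWig (imported lazily; third-party)."""
    import pyBigWig

    bw = pyBigWig.open(path, "w")
    bw.addHeader(list(chrom_sizes.items()))
    if entries:
        chroms, starts, ends, values = zip(*entries)
        bw.addEntries(list(chroms), list(starts), ends=list(ends), values=list(values))
    bw.close()


# Hypothetical example data, for illustration only.
chrom_sizes = {"chr1": 1000, "chr2": 15}
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20), ("chr1", 50, 60)]
entries = peaks_to_entries(peaks, chrom_sizes)
print(entries)
# write_peak_bigwig(entries, chrom_sizes, "peaks.bigwig")  # requires pyBigWig
```

Note that this sketch leaves positions outside peaks without any value rather than writing explicit zeros. Whether the training code expects one or the other is not stated in this thread, so check that against the referenced script before using it.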