GuanLab / Leopard


Query on cutoff for true TF binding from prediction #10

Closed 3 years ago

Al-Murphy commented 3 years ago

Hi,

I was wondering whether the cutoff for a model's predicted transcription factor binding is 0.5 when using Leopard?

I think this is dealt with in the score_record function in auc.py, but it isn't clear to me exactly what it is doing. Is there any chance you could clarify, firstly, what this function does and, secondly, what cutoff marks a positive versus a negative prediction from Leopard's continuous predicted score?

Thanks, Alan.

Hongyang449 commented 3 years ago

Hi Alan,

Leopard only provides continuous predicted scores, without any cutoff for positive/negative binarization. You can select a cutoff based on your needs - it is always a tradeoff among precision, recall, and false positive rate.

Therefore, instead of using an arbitrary cutoff to calculate precision/recall/other metrics, the two functions in auc.py calculate the areas under the precision-recall curve and the ROC curve. The score_record function first divides [0, 1] into bins - e.g. 10^5 = 100,000 bins if the argument input_digits = 5. Then, for each bin, it counts the true positives/false positives/others among the predicted values that fall within that bin. These outputs are then passed to the calculate_auc function to calculate the areas under the two curves.
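The binning-then-AUC scheme described above can be sketched as below. This is a minimal illustration, not the actual code from auc.py - the function names mirror those in the repo, but the bodies and signatures are assumptions. The idea: histogram the predictions into 10^input_digits bins, then sweep the cutoff from high to low via cumulative sums to trace out the ROC and precision-recall curves.

```python
import numpy as np

def score_record_sketch(y_true, y_pred, input_digits=5):
    """Bin predicted scores in [0, 1] into 10**input_digits bins.

    Returns, per bin, the number of positive labels and the total count.
    (A sketch of what score_record in auc.py computes, not the real code.)
    """
    n_bins = 10 ** input_digits
    # Map each prediction to a bin index; clip 1.0 into the last bin.
    bins = np.minimum((np.asarray(y_pred) * n_bins).astype(int), n_bins - 1)
    pos = np.bincount(bins, weights=y_true, minlength=n_bins)  # positives per bin
    tot = np.bincount(bins, minlength=n_bins).astype(float)    # totals per bin
    return pos, tot

def calculate_auc_sketch(pos, tot):
    """Compute auROC and auPRC from the per-bin counts.

    Sweeping the cutoff from the highest bin down, cumulative sums give the
    true/false positive counts at every possible cutoff at once.
    """
    tp = np.cumsum(pos[::-1])            # true positives above each cutoff
    fp = np.cumsum((tot - pos)[::-1])    # false positives above each cutoff
    tpr = tp / tp[-1]                    # recall / true positive rate
    fpr = fp / fp[-1]                    # false positive rate
    precision = np.where(tp + fp > 0, tp / np.maximum(tp + fp, 1), 1.0)
    auroc = np.trapz(tpr, fpr)           # area under the ROC curve
    auprc = np.trapz(precision, tpr)     # approx. area under the PR curve
    return auroc, auprc
```

For a perfectly separated toy example, both areas come out to 1.0, and lowering input_digits simply coarsens the cutoff grid:

```python
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])
pos, tot = score_record_sketch(y_true, y_pred, input_digits=2)
auroc, auprc = calculate_auc_sketch(pos, tot)  # both 1.0 here
```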

You can also check out Table_S4 and Table_S5, which list Leopard's performance at multiple false positive rate (FPR) and recall cutoffs.

Thanks, Hongyang

Al-Murphy commented 3 years ago

That makes sense, thanks!