evalclass / precrec

An R library for accurate and fast calculations of Precision-Recall and ROC curves
https://evalclass.github.io/precrec
GNU General Public License v3.0

How to draw AUC/ROC curve from True Positive Rate and False Positive Rate #15

Closed: mostafiz67 closed this issue 3 years ago

mostafiz67 commented 3 years ago

When I read the manual for the package, I could not figure out how to use it with my dataset.

I have two models, namely 2 and 3, and 10 test datasets. I applied different thresholds for each model and each test dataset (8 thresholds per test dataset), and I also calculated the true positive rate, false positive rate, etc. for each test dataset.

Now, is it possible to draw ROC curves and compute AUC values from my result dataset with this package? Does the package specifically need scores and labels to draw the curves?

Sample Dataset: here, Model: SP_length, Test Dataset Number: Test_dataset, Threshold: Prediction_Threshold, True Positive Rate: TPR_All, and False Positive Rate: FPR_All.

structure(list(SP_length = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), Test_dataset = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L), Prediction_Threshold = c(1.01590126290632, 1.11590126290632, 
1.21590126290632, 1.31590126290632, 1.41590126290632, 1.51590126290632, 
1.61590126290632, 1.71590126290632, 1.73978185992124, 1.83978185992124, 
1.93978185992124, 2.03978185992124, 2.13978185992124, 2.23978185992124, 
2.33978185992124, 1.01590126290632, 1.11590126290632, 1.21590126290632, 
1.31590126290632, 1.41590126290632, 1.51590126290632, 1.61590126290632, 
1.71590126290632, 1.81590126290632, 1.80215326487164, 1.90215326487164, 
2.00215326487164, 2.10215326487164, 2.20215326487164, 2.30215326487164, 
2.40215326487164, 1.01590126290632, 1.11590126290632, 1.21590126290632, 
1.31590126290632, 1.41590126290632, 1.51590126290632, 1.61590126290632, 
1.71590126290632, 1.81590126290632, 1.91590126290632, 1.73978185992124, 
1.83978185992124, 1.93978185992124, 2.03978185992124, 2.13978185992124, 
2.23978185992124, 2.33978185992124, 2.43978185992124, 2.53978185992124
), TPR_All = c(1, 1, 0.916372202591284, 0.273262661955241, 0.113074204946996, 
0.0577149587750294, 0.0188457008244994, 0.00471142520612485, 
1, 0.555555555555556, 0.333333333333333, 0.222222222222222, 0.111111111111111, 
0.111111111111111, 0, 1, 1, 0.910377358490566, 0.274764150943396, 
0.108490566037736, 0.0577830188679245, 0.0188679245283019, 0.00943396226415094, 
0.00117924528301887, 1, 0.444444444444444, 0.333333333333333, 
0.111111111111111, 0, 0, 0, 1, 1, 0.895610913404508, 0.230130486358244, 
0.107947805456702, 0.0557532621589561, 0.0166073546856465, 0.0118623962040332, 
0.00474495848161329, 0.00118623962040332, 1, 0.8, 0.5, 0.5, 0.3, 
0.2, 0.2, 0.2, 0.1), FPR_All = c(1, 0.999260901699926, 0.920177383592018, 
0.212860310421286, 0.0307957625030796, 0.00394185760039419, 0, 
0, 1, 0.871914609739827, 0.244162775183456, 0.0907271514342895, 
0.0433622414943296, 0.00733822548365577, 0.00333555703802535, 
1, 0.999266503667482, 0.896332518337408, 0.211735941320293, 0.0371638141809291, 
0.0039119804400978, 0, 0, 0, 1, 0.42235609103079, 0.171352074966533, 
0.0796519410977242, 0.0307898259705489, 0.0100401606425703, 0.00267737617135207, 
1, 0.99927728258251, 0.90966032281378, 0.215851602023609, 0.0298723199229101, 
0.00433630450493857, 0, 0, 0, 0, 1, 0.880108991825613, 0.335149863760218, 
0.0831062670299728, 0.0333787465940054, 0.0143051771117166, 0.00136239782016349, 
0, 0)), row.names = c(NA, 50L), class = "data.frame")

Thank you.

takayasaito commented 3 years ago

You need scores and labels to use precrec so that the library can prevent incorrect ROC and PRC calculations. Since you have already calculated the TPRs and FPRs yourself, you can still create ROC curves and calculate AUC scores directly. For example, you can use the ggplot2 package to create ROC plots and rollmean from the zoo package to calculate AUCs.
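
(For reference, the standard precrec workflow with scores and labels looks roughly like the sketch below; the scores and labels here are randomly generated just to show the calls.)

library(precrec)

# Made-up scores and binary labels (1 = positive, 0 = negative)
set.seed(123)
scores <- rnorm(200)
labels <- sample(c(0, 1), 200, replace = TRUE)

sscurves <- evalmod(scores = scores, labels = labels)
auc(sscurves)        # AUCs for both ROC and Precision-Recall curves
# autoplot(sscurves) # plots both curves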

Although the ggplot2/zoo example below is quite different from how precrec calculates curves and AUCs, I hope it gives you the basic idea.

library(tibble)
library(dplyr)
library(ggplot2)
library(zoo)

# I copied your data frame to "df" first.
# df <- tibble(SP_length = c(2L, ....

# Add start points and two new columns (model & line ID)
# N.B. All ROC curves must include the origin (0, 0).
df <- bind_rows(df,
                tibble(SP_length = c(2L, 3L, 2L, 3L, 2L, 3L),
                       Test_dataset = c(1L, 1L, 2L, 2L, 3L, 3L),
                       Prediction_Threshold = c(0, 0, 0, 0, 0, 0),
                       TPR_All = c(0, 0, 0, 0, 0, 0),
                       FPR_All = c(0, 0, 0, 0, 0, 0))) %>%
  arrange(Test_dataset, SP_length, desc(TPR_All), desc(FPR_All)) %>%
  mutate(model = factor(SP_length),
         line_id = factor(SP_length * 10 + Test_dataset))

# ggplot
p1 <- ggplot(df, aes(x = FPR_All, y = TPR_All,
                     group = line_id, color = model)) +
  geom_line()
print(p1)

# ggplot with three grid cells
p2 <- ggplot(df, aes(x = FPR_All, y = TPR_All,
                     group = line_id, color = model)) +
  geom_line() +
  facet_grid(cols = vars(Test_dataset))
print(p2)

# AUC - calculate multiple trapezium areas
aucs <- df %>%
  arrange(Test_dataset, SP_length, TPR_All, FPR_All) %>%
  group_by(model, Test_dataset) %>%
  summarise(auc = sum(diff(FPR_All) * rollmean(TPR_All, 2))) %>%
  ungroup() %>%
  arrange(Test_dataset, model)
print(aucs)
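
The summarise() step above is the trapezoidal rule: each term of diff(FPR_All) * rollmean(TPR_All, 2) is the area of one trapezium between two adjacent ROC points. A tiny self-contained check with made-up values:

library(zoo)

# Toy ROC points, sorted by increasing FPR (values made up for illustration)
fpr <- c(0, 0.1, 0.4, 1)
tpr <- c(0, 0.6, 0.8, 1)

# Width of each segment times the average height of its two end points
sum(diff(fpr) * rollmean(tpr, 2))
# [1] 0.78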

mostafiz67 commented 3 years ago

@takayasaito Thank you very much for your kind response and suggestions. However, I was using the code below to draw the ROC/AUC curves.

ggplot(df, mapping = aes(x = FPR_All, y = TPR_All, color = method)) +
  geom_line(show.legend = FALSE) +
  facet_grid(method ~ Test_dataset,
             labeller = labeller(Test_dataset = function(x)paste0("Test Dataset ",x),
                                 method = function(x)paste0("Method ",x))) + 
  ggtitle("AUC Curve for Neighbor Based (Dataset 1: Disjoint)") +
  theme(plot.title = element_text(hjust = 0.5))

Now, is there any way to show the area under the curve as a percentage in my figure? I just want to show the % AUC in my plot, something like the example image attached to this comment.

takayasaito commented 3 years ago

You can simply use geom_text or geom_label with an additional data frame.

# AUC
aucs <- df %>%
  arrange(Test_dataset, SP_length, TPR_All, FPR_All) %>%
  group_by(SP_length, Test_dataset) %>%
  summarise(auc = sum(diff(FPR_All) * rollmean(TPR_All, 2))) %>%
  ungroup() %>%
  arrange(Test_dataset, SP_length) %>%
  mutate(x = 0.7,
         y = 0.25,
         label = paste("AUC:", round(auc, 2))) %>%
  rename(method = SP_length)
print(aucs)

# ggplot with 6 grid cells
p3 <- ggplot(df %>% rename(method = SP_length),
             mapping = aes(x = FPR_All, y = TPR_All, color = method)) +
  geom_line(show.legend = FALSE) +
  geom_text(data = aucs, aes(x = x, y = y, label = label), color = "black") +
  facet_grid(
    method ~ Test_dataset,
    labeller = labeller(
      Test_dataset = function(x)
        paste0("Test Dataset ", x),
      method = function(x)
        paste0("Method ", x)
    )
  ) +
  ggtitle("AUC Curve for Neighbor Based (Dataset 1: Disjoint)") +
  theme(plot.title = element_text(hjust = 0.5))
print(p3)
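
If you would rather display the AUC as a percentage, a small variation on the mutate() step above (sketched here with base sprintf) is:

# Format the AUC as a percentage in the label column
aucs_pct <- aucs %>%
  mutate(label = sprintf("AUC: %.1f%%", 100 * auc))
# Then pass data = aucs_pct to geom_text() instead of data = aucs.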

mostafiz67 commented 3 years ago

Thank you very much.

mostafiz67 commented 3 years ago

@takayasaito I am extremely sorry to bother you again, but I have another dataset, and it does not contain an SP_length column. So I replaced SP_length with method in the code. I think I am getting wrong results, probably because of how I changed the code.

Code:

aucs <- df %>%
  arrange(Test_dataset, method, TPR_All, FPR_All) %>%
  group_by(method, Test_dataset) %>%
  summarise(auc = sum(diff(FPR_All) * rollmean(TPR_All, 2))) %>%
  ungroup() %>%
  arrange(Test_dataset, method) %>%
  mutate(x = 0.7,
         y = 0.25,
         label = paste("AUC:", round(auc, 2))) %>%
  rename(method_1 = method)
print(aucs)

The output I am getting:

   method_1 Test_dataset      auc     x     y label    
   <chr>           <int>    <dbl> <dbl> <dbl> <chr>    
 1 AA                  1 0.00582    0.7  0.25 AUC: 0.01
 2 CN                  1 0.0108     0.7  0.25 AUC: 0.01
 3 Dice                1 0.0293     0.7  0.25 AUC: 0.03
 4 JAC                 1 0.0241     0.7  0.25 AUC: 0.02
 5 L3                  1 0.000610   0.7  0.25 AUC: 0   
 6 RA                  1 0.000140   0.7  0.25 AUC: 0   
 7 AA                  2 0.00960    0.7  0.25 AUC: 0.01
 8 CN                  2 0.0104     0.7  0.25 AUC: 0.01
 9 Dice                2 0.0287     0.7  0.25 AUC: 0.03
10 JAC                 2 0.0242     0.7  0.25 AUC: 0.02

But my actual results suggest that I should get higher AUC values for the Dice and JAC methods (see the attached figure).

Sample Data

structure(list(method = c("CN", "CN", "CN", "CN", "CN", "CN", 
"CN", "CN", "CN", "CN", "AA", "AA", "AA", "AA", "AA", "AA", "AA", 
"AA", "AA", "AA", "JAC", "JAC", "JAC", "JAC", "JAC", "JAC", "JAC", 
"JAC", "JAC", "JAC", "L3", "L3", "L3", "L3", "L3", "L3", "L3", 
"L3", "L3", "L3", "Dice", "Dice", "Dice", "Dice", "Dice", "Dice", 
"Dice", "Dice", "Dice", "Dice", "RA", "RA", "RA", "RA", "RA", 
"RA", "RA", "RA", "RA", "RA", "CN", "CN", "CN", "CN", "CN", "CN", 
"CN", "CN", "CN", "CN", "AA", "AA", "AA", "AA", "AA", "AA", "AA", 
"AA", "AA", "AA", "JAC", "JAC", "JAC", "JAC", "JAC", "JAC", "JAC", 
"JAC", "JAC", "JAC", "L3", "L3", "L3", "L3", "L3", "L3", "L3", 
"L3", "L3", "L3", "Dice", "Dice", "Dice", "Dice", "Dice", "Dice", 
"Dice", "Dice", "Dice", "Dice", "RA", "RA", "RA", "RA", "RA", 
"RA", "RA", "RA", "RA", "RA"), Prediction_Threshold = c(0.1, 
0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 
0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 
0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 
0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 
0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 
0.6, 0.7, 0.8, 0.9, 1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 
0.9, 1), Test_dataset = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    TPR_All = c(0.878103837471783, 0.77765237020316, 0.669300225733634, 
    0.625282167042889, 0.58803611738149, 0.576749435665914, 0.573363431151242, 
    0.423250564334086, 0.0756207674943567, 0.00112866817155756, 
    0.866817155756208, 0.712189616252822, 0.628668171557562, 
    0.589164785553047, 0.553047404063205, 0.108352144469526, 
    0.00225733634311512, 0.00225733634311512, 0.00225733634311512, 
    0.00112866817155756, 0.957110609480813, 0.920993227990971, 
    0.851015801354402, 0.785553047404063, 0.715575620767494, 
    0.644469525959368, 0.534988713318284, 0.13431151241535, 0.0090293453724605, 
    0.00790067720090293, 0.302483069977427, 0.0733634311512415, 
    0.0372460496613995, 0.0293453724604966, 0.0191873589164786, 
    0.00790067720090293, 0.00564334085778781, 0.00564334085778781, 
    0.00112866817155756, 0, 0.978555304740406, 0.948081264108352, 
    0.930022573363431, 0.891647855530474, 0.836343115124153, 
    0.760722347629797, 0.68510158013544, 0.591422121896163, 0.072234762979684, 
    0.00790067720090293, 0.734762979683973, 0.0248306997742664, 
    0.00790067720090293, 0.00564334085778781, 0.00451467268623025, 
    0.00225733634311512, 0.00225733634311512, 0.00225733634311512, 
    0.00225733634311512, 0.00112866817155756, 0.889887640449438, 
    0.775280898876405, 0.687640449438202, 0.61685393258427, 0.57752808988764, 
    0.560674157303371, 0.556179775280899, 0.546067415730337, 
    0.18314606741573, 0.00224719101123596, 0.90561797752809, 
    0.80561797752809, 0.68876404494382, 0.624719101123596, 0.585393258426966, 
    0.569662921348315, 0.543820224719101, 0.132584269662921, 
    0.00112359550561798, 0.00112359550561798, 0.966292134831461, 
    0.931460674157303, 0.865168539325843, 0.798876404494382, 
    0.719101123595506, 0.637078651685393, 0.543820224719101, 
    0.133707865168539, 0.00561797752808989, 0.00561797752808989, 
    0.331460674157303, 0.0707865168539326, 0.0438202247191011, 
    0.0292134831460674, 0.0146067415730337, 0.00786516853932584, 
    0.00337078651685393, 0.00112359550561798, 0, 0, 0.979775280898876, 
    0.961797752808989, 0.935955056179775, 0.902247191011236, 
    0.857303370786517, 0.773033707865169, 0.680898876404494, 
    0.59438202247191, 0.0584269662921348, 0.00561797752808989, 
    0.9, 0.719101123595506, 0.0617977528089888, 0.0247191011235955, 
    0.0112359550561798, 0.00561797752808989, 0.00337078651685393, 
    0.00224719101123596, 0.00112359550561798, 0.00112359550561798
    ), FPR_All = c(0.0133403448562177, 0.00259241832959693, 0.000156836696096611, 
    0, 0, 0, 0, 0, 0, 0, 0.00743590453258052, 0.000424381648261419, 
    9.22568800568302e-06, 0, 0, 0, 0, 0, 0, 0, 0.0288395007057651, 
    0.0202227081084572, 0.0127037723838255, 0.00748203297260893, 
    0.00341350456210272, 0.00122701650475584, 0.000424381648261419, 
    0.000258319264159125, 0.000202965136125027, 0.000175288072107977, 
    0.00414233391455168, 0.0010886311846706, 0.000765732104471691, 
    0.000405930272250053, 0.000221416512136393, 0.000138385320085245, 
    7.38055040454642e-05, 2.76770640170491e-05, 1.8451376011366e-05, 
    9.22568800568302e-06, 0.0342457538770954, 0.0278154493371343, 
    0.0216434640613324, 0.0165139815301726, 0.0113199191829731, 
    0.00575682931554621, 0.00204810273726163, 0.000636572472392129, 
    0.000221416512136393, 0.000175288072107977, 0.000369027520227321, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0128430762010487, 0.00236464228016498, 
    0.000198557748716144, 0, 0, 0, 0, 0, 0, 0, 0.0115163494255363, 
    0.00228341411023565, 0.000117329578786812, 0, 0, 0, 0, 0, 
    0, 0, 0.0285471890540528, 0.0197384452928276, 0.0122654536593291, 
    0.00708490148828058, 0.00357403947689059, 0.00122744790115434, 
    0.000469318315147249, 0.000261735214216735, 0.000171481692073033, 
    0.000153430987644293, 0.00391700286103665, 0.00106499156129568, 
    0.000631774655005912, 0.000297836623074215, 0.000180507044287403, 
    0.000108304226572442, 6.31774655005912e-05, 9.02535221437017e-06, 
    9.02535221437017e-06, 9.02535221437017e-06, 0.0332223215010966, 
    0.027355842561756, 0.0212908058736992, 0.0160831776460076, 
    0.0107672451917436, 0.00572207330391069, 0.00208485636151951, 
    0.000667876063863392, 0.000198557748716144, 0.000153430987644293, 
    0.0040523831442522, 0.000144405635429923, 9.02535221437017e-06, 
    0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 120L), class = "data.frame")

takayasaito commented 3 years ago

I don't know how you have calculated your TPRs and FPRs, but they are insufficient. You need one more lower-bound and one more upper-bound threshold value. Alternatively, you can manually add the points (FPR, TPR) = (0, 0) and (1, 1) to your data frame in R. If you need to calculate accurate model performance metrics, it is easier to use a library such as precrec.
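
For instance, a minimal sketch (assuming your second data frame is stored in df) that appends those two end points for every method and test dataset combination:

library(dplyr)

# One row per method / test dataset combination
combos <- distinct(df, method, Test_dataset)

# Append the ROC end points (FPR, TPR) = (0, 0) and (1, 1);
# the threshold values are only placeholders (NA)
df <- bind_rows(
  df,
  mutate(combos, Prediction_Threshold = NA_real_, TPR_All = 0, FPR_All = 0),
  mutate(combos, Prediction_Threshold = NA_real_, TPR_All = 1, FPR_All = 1)
)

# The same rollmean()-based AUC calculation as before should then
# cover the full range of each curve.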

Hope this helps.