Error in how area under the precision recall curve is calculated

Our current benchmarking code calculates AUPRc using sklearn and was worked on earlier this year https://github.com/MiraldiLab/maxATAC/issues/15#issue-842231427 . The fix was supposed to modify the sklearn AUPRc output so that AUPRc did not include the artificial point at recall=0 and precision=1 that have no corresponding threshold.

The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

We added the line below to remove the artificial point during AUPRc calculations.

https://github.com/MiraldiLab/maxATAC/blob/4bf2ae67d709237e015d6b22c3ad99e395d4b082/maxatac/utilities/benchmarking_tools.py#L251

However, when we write the precision, recall, and threshold values to a data frame we insert a 0 (in error) to account for the extra point which has no threshold. We should be removing the artificial points instead of inserting a 0.

https://github.com/MiraldiLab/maxATAC/blob/4bf2ae67d709237e015d6b22c3ad99e395d4b082/maxatac/utilities/benchmarking_tools.py#L247

I used the average training ChIP-seq signal to predict CTCF binding in GM12878 @ 1,000 bp resolution. There are 12 training cell types so at most a bin could have a score of 12. Bins with a 12 would have a threshold of 1. The worst case would be a 0 threshold.

Below is the dataframe that is generated using the incorrect method described above:

Precision	Recall	Threshold
0.021441	1.000000	0.000000
0.302754	0.954042	0.000000
0.436440	0.918547	0.090909
0.521597	0.886951	0.181818
0.595208	0.856176	0.272727
0.659878	0.824374	0.363636
0.724616	0.793599	0.454545
0.786655	0.759540	0.545455
0.841336	0.708248	0.636364
0.896601	0.644030	0.727273
0.940212	0.564629	0.818182
0.971072	0.399467	0.909091
1.000000	0.000000	1.000000

In the data frame above, the AUPRC was calculated correctly, but now the curve has the artificial point created by sklearn and the threshold is 1 instead of NaN. The row should be removed. When the 0 was added, it also shifted the thresholds up one position. The colored lineplots are therefore incorrect. The correct data frame should look like this:

Precision	Recall	Threshold
0.021441	1.000000	0.000000
0.302754	0.954042	0.090909
0.436440	0.918547	0.181818
0.521597	0.886951	0.272727
0.595208	0.856176	0.363636
0.659878	0.824374	0.454545
0.724616	0.793599	0.545455
0.786655	0.759540	0.636364
0.841336	0.708248	0.727273
0.896601	0.644030	0.818182
0.940212	0.564629	0.909091
0.971072	0.399467	1.000000

There are also other problems with larger bin sizes having a large recall at the first threshold. This is because many 1,000 bp bins have the same score. In this case, the AUC would only be calculated for the area directly under the curve that is visualized. We have in the past extended a line from the first point to directly to the y axis.

For example this is the current output: 27b52a2f-d855-4967-99c7-3f5aadd099bf

This curve is how we have corrected it in the past, but is not currently implemented: 221296d7-0c5e-4680-94e4-f2acfbe8bcac

We have also talked about not including the 100 percent recall point for average predictions and motif scanning because it inflates the AUPRc calculation by simply setting every position to 0. The point where recall = 1 corresponds to a threshold of0. That means that any position along the genome array that is greater than or equal to0` is used to predict TF binding. When we use motif scanning and the average signal predictions there is a straight line that is drawn from the last threshold P/R point and 100 recall with a threshold of 0 because there is low overall recall. This makes the model appear as if they are performing better. Example below:

c0c3cc66-de58-44ec-8fbb-6277ca982709

I will keep posting updates as I explore the AUPRc code.

MiraldiLab / maxATAC

Error in how area under the precision recall curve is calculated #54