dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

max rank outside comprehensive list #929

Closed silil closed 11 months ago

silil commented 1 year ago

Problem

Rankers were taking a long time to compute. While normalizing the ranks, the code calculated the max rank of the ranks list inside a list comprehension. This means that for each element of the list we were recalculating the max, which is a constant value! This leads to expensive, repeated operations that can take a long time to compute when working with big matrices.

Solution

By the point at which we were calculating the max rank, we already have a sorted list. The solution was to move the max operation out of the list comprehension: get the last element of the sorted list, assign it to a variable, and then use that variable within the list comprehension. This reduces the time the ranker takes.
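To illustrate the change, here is a minimal sketch and not the actual triage code. The function names and the plain-list input are assumptions made for the example; the only thing taken from the description above is the idea of replacing the per-element `max` call with the last element of an already sorted list.

```python
def normalize_ranks_slow(sorted_ranks):
    # Hypothetical "before": max() is re-evaluated for every element,
    # so the whole normalization costs O(n^2) instead of O(n).
    return [rank / max(sorted_ranks) for rank in sorted_ranks]


def normalize_ranks_fast(sorted_ranks):
    # Hypothetical "after": the list is assumed to be sorted in
    # ascending order, so its last element is the max rank.
    if not sorted_ranks:
        # Guard added for the sketch: indexing an empty list would raise.
        return []
    max_rank = sorted_ranks[-1]
    return [rank / max_rank for rank in sorted_ranks]


ranks = [1, 2, 2, 4, 5]  # ascending ranks, with one tie
assert normalize_ranks_slow(ranks) == normalize_ranks_fast(ranks)
```

Both versions perform the same division per element, so the results are identical. Only the number of times the max is computed changes.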

Tests

For a matrix of shape (615447, 4768), the code "as is" took 95.79 min; with the change in this request it took 0.0019 sec.

Added 3 new test functions to the pytest suite that check that the numpy arrays generated by the change are equal to the numpy arrays of scores generated by the previous logic, which is now implemented in a predict_proba_deprecated function within the test. The tests are specific to no ties, half ties, and some ties in the ranking.
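The shape of such an equivalence check could look roughly like the sketch below. It is not the test code from this request: plain lists stand in for the numpy arrays, every name is hypothetical, and the three score sets are only one guess at what "no ties", "half ties" and "some ties" might look like.

```python
def ranks_from_scores(scores):
    # Hypothetical helper: ascending ranks where tied scores share the
    # position of their first occurrence in the sorted order.
    ordered = sorted(scores)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in ordered]


def normalize_deprecated(sorted_ranks):
    # Previous logic (assumed): max() recomputed for each element.
    return [rank / max(sorted_ranks) for rank in sorted_ranks]


def normalize_current(sorted_ranks):
    # New logic (assumed): max taken once from the end of the sorted list.
    max_rank = sorted_ranks[-1]
    return [rank / max_rank for rank in sorted_ranks]


# Illustrative inputs only; the real tests may define the cases differently.
CASES = {
    "no_ties": [0.1, 0.4, 0.2, 0.9],
    "half_ties": [0.5, 0.5, 0.1, 0.8],
    "some_ties": [0.3, 0.3, 0.7, 0.9, 0.1, 0.6],
}


def check_all_cases():
    # Old and new logic must agree on every scenario.
    for name, scores in CASES.items():
        ranks = ranks_from_scores(scores)
        assert normalize_deprecated(ranks) == normalize_current(ranks), name
    return True


check_all_cases()
```

In a pytest suite, each scenario would more likely be its own test function or a parametrized case, with an array comparison in place of the list equality used here.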