[WIP] Improve metrics in DetectionTask.

Specifically, report precision/recall in addition to F1, and support reporting the F1 at multiple score thresholds.

We really need to figure out how to log a dict returned by a single metric, though. Right now it is convoluted (it requires creating many classes) and inefficient (each class basically repeats the same computation and just returns a different metric).
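To illustrate what a single dict-returning metric could look like, here is a minimal sketch in plain Python. It is not the code in this PR and does not use rslearn's API: the function name `detection_metrics`, its arguments, and the `name@threshold` key format are all hypothetical. It also assumes predictions have already been matched against ground truth, so each one carries a score and a true-positive flag.

```python
from typing import Dict, Sequence, Tuple


def detection_metrics(
    predictions: Sequence[Tuple[float, bool]],
    num_gt: int,
    thresholds: Sequence[float],
) -> Dict[str, float]:
    """Compute precision, recall, and F1 at each score threshold.

    Hypothetical helper for illustration only.
    predictions: (score, is_true_positive) pairs, assumed pre-matched.
    num_gt: total number of ground-truth objects.
    """
    results: Dict[str, float] = {}
    for t in thresholds:
        # Keep only predictions at or above this score threshold.
        kept = [is_tp for score, is_tp in predictions if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_gt - tp

        # Guard against division by zero when nothing is predicted or present.
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom > 0 else 0.0

        # One shared computation feeds all three metrics for this threshold.
        results[f"precision@{t}"] = precision
        results[f"recall@{t}"] = recall
        results[f"f1@{t}"] = f1
    return results


metrics = detection_metrics(
    predictions=[(0.9, True), (0.8, False), (0.6, True), (0.3, True)],
    num_gt=4,
    thresholds=[0.5, 0.85],
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With something like this, the counts are computed once per threshold and every metric is logged from the same dict, instead of one class per metric repeating the work.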

I think that can go in a later PR, but this one should have a test, so it is WIP for now.

allenai / rslearn

[WIP] Improve metrics in DetectionTask. #80