clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

SR point-wise evaluation with "stitcher" performance measurement #60

Open keighrim opened 1 month ago

keighrim commented 1 month ago

New Feature Summary

#55 added evaluation software for apps like SWT that uses only TimePoint annotations, but that evaluation can easily be expanded to evaluate the "stitcher" component that turns TP annotations into TimeFrame annotations. The idea is based on the fact that all existing stitcher implementations use "label remapping", exposed as a runtime param and recorded in the view metadata, which enables us to re-construct a point-wise but remapped label value list.
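
For reference, the remapper config is essentially a small dictionary recorded in the view metadata; the sketch below only illustrates its shape, and the raw labels shown are placeholders rather than the actual SWT label set.

```python
# Illustrative shape of a "label remapping" as recorded in view metadata;
# the actual raw labels and remapped values depend on the app and preset used.
label_map = {
    "B": "bars",     # raw TimePoint label -> remapped (TimeFrame-level) label
    "S": "slate",
    "I": "chyron",
    "C": "credits",
    # raw labels absent from the map are dropped by the stitcher, which is why
    # the remapped lists end up shorter than the originals
}
```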

So the idea is to update the eval.py file so that it:

  1. Reads TP annotations as usual, constructing a list of "raw" classification results; let's call it raw
  2. Reads the gold files as usual, constructing a list of gold labels; let's call it gold
  3. Grabs the "latest" view with TimeFrame annotations that carries the remapper config (map for the SWT built-in stitcher, labelMap for simple-stitcher)
  4. Uses the remapper to map raw and gold into secondary, remapped lists (these new lists should be shorter than the original ones, since not all of the raw/gold labels are remapped into the secondary (TF) labels); let's call them raw-remap and gold-remap respectively
  5. Iterates through the TF annotations, constructing a third list of stitched, remapped labels by following the pointers in the targets prop (which must point to TP annotations so the timepoints can be traced); let's call this list stitched
  6. Computes P/R/F between (a rough sketch of these comparisons follows this list):
    • raw vs. gold (this should already be there in the current eval.py)
    • raw-remap vs. gold-remap
    • stitched vs. gold-remap
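
A minimal sketch of how steps 4-6 could fit together, assuming the label lists and the remapper config have already been extracted from the MMIF and gold files (the function names here are illustrative, not the actual eval.py API):

```python
# Sketch only: how the three comparisons could be computed once the label lists
# (steps 1, 2, 5) and the remapper config (step 3) are in hand.
from sklearn.metrics import precision_recall_fscore_support


def remap(labels, label_map):
    """Apply the stitcher's label remapping; unmapped labels become None."""
    return [label_map.get(label) for label in labels]


def prf(pred, gold):
    """Macro P/R/F over positions where both prediction and gold survived remapping."""
    kept = [(p, g) for p, g in zip(pred, gold) if p is not None and g is not None]
    if not kept:
        return 0.0, 0.0, 0.0
    preds, golds = zip(*kept)
    return precision_recall_fscore_support(golds, preds, average='macro', zero_division=0)[:3]


def evaluate(raw, gold, stitched, label_map):
    # `stitched` is expected to be aligned to the same timepoints as `raw`/`gold`,
    # with None at timepoints not covered by any TimeFrame (traced via `targets`)
    raw_remap = remap(raw, label_map)      # step 4
    gold_remap = remap(gold, label_map)    # step 4
    return {
        'raw vs. gold': prf(raw, gold),
        'raw-remap vs. gold-remap': prf(raw_remap, gold_remap),
        'stitched vs. gold-remap': prf(stitched, gold_remap),
    }
```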

Related

Resolving this issue will also properly address https://github.com/clamsproject/app-swt-detection/issues/61

Alternatives

No response

Additional context

No response

marcverhagen commented 1 month ago

Some of this was done in 22bee5c (work in progress that I just pushed up to make it visible), but not quite in the way described above. One difference is that it uses the output of the updated process.py script from the annotations repository.

marcverhagen commented 1 month ago

Is there overlap here with https://github.com/clamsproject/aapb-evaluations/issues/43?

keighrim commented 1 month ago

The evaluation scheme here is based on point-wise evaluation, hence is not compatible with the "old" interval-level gold data (from around 2020), so I don't think this is a duplicate of #43. Eventually, I believe the proposed method will serve as an evaluation for stitcher components, largely independent of image classification model performance.

I think the effort most overlapping with this issue was evaluate.py[^1] in https://github.com/clamsproject/app-swt-detection/commit/5590a3e975afa9008687a8533aa167bebb35dc9f , but that file was

  1. never used to produce a public report or associated with any actual (archived) SWT MMIF output files
  2. written a fair while ago and the current status/compatibility is unknown

so I thought it would not be easy to verify whether the old code still works (plus, this repo is the designated home for evaluation code). That led me to open this new issue.

[^1]: which wasn't easy to dig out, since the most closely related issue - presumably https://github.com/clamsproject/app-swt-detection/issues/61 - was closed without mentioning the file, the commit, or an explicit PR merge.

keighrim commented 1 month ago

Since the existing timepoint evaluation and the proposed timeframe (stitcher) evaluation can be applied to any point-wise classification task, I think it would be more representative to rename the subdirectory to pointclassification_eval (or something like that).

kla7 commented 2 weeks ago

To analyze the stitcher's performance, I compared the evaluation scores from filtered (corresponding to raw-remap vs. gold-remap) and stitched (corresponding to stitched vs. gold-remap). To optimize the stitched scores, we needed to run a grid search on the stitcher. I compiled all results and analyzed the bar charts generated by a new see_results.py script, using @keighrim's gridsearch efforts with the following grid (a sketch of the sweep follows the grid):

minTFDuration = {1000, 2000, 3000, 4000, 5000}
minTPScore = {0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5}
minTFScore = {0.001, 0.01, 0.1, 0.5}
labelMapPreset = {swt-v4-4way, swt-v4-6way}
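
For context, the sweep amounts to enumerating the Cartesian product of the values above; the driver below is only a sketch under that assumption, with the actual stitching-and-scoring step left abstract:

```python
# Hypothetical grid-search driver over the stitcher parameters listed above;
# how a single config is stitched and scored is out of scope here.
from itertools import product

GRID = {
    'minTFDuration': [1000, 2000, 3000, 4000, 5000],
    'minTPScore': [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5],
    'minTFScore': [0.001, 0.01, 0.1, 0.5],
    'labelMapPreset': ['swt-v4-4way', 'swt-v4-6way'],
}


def sweep(run_one):
    """Yield (config, scores) for every combination; `run_one` is an assumed
    callable that re-stitches and evaluates a single configuration
    (e.g., returning per-label P/R/F)."""
    for values in product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        yield config, run_one(config)
```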

For bars, every configuration produced perfect 1.0 scores across all metrics (F1, Precision, Recall). Here is a sample chart from the output visualization for bars[^1]:

[bar chart: F1/Precision/Recall for bars across all stitcher configurations]

Across all labels (besides bars), I noticed that minTPScore tended to produce higher scores when its value is higher. Because of this, I focused on the subplot where minTPScore = 0.5 so that I could observe differences between the other three parameters.

Aside from higher minTPScore values, higher minTFDuration values also result in better scores. This makes sense because both of these parameters determine how many TPs and TFs are allowed to be included at all; higher values for each mean that TPs with lower scores and TFs with shorter durations are excluded, leading to a better chance of success.
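
Put differently, my reading of these two thresholds is roughly the following (this is only an interpretation, not the stitcher's actual code):

```python
# Interpretation of the two thresholds discussed above; not the stitcher's actual code.
def keep_timepoint(tp_score: float, min_tp_score: float) -> bool:
    # a TP only participates in stitching if its classification score clears the bar
    return tp_score >= min_tp_score


def keep_timeframe(duration_ms: int, tf_score: float,
                   min_tf_duration: int, min_tf_score: float) -> bool:
    # a stitched TF is only emitted if it is long enough and scores well enough
    return duration_ms >= min_tf_duration and tf_score >= min_tf_score
```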

labelMapPreset=swt-v4-6way tends to result in lower scores than when labelMapPreset=swt-v4-4way, which makes sense because TFs might be misrecognized as other_opening or other_text when these additional labels are available as options to choose from, creating more room for error.

When minTFScore is 0.001, 0.01, or 0.1 and minTFDuration and labelMapPreset are fixed, the results for corresponding minTPScore values are identical. minTFScore=0.5 tends to result in slightly higher scores than the smaller minTFScore values.

credits sees the most improvement when comparing the filtered results to the stitched results. In particular, Recall scores were already fairly high to begin with, but there is an increase of at least 0.02 across all configs. When only considering results where labelMapPreset=swt-v4-4way, F1 scores see an increase of around 0.13 across all configs and Precision sees an increase of 0.2 across all configs. Here is a sample chart from the output visualization for credits, where the metrics are the highest among all configurations[^2]:

[bar chart: F1/Precision/Recall for credits across stitcher configurations]

From all of my observations, I have concluded that the best scores result from the following configuration:

minTFDuration = 5000
minTPScore = 0.5
minTFScore = 0.5
labelMapPreset = swt-v4-4way

[^1]: Please note that F, P, and R correspond to the average F1, Precision, and Recall scores retrieved from the evaluation script, aggregating scores across all GUIDs in the given dataset; they are independent of each other.

[^2]: When observing the ideal configuration highlighted at the bottom of this comment, the Precision score is actually slightly lower when minTFDuration=4000 for credits, but the difference is practically negligible (0.002).

keighrim commented 1 week ago

For future reference, here are the "result" files used for this gridsearch/evaluation: swt6.1-stitcher3.0-results.zip

keighrim commented 1 week ago

A few follow-up questions:

And future directions: