clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

evaluation of stitcher #61

Closed keighrim closed 9 months ago

keighrim commented 9 months ago

Because

We want to see if the stitcher/smoothing added via https://github.com/clamsproject/app-swt-detection/issues/33 is doing well, independently of the accuracy of the image-level classification model.

Done when

A controlled evaluation is done to measure the effectiveness of the stitcher. At a high level, the evaluation should measure the performance difference between the raw image classification results and the image classification results reconstructed from TimeFrame annotations.

Additional context

The original idea for this evaluation was proposed by @owencking in his email on 12/15/2023. Here's an excerpt from it.


Suppose we have a set of still frames F from a full video. And suppose we have human/gold labels of all those frames. Suppose we have a CV model M and a stitching algorithm S. Then a full SWT app is the composite M+S and can take a video as input.

Generate two sets of image classification predictions for all the frames in F:

  • The first prediction set P1 just uses M (choosing the label with max score across the output categories) to predict one label for each frame.
  • The second prediction set P2 is generated by using M+S on the original video to generate time-based annotations, then producing a label prediction for each frame in F according to the label in the annotation for the time period containing the frame.

Compare both P1 and P2 to the gold labels of the frames in F. Then we can evaluate how good S is according to how much P2 improves over P1. This gives us a way of evaluating different stitching algorithms (somewhat) independently of the CV model. So it will allow us to tell whether performance improvements for time-based eval metrics are coming from the image classifier or the stitching algorithm.
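
To make that comparison concrete, here is a minimal sketch of how P1 and P2 could be scored against the gold labels; the dict layout, the toy label values, and the function names are illustrative assumptions, not part of the original proposal.

```python
# A toy sketch of the P1-vs-P2 scoring; `gold`, `p1`, `p2`, and the label
# values below are made-up examples, not real data from the evaluation.

def accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of gold-labeled frames whose prediction matches the gold label."""
    correct = sum(1 for frame, label in gold.items() if predictions.get(frame) == label)
    return correct / len(gold)

def stitcher_gain(p1: dict, p2: dict, gold: dict) -> float:
    """Improvement of the stitched predictions (P2) over the raw ones (P1)."""
    return accuracy(p2, gold) - accuracy(p1, gold)

# frame id -> label, toy values only
gold = {0: "slate", 30: "slate", 60: "chyron", 90: "credits"}
p1   = {0: "slate", 30: "chyron", 60: "chyron", 90: "other"}   # M alone
p2   = {0: "slate", 30: "slate",  60: "chyron", 90: "other"}   # M+S
print(stitcher_gain(p1, p2, gold))  # 0.25: one more frame correct with stitching
```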

keighrim commented 9 months ago

Given the new output format that's being discussed in https://github.com/clamsproject/app-swt-detection/issues/41, the evaluation plan is as follows:

  1. run SWT on an "unseen" video with dense annotation, using the same sample rate that was used in the annotation
    • the model deployed via #63 was trained with a fold size of 2, meaning only two videos (possibly one) are unseen to the model
    • hence we might need to re-train a model with a larger fold size so that we have more videos to evaluate against
  2. get the output MMIF with TimePoints and TimeFrames.
  3. iterate through the targets list and compare the frameType value of each TimeFrame with the label value of its target TimePoints; collect the pairs that differ
  4. using the timePoint value of the TimePoint annotations in the collected "disagreeing" pairs, look up the gold label, judge which of the two is correct, and count scores (1 for each correct)
  5. normalize the counted scores (somehow) and return them as the evaluation result (a sketch of steps 3-5 follows this list)
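
As a rough illustration of steps 3-5, here is a sketch of the comparison and scoring logic, assuming the TimeFrame/TimePoint pairs have already been pulled out of the MMIF into plain tuples; the tuple layout, the `gold_label_at` lookup, and the normalization are assumptions, not the actual evaluation code.

```python
from typing import Callable

def score_disagreements(
    pairs: list[tuple[int, str, str]],      # (timePoint in ms, TimeFrame frameType, TimePoint label)
    gold_label_at: Callable[[int], str],    # maps a time point to its gold label (assumed helper)
) -> dict:
    # step 3: keep only the pairs where the stitched and raw labels differ
    disagreements = [(t, tf, tp) for t, tf, tp in pairs if tf != tp]
    stitcher_wins = classifier_wins = 0
    # step 4: consult the gold label to decide which side of each disagreement is right
    for t, tf_label, tp_label in disagreements:
        gold = gold_label_at(t)
        if tf_label == gold:
            stitcher_wins += 1      # stitched TimeFrame label is correct
        elif tp_label == gold:
            classifier_wins += 1    # raw TimePoint label is correct
        # pairs where neither label matches gold fall into neither bucket
    # step 5: one possible normalization, the share of disagreements the stitcher gets right
    total = len(disagreements)
    return {
        "disagreements": total,
        "stitcher_wins": stitcher_wins,
        "classifier_wins": classifier_wins,
        "stitcher_win_rate": stitcher_wins / total if total else 0.0,
    }
```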
marcverhagen commented 9 months ago

Done pretty much as described above, with one difference: mimicking the sample rate was impossible, since the app at the moment only accepts milliseconds while the rate used for the annotation was specified as some number of frames, I think. So for each annotation I just used a frame that was within at most 32 ms.
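
For reference, a minimal sketch of that frame-to-millisecond alignment might look like the following; the 29.97 fps value and the `predictions_at_ms` mapping are illustrative assumptions, not details taken from the actual run.

```python
FPS = 29.97          # assumed frame rate of the annotated video
TOLERANCE_MS = 32    # maximum allowed distance between a gold frame and a prediction

def frame_to_ms(frame_number: int, fps: float = FPS) -> float:
    """Convert an annotation's frame number to a millisecond timestamp."""
    return frame_number * 1000.0 / fps

def nearest_prediction(frame_number: int, predictions_at_ms: dict[int, str]) -> str | None:
    """Pick the predicted label whose time point is closest to the gold frame,
    but only if it falls within the 32 ms tolerance."""
    target = frame_to_ms(frame_number)
    best_ms = min(predictions_at_ms, key=lambda ms: abs(ms - target))
    return predictions_at_ms[best_ms] if abs(best_ms - target) <= TOLERANCE_MS else None
```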