amirmk89 / gepc

Graph Embedded Pose Clustering for Anomaly Detection

Evaluation Questions #10

Open johannathiemich opened 3 years ago

johannathiemich commented 3 years ago

Hello,

First of all, thanks for publishing your work. I am using this in my thesis and have run into some issues regarding your frame-level evaluation.

  1. In the method avg_scores_by_trans, you do scores_by_trans[k] = score_mask[k].reshape(-1, num_transform) and then take the mean over the second dimension. I assume you are trying to average over the number of transformations applied? My problem is: as far as I can tell, e.g. from the metadata array, the results for the different transformations are concatenated one after the other. With the reshape as written, you therefore average over 5 consecutive frames, not over the 5 transformations of the same frame. As a consequence, the score changes if the order of the samples is changed, which should not happen. Did you maybe mean scores_by_trans[k] = score_mask[k].reshape(num_transform, -1) and then averaging over the transformation axis (see the toy sketch after this list)?

  2. For your frame-level evaluation, you always take the maximum score in each frame. As far as I understand, however, the scores output by the model are normality scores, not abnormality scores: a high score indicates a normal movement and a low score indicates an abnormal sample, since it does not fit the model well, right? But if we want the anomaly score of a frame, shouldn't we consequently take the minimum score over all the people in the frame? By taking the maximum, we pick the score of the most normal person. Or is that actually the goal?

  3. Have you ever done a more detailed evaluation, e.g. on sample level or pixel level? I don't know if I am making a mistake in my evaluation, but I feel like the model is not very good at actually detecting the anomalies.

  4. In your postprocessing of the model scores, in the method scores_align, you apply a shift operation and Gaussian smoothing over the scores. What is the purpose of these operations?
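
To make point 1 concrete, here is a toy sketch of the two reshapes, assuming the block-wise concatenation I described above (5 transformations and 4 samples are only illustrative numbers):

```python
import numpy as np

num_transform, num_samples = 5, 4
# Assumed layout: all samples under transform 0, then all under transform 1, ...
# so scores[num_samples * t + s] is the score of sample s under transform t.
scores = np.arange(num_transform * num_samples, dtype=float)

wrong = scores.reshape(-1, num_transform).mean(axis=1)   # averages 5 consecutive samples
right = scores.reshape(num_transform, -1).mean(axis=0)   # averages the 5 transforms per sample

print(wrong)  # [ 2.  7. 12. 17.] -> each value mixes different samples
print(right)  # [ 8.  9. 10. 11.] -> one averaged score per sample
```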

I would very much appreciate a short answer, since this is important for my thesis. Thanks in advance!

marco-rudolph commented 3 years ago

I agree with points 1-3.

2) It's even more confusing: in the annotations provided here for the STC dataset, anomalous frames are labeled 1, not 0, even though, as you said, the scores reflect normality. So one would have to use negated scores or swap the labels. As it stands, the model seems to perform very well on this dataset even though it is evaluated exactly the wrong way around. Strangely, in a test of mine, the AUROC does not improve when evaluating the right way around. Also, it does not make sense to assign a score of 0 to frames containing no person when the scores are log-likelihoods.
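
Just to spell out what I mean by negating the scores, a minimal toy sketch (not the repo's evaluation code; the score distributions are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)        # 1 = anomalous frame, as in the STC annotations
# Toy normality scores: anomalies get lower (log-likelihood-like) scores.
scores = np.where(labels == 1,
                  rng.normal(-1.0, 1.0, 1000),
                  rng.normal(+1.0, 1.0, 1000))

auc_as_is = roc_auc_score(labels, scores)     # treats a high normality score as "more anomalous"
auc_negated = roc_auc_score(labels, -scores)  # negated normality score used as anomaly score
print(auc_as_is, auc_negated)                 # with continuous scores: auc_negated == 1 - auc_as_is
```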

I wonder if the same evaluation script was used for NTU.

4) I guess the shift operation is done so that the score relates to the center of the temporal window? I haven't checked in the code whether it actually works that way. The Gaussian smoothing is probably meant to make the scores a bit more robust and stable against noise. However, it is a bit strange that the smoothing is applied over all concatenated frames of the dataset rather than per clip, so it smooths across the boundaries of the clips.
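
What I would have expected is something like the following sketch (my own code, not the repo's; gaussian_filter1d and the sigma value are just stand-ins for whatever smoothing is actually used):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_concatenated(scores, sigma=5.0):
    # Smoothing the whole concatenation at once lets scores leak across clip boundaries.
    return gaussian_filter1d(scores, sigma)

def smooth_per_clip(scores, clip_lengths, sigma=5.0):
    # Smoothing each clip separately keeps the clip boundaries intact.
    out, start = np.empty_like(scores), 0
    for n in clip_lengths:
        out[start:start + n] = gaussian_filter1d(scores[start:start + n], sigma)
        start += n
    return out
```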

I would also appreciate an answer and thank you in advance.