lijuncheng16 opened this issue 2 years ago
Thanks for the message.
It’s not immediately obvious to me where the 5% number comes from, or how you know it’s due to differences in missing files. Can you walk me through it?
It strikes me that mAP is possibly more vulnerable than d-prime. We track d-prime and have been surprised how consistent the overall results have been as the data has eroded by >10% over time.
Thanks,
DAn.
On Wed, Apr 6, 2022 at 13:02 Billy @.***> wrote:
Hi, Dan and the other contributors: Thank you for maintaining the repo so far! AudioSet, to me, is a great resource, and still the best resource for understanding the nature of sound. We did a recent study: paper https://arxiv.org/abs/2203.13448 and code https://github.com/lijuncheng16/AudioTaggingDoneRight, where we found that recent research papers show a whopping ±5% difference in performance due to test-set files going missing at download time. In addition, differences in label quality also contribute to the performance variation and make comparisons less fair (see Figure 2 in our paper). I understand you have legal constraints around YouTube licensing, but I'd guess this issue could be easier for the original authors to address, either by advocating that the community use a common subset, or by releasing an updated test set (given you have already released updated strong labels)? Looking forward to your thoughts.
Hi, Dan: Thank you again for your prompt response! In our paper (https://arxiv.org/pdf/2203.13448.pdf), Table 1 shows that, due to differences in downloading AudioSet, the number of train and test clips varies by a whopping ±5% across previous works.
- e.g., AST (Gong et al., 2021) test-set size: 19,185 vs. ours: 20,123. |19185 - 20123| / 19185 ≈ 0.049, which is where the 5% comes from.
- Not to mention that ERANN's (Verbitskiy et al.) test size was only
- We used the exact AST training pipeline and saw a 2.5% mAP drop when evaluating on our test set vs. theirs.
In particular, the differing test-set sizes can cause severe fluctuations in the final reported mAP, as seen in Figure 2 of our paper: e.g., one could have downloaded the lower-label-quality test samples, which tanks the score, or one could test only on high-label-quality samples and report a higher mAP.
Oh, I misunderstood; I thought the 5% was a difference in the metric, not in the test-set size. Yes, we've seen variations of >10% in available dataset sizes, but there's not much we can do, since videos get taken down all the time; from the beginning we tried to choose videos with a lower chance of disappearing, but that wasn't very successful.
I wouldn't assume that differences in available videos are the main factor in result variation; there are many other things at play. We've had a very hard time matching published results, and even reproducing our own past results - sometimes it appears to come down to subtle changes in the underlying DNN package (across releases) or arithmetic differences between accelerator hardware.
It would be very interesting to measure this directly, e.g. delete dataset entries at random and see how that affects the resulting metric. If you’re only looking at the impact of changes in the evaluation set, that could be very quick, since you only need to apply the ablation in the final step before summarizing the results across all the eval set items.
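Something like the rough, untested sketch below is all I have in mind; `scores` and `labels` are just placeholders for whatever per-segment model outputs and ground-truth matrix you already have for the eval set:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def macro_map(scores, labels):
    """Balanced (macro) mAP: unweighted mean of per-class average precision."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(labels.shape[1])
           if labels[:, c].any()]  # skip classes with no positives left in this draw
    return float(np.mean(aps))

def deletion_ablation(scores, labels, proportions=(1.0, 0.9, 0.8, 0.7),
                      n_draws=100, seed=0):
    """Mean and SD of the metric over random subsets of the eval set."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    results = {}
    for p in proportions:
        vals = []
        for _ in range(n_draws):
            keep = rng.choice(n, size=int(round(p * n)), replace=False)
            vals.append(macro_map(scores[keep], labels[keep]))
        results[p] = (float(np.mean(vals)), float(np.std(vals)))
    return results

# scores: (num_eval_segments, 527) model outputs; labels: matching 0/1 matrix.
# for p, (m, s) in deletion_ablation(scores, labels).items():
#     print(f"proportion={p:.1f}  mAP={m:.4f} +/- {s:.4f}")
```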
DAn.
Hi, Dan: Thanks again for your reply. I have noticed there can be minor performance fluctuations due to PyTorch/TensorFlow or hardware changes, but those are not very significant. Yes, we have already done that ablation; it is shown in the paper mentioned above. Basically, the figure shows model performance at test time changing rapidly across test subsets drawn from different label-quality quantiles.
All the models listed there are SOTA results that I have reproduced/implemented. If you are interested, you can try running my pipeline here.
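For reference, the quantile evaluation itself is nothing fancy; a rough, untested sketch is below, assuming you already have per-clip score/label matrices and some per-clip label-quality estimate (here called `quality`, a stand-in for however you derive it in practice):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def map_by_quality_quantile(scores, labels, quality, n_quantiles=4):
    """mAP computed separately on each label-quality quantile of the eval set."""
    edges = np.quantile(quality, np.linspace(0, 1, n_quantiles + 1))
    bins = np.digitize(quality, edges[1:-1])  # 0 = lowest-quality quantile
    maps = []
    for q in range(n_quantiles):
        mask = bins == q
        aps = [average_precision_score(labels[mask, c], scores[mask, c])
               for c in range(labels.shape[1]) if labels[mask, c].any()]
        maps.append(float(np.mean(aps)))
    return maps  # one mAP per quantile, lowest label quality first
```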
Again, I really appreciate your correspondence here! I feel that trying to solve this on GitHub could be faster and perhaps easier than going for a chat at Interspeech or ICASSP; of course, I would love to do that as well if you are attending.
As an audio ML researcher, I have always felt AudioSet could be the ImageNet of the audio community, given the effort and resources you have already spent collecting it. Look at ImageNet: people spend time reporting 0.1% gains in Top-1 or Top-5 accuracy (sure, there is plenty of black-box/incremental research...), but the point is that ImageNet has become more and more established as the large-scale de facto standard for the vision community, and that community has benefited from it.
That is why I strongly feel that a fair comparison on AudioSet would be helpful and is actually a pressing task. Hopefully you see where I am coming from.
Subsetting classes is definitely going to have a large influence on the summary score, because there's such a wide spread in per-class performances (some classes are legitimately just more prone to confusion; some have scarce training data, although this seems to matter less than I expect). Here's a scatter of per-class mAP (for a basic resnet50 model) vs. the QA quality estimate across all 527 classes:
Your figure 2, showing that the average over subsets of these points (growing from the right, I guess) yields different overall averages, seems natural given such a wide spread.
In practice, of course, multi-label means you can't select all the positive samples for one class without also including positives for multiple other classes, but you could drastically alter the priors of different classes. Note, though, that this wouldn't actually help you "goose" your results unless there were some threshold below which classes with too few samples were excluded from the final balanced average. Absent that, reducing but not eliminating a class's samples would add noise to its contribution to the balanced average, but wouldn't weaken it, since the balanced average weights every class equally regardless of the number of eval samples it's based on.
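(In symbols: the balanced metric is (1/527) * sum over classes of AP_c, where each AP_c is estimated from however many eval items class c happens to retain; shrinking that count inflates the variance of the AP_c estimate, but its 1/527 weight never changes.)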
But the missing 1000+ segments in the smaller downloaded eval sets aren't going to be concentrated in a few classes. They should occur at random across all the segments, and so should impact all classes equally, in expectation.
I tested the impact of random deletions by taking multiple random subsets of the eval set with different amounts of deletion, then calculating the mean and SD of the metrics vs. the proportion of the eval set being selected. So, proportion = 1.0 is the full-set metric, and shows no variance because every draw has to be the same. As the proportion drops, we expect the variance to go up because the different draws can be increasingly unlike one another. Here's the result for d-prime (i.e., transformed AUC) which is our preferred within-class performance metric. The shaded region represents +/- 1 SD away from the mean, over 100 draws per proportion:
We see that the average across 100 draws is approximately constant across all proportions, but the spread grows for smaller sets. However, even for proportion=0.7 (30% deletion), it's still within about 0.008 of the full-set figure. I normally ignore differences in d-prime smaller than 0.02 or so (since we see variations on that scale just across different trainings or checkpoints), so the erosion of the dataset doesn't seem to be adding serious noise here.
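(For anyone following along, by d-prime I mean the usual equal-variance transform of the per-class ROC AUC, averaged over classes; a rough sketch, with `scores`/`labels` again standing in for per-segment outputs and ground truth:)

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def dprime_from_auc(auc):
    """Equal-variance Gaussian d-prime implied by a ROC AUC."""
    return np.sqrt(2.0) * norm.ppf(auc)

def macro_dprime(scores, labels):
    """Mean per-class d-prime; classes missing positives or negatives are skipped."""
    ds = [dprime_from_auc(roc_auc_score(labels[:, c], scores[:, c]))
          for c in range(labels.shape[1])
          if 0 < labels[:, c].sum() < labels.shape[0]]
    return float(np.mean(ds))
```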
Here's the same treatment for mAP:
Now the spread across random samples at 70% is around 0.004, so again not a huge effect in mAP, where changes below 0.01 aren't really worth paying too much attention to. However, the means at the different proportions appear to follow a definite trend, rather than being estimates of the same underlying value. This is not at all what I expected, and I can't explain it off-hand, but maybe it's another reason not to use mAP (the big reason being that mAP is conflated with the priors for each class in your particular eval set, whereas ROC-curve metrics normalize that out). But, even so, the bias due to the smaller set is only about another 0.005 at proportion = 0.7 (30% deletion).
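(If you want to convince yourself of that prior-dependence, a quick synthetic check like the untested sketch below does it: the same score distributions evaluated under different positive/negative mixes leave AUC essentially unchanged while AP moves around.)

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def fake_eval(n_pos, n_neg):
    """Fixed score distributions for positives and negatives, variable class mix."""
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),   # positives
                             rng.normal(0.0, 1.0, n_neg)])  # negatives
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return labels, scores

for n_pos, n_neg in [(500, 500), (100, 900), (20, 980)]:
    y, s = fake_eval(n_pos, n_neg)
    print(f"pos rate {n_pos / (n_pos + n_neg):.2f}: "
          f"AUC={roc_auc_score(y, s):.3f}  AP={average_precision_score(y, s):.3f}")
```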
So my belief is that random deletions from the eval set (which primarily occur because videos get deleted from YouTube) are not as serious a threat to metric repeatability as I feared at first. (In 2017, the sets were disappearing at ~1% per month, but that seems to have slowed down.) I hope these plots reassure you too. I think there must be a different factor causing the difference you saw in the AST results.
(For the curious, the comment I deleted was just alerting me that my first attempt to upload this discussion of eval-set erosion was missing its figures.)