deeplearning-wisc / dice

Code for ECCV 2022 paper "DICE: Leveraging Sparsification for Out-of-Distribution Detection"
MIT License

Some small issues with reproducing the published results #2

Closed ChristopherBrix closed 1 year ago

ChristopherBrix commented 1 year ago

Thank you for the interesting paper and for making your source code available!

I tried to replicate your results on the CIFAR-100 network and noticed some issues:

**No sparsity**

| OOD dataset | FPR | AUROC | AUIN |
|---|---|---|---|
| SVHN | 87.64 | 81.83 | 86.29 |
| LSUN | 14.83 | 97.43 | 97.62 |
| LSUN_resize | 75.52 | 77.76 | 79.35 |
| iSUN | 78.77 | 76.78 | 79.92 |
| dtd | 84.49 | 71.04 | 76.53 |
| places365 | 78.33 | 77.95 | 78.22 |
| AVG | 69.93 | 80.46 | 82.99 |

**90% sparsity**

| OOD dataset | FPR | AUROC | AUIN |
|---|---|---|---|
| SVHN | 59.20 | 88.60 | 90.35 |
| LSUN | 0.91 | 99.74 | 99.74 |
| LSUN_resize | 54.87 | 88.27 | 89.30 |
| iSUN | 52.35 | 88.53 | 90.12 |
| dtd | 61.42 | 77.13 | 79.36 |
| places365 | 80.36 | 77.09 | 77.48 |
| AVG | 51.52 | 86.56 | 87.73 |
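For context on what the sparsity setting changes: DICE masks the final linear layer so that only the highest-contribution weights (weight times mean activation on in-distribution data) are used for scoring. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name, the global percentile threshold, and all variable names are illustrative assumptions.

```python
import numpy as np

def dice_logits(features, W, b, mean_act, p=0.9):
    """Sketch of DICE-style directed sparsification for a final linear layer.

    features: (batch, num_features) penultimate-layer activations
    W, b:     (num_classes, num_features) weights and (num_classes,) bias
    mean_act: (num_features,) mean activation estimated on in-distribution data
    p:        sparsity level, e.g. 0.9 drops ~90% of weight contributions
    """
    # Contribution of each weight: w_ij scaled by the mean activation of unit j.
    contrib = W * mean_act
    # Keep only the top (1 - p) fraction of contributions; zero out the rest.
    thresh = np.percentile(contrib, p * 100)
    mask = (contrib >= thresh).astype(W.dtype)
    # Logits from the sparsified layer; an OOD score (e.g. energy) is computed on these.
    return features @ (W * mask).T + b
```

With `p=0.9` roughly 10% of the weight entries survive, which is the regime the second table above corresponds to.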

These scores are close to those you report in Table 9, but not identical. Is there some randomness involved? LSUN_resize in particular is off by quite a bit. Also, you state that you report standard deviations across 5 independent runs, but you only do so for DICE; why is that?

Edit: Also, could you add a license to your code, so we can build upon it in future work?

Edit2: I'm unable to replicate any rows in Table 9 other than MSP, Energy, and DICE. For ODIN and Mahalanobis, there is a flag I can set to enable the technique, but it requires some config values that I don't have. For the remaining methods, I don't know how to run them at all.

sunyiyou commented 1 year ago

Hi,

I just added the license. Here are some responses to your questions:

The 5 independent runs correspond to 5 independently trained models. This repo only provides minimal code to reproduce the main results.

For the Places365 image index and the other methods you are interested in, you can take a look at https://github.com/jfc43/informative-outlier-mining. We built this repo mainly on that code base.