decile-team / cords

Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude, using coresets and data selection.
https://cords.readthedocs.io/en/latest/
MIT License

Questions about accuracy logging #31

Closed · Janghyun1230 closed this issue 3 years ago

Janghyun1230 commented 3 years ago

Hello! Thanks for your great work.

I'm currently working with this code and want to ask a question about accuracy logging.

https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L530-L541

In line 541 of train.py, val_acc accumulates cumulative accuracies over input batches. For example, if the loader contains 4500 examples and the batch size is 1000, then tst_acc gets 5 entries per evaluation pass (the first element of tst_acc is the accuracy over only the first 1000 examples, the second over the first 2000, and so on).
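For concreteness, here is a minimal sketch of the pattern I mean (a hypothetical reconstruction for illustration, not the repository's exact code; `model` and `test_loader` are assumed to be a trained classifier and a test DataLoader):

```python
import torch

def evaluate(model, test_loader):
    """Sketch of the logging pattern around train.py#L530-L541
    (hypothetical paraphrase, not copied from the repository)."""
    tst_correct, tst_total, tst_acc = 0, 0, []
    model.eval()
    with torch.no_grad():
        for inputs, targets in test_loader:  # e.g. 5 batches of ~1000 examples
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            tst_total += targets.size(0)
            tst_correct += predicted.eq(targets).sum().item()
            # Appending inside the batch loop: each element is the accuracy
            # over all examples seen so far, not over the full test set.
            tst_acc.append(100.0 * tst_correct / tst_total)
    return tst_acc  # tst_acc[0] covers batch 1 only; tst_acc[-1] covers everything
```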

https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L631-L633

In line 633, it prints the best value in tst_acc. As a result, the reported best accuracies for different algorithms and seeds may have been evaluated on different subsets of the test samples.
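With the sketch above, the issue can be seen directly (hypothetical usage):

```python
tst_acc = evaluate(model, test_loader)
best_acc = max(tst_acc)   # may come from any prefix, e.g. only the first 1000 examples
last_acc = tst_acc[-1]    # always the accuracy over the full test set
```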

Is this what you intended? In my experience, the convention is to evaluate all algorithms on an identical test dataset. Also, are the test accuracies reported in the GRAD-MATCH paper the best values as above, or the last test accuracy?

Best, Jang-Hyun

krishnatejakk commented 3 years ago

Hello,

Thanks for pointing out the issue where the accuracy is stored for each batch. This is something we missed when porting the GRAD-MATCH code to the CORDS repository. I have fixed the code and updated it.

In the GRAD-MATCH paper, we report the mean of the last test accuracy on the entire test dataset over five runs. We use the default test split when one is available; otherwise we use a random split of the original dataset. Since most standard datasets such as CIFAR10, CIFAR100, and MNIST come with default test splits, the reported test accuracy for different algorithms is in fact computed on the same samples.

However, even under the old logic, the last test accuracy appended is computed on the entire dataset, because the tst_correct counters are not reset after every batch. In print_args, we actually print this last test accuracy, which is therefore the accuracy on the entire test dataset.

https://github.com/decile-team/cords/blob/22dfb3bf102510ac86f6bd5c17c0169fc5581e8c/train.py#L557-L584
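For reference, a minimal sketch of what the fix amounts to (a reconstruction under the assumption that a single accuracy is recorded per evaluation pass; not copied from the updated code):

```python
import torch

def evaluate_fixed(model, test_loader):
    """Sketch of the corrected logging: counters still accumulate across
    batches, but one accuracy is computed after the loop, so every logged
    value covers the identical, full test set."""
    tst_correct, tst_total = 0, 0
    model.eval()
    with torch.no_grad():
        for inputs, targets in test_loader:
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            tst_total += targets.size(0)
            tst_correct += predicted.eq(targets).sum().item()
    # One value per evaluation pass, always over the entire test set.
    return 100.0 * tst_correct / tst_total
```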