GeorgeCazenavette / mtt-distillation

Official code for our CVPR '22 paper "Dataset Distillation by Matching Training Trajectories"
https://georgecazenavette.github.io/mtt-distillation/

Where did you get the 36.1% accuracy for "Dataset Distillation with Infinitely Wide Convolutional Networks"? #10

Closed. NiaLiu closed this issue 2 years ago.

NiaLiu commented 2 years ago

Thanks for your great idea and detailed work, and I hope you are enjoying your day so far.

I have a question regarding your paper "Dataset Distillation by Matching Training Trajectories". In the third sentence from the bottom of the Introduction, you state that you beat the previous state of the art, "Dataset Distillation with Infinitely Wide Convolutional Networks", whose accuracy you report as 36.1%/46.5%. However, the accuracy stated in that paper is actually 64.7%/80.6%.

Is that a small mistake? If not, could you point me to where in that paper you found those accuracy numbers? Thank you and best regards!

GeorgeCazenavette commented 2 years ago

Hi!

Thanks for pointing this out; I'll explain here.

So the paper you referenced, "Dataset Distillation with Infinitely Wide Convolutional Networks," does NOT report results on the same model as previous works (DC, DSA, DM).

Their Table 1 results (the ones you referenced) are obtained by training an infinitely-wide neural tangent kernel (NTK) on their synthetic dataset.

In Table 2, they report results obtained by training standard (finitely-wide) neural networks on their synthetic datasets.

However, these results are still not on the same network used by previous (or our) work. The results in Table 2 are from a width-1024 network whereas ours and previous work use a width-128 network.
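For concreteness, here is a minimal sketch (in PyTorch; not the exact code from either repository, and the layer details are my assumptions) of the kind of ConvNet used in this line of work: a few conv blocks followed by a linear classifier, where the only difference between the two settings above is the channel width.

```python
import torch.nn as nn

class SimpleConvNet(nn.Module):
    """Rough sketch of the ConvNet family used in dataset distillation work:
    conv -> norm -> ReLU -> avg-pool blocks, then a linear classifier.
    Only the channel width differs between the two settings discussed above."""
    def __init__(self, width=128, depth=3, num_classes=10, in_channels=3, im_size=32):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(depth):
            layers += [
                nn.Conv2d(c, width, kernel_size=3, padding=1),
                nn.GroupNorm(width, width),   # instance-norm-style normalization
                nn.ReLU(inplace=True),
                nn.AvgPool2d(2),
            ]
            c = width
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(width * (im_size // 2**depth) ** 2, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

net_128 = SimpleConvNet(width=128)    # width used by DC/DSA/DM and our evaluations
net_1024 = SimpleConvNet(width=1024)  # width used in their Table 2
```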

The numbers you cited from our introduction were obtained by evaluating the "best" publicly available synthetic sets from "Dataset Distillation with Infinitely Wide Convolutional Networks" on a width-128 network using their own public code. We included these numbers at submission time.

When we contacted the authors, they noted that the synthetic sets that gave the best results for the NTK likely wouldn't also give the best results for the width-128 network, and they suggested we evaluate a large suite of their available synthetic sets.

So we then re-evaluated a large sweep of their publicly available synthetic sets, saved over a wide variety of hyper-parameters (details in Sec. A.3 of our paper).

After finding the best-performing synthetic sets from this large suite of hyper-parameters, we included these results in Table 2 of our paper. Unfortunately, I forgot to change the in-text numbers referenced at the end of the introduction, so thanks for pointing this out!
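To make that procedure concrete, here's a rough sketch of what the sweep looks like: load each saved synthetic set, train a few fresh width-128 networks on it, and keep the set with the best mean test accuracy. The loader, trainer, and file layout below are hypothetical placeholders, not the actual evaluation code from either repo.

```python
import glob
import numpy as np

NUM_EVAL_RUNS = 5  # average over several random inits per synthetic set

best_acc, best_path = 0.0, None
for path in glob.glob("kip_synthetic_sets/*.npz"):               # hypothetical file layout
    syn_images, syn_labels = load_synthetic_set(path)             # placeholder loader
    accs = []
    for _ in range(NUM_EVAL_RUNS):
        model = SimpleConvNet(width=128)                           # sketch from above
        accs.append(train_and_eval(model, syn_images, syn_labels)) # placeholder trainer
    mean_acc = float(np.mean(accs))
    if mean_acc > best_acc:
        best_acc, best_path = mean_acc, path

print(f"best synthetic set: {best_path} ({best_acc:.1%})")
```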

All that being said, all of the results reported in "Dataset Distillation with Infinitely Wide Convolutional Networks" use a slightly different architecture (they include an extra convolutional stem), so none of these comparisons are quite 1-to-1. Evaluating their synthetic sets on a width-128 network was the closest we could get.

This whole situation raises another question though. If we want to evaluate the cross-architecture performance of our synthetic sets, should we evaluate at every checkpoint? Or only the checkpoint that performed best on the baseline architecture (as we did in the paper)?

TLDR: The paper you referenced doesn't report results for the architecture used by DC, DSA, DM, and ourselves. We collected these results ourselves using their code and reported them in Table 2. The numbers you referenced in the introduction are from the synthetic sets that performed best on the NTK and should be updated to reflect the results in Table 2.

Let me know if you have any other questions!

NiaLiu commented 2 years ago

Really appreciate the detailed, well-structured explanation!

I agree it is difficult to compare results that are reported on different architectures. I believe any readers who have struggled with this question will find my post, along with your great and detailed explanation, helpful.

And I do have one more question. I notice that Tongzhou Wang is the second author. Since he initiated the idea of dataset distillation and introduced AlexCifarNet for the original experiments, I'm wondering why AlexCifarNet was not included in your experiments. I would really appreciate it if you could help me with this.

Thank you and hope you have a great day!

GeorgeCazenavette commented 2 years ago

Yeah Tongzhou was also on the paper :)

We simply chose the ConvNet architecture so that we could compare more easily to the existing methods (DC, DSA, and DM), since that is the model they used in their experiments, and no dataset distillation work since the original has used AlexCifarNet.

I hope you have a great day as well :)