LeeDoYup / FixMatch-pytorch

Unofficial Pytorch code for "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence" in NeurIPS'20. This repo contains reproduced checkpoints.
MIT License

Training time #8

Closed hkunzhe closed 3 years ago

hkunzhe commented 3 years ago

Thanks for providing the well-documented code! It seems that every 1000 iterations take about 5-6 minutes on a single NVIDIA 2080Ti GPU. For MixMatch, I used the code here, and every 1000 iterations take only 1 minute.

I agree that consistency-regularization-based SSL methods take much longer to train the model. The fundamental reason is that FixMatch does not use an external dataset or a pretrained model, but learns hidden representations from the unlabeled data of the downstream task (CIFAR10). In addition, FixMatch requires 2^20 iterations, which is far more than supervised learning needs (150 epochs with batch_size = 128 is about 60,000 iterations).

In fact, MixMatch also uses consistency regularization and the training iterations are the same as FixMatch. What do you think caused the slow training of FixMatch compared to MixMatch?
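The iteration counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the 50,000-image CIFAR-10 training set and the choice to keep the last partial batch of each epoch are assumptions, and `supervised_iterations` is a hypothetical helper, not part of this repo.

```python
import math


def supervised_iterations(num_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for plain supervised training (hypothetical helper)."""
    # Assumption: the last, partial batch of each epoch is kept.
    steps_per_epoch = math.ceil(num_train / batch_size)
    return steps_per_epoch * epochs


# FixMatch schedule quoted in the thread.
fixmatch_iters = 2 ** 20

# Supervised baseline quoted in the thread: 150 epochs, batch_size = 128.
# Assumption: 50,000 training images (standard CIFAR-10 train split).
supervised_iters = supervised_iterations(50_000, 128, 150)

ratio = fixmatch_iters / supervised_iters
print(f"FixMatch:   {fixmatch_iters} iterations")
print(f"Supervised: {supervised_iters} iterations")
print(f"Ratio:      {ratio:.1f}x")
```

Under these assumptions the supervised run is a little under 60,000 steps, so the FixMatch schedule is roughly 18 times longer before per-iteration cost is even considered.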

LeeDoYup commented 3 years ago

Let's talk about this issue. FixMatch uses WRN-28-2 (about 1.5M parameters) and 960 samples per iteration: 64 labeled samples, 64x7 unlabeled samples with weak augmentation, and 64x7 unlabeled samples with strong augmentation.

What is the total mini-batch size of MixMatch per iteration?

LeeDoYup commented 3 years ago

Oh, I have checked the code. It uses 128 samples per iteration (64 labeled, 64 unlabeled). The speed is quite reasonable, because FixMatch (5-6 minutes) uses 7.5x more samples per iteration than MixMatch (1 minute).
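The sample counts in the two comments above can be reproduced as follows. This is a minimal sketch: `fixmatch_samples_per_iter` is a hypothetical helper, and the MixMatch figure of 128 is simply the number reported in this thread, not something derived from that codebase.

```python
def fixmatch_samples_per_iter(labeled_bs: int, uratio: int) -> int:
    """Images pushed through the network in one FixMatch step (hypothetical helper)."""
    unlabeled_bs = labeled_bs * uratio
    # Each unlabeled image is forwarded twice: one weak and one strong view.
    return labeled_bs + 2 * unlabeled_bs


fixmatch_samples = fixmatch_samples_per_iter(labeled_bs=64, uratio=7)

# As reported in the thread: 64 labeled + 64 unlabeled.
mixmatch_samples = 64 + 64

sample_ratio = fixmatch_samples / mixmatch_samples
print(fixmatch_samples, mixmatch_samples, sample_ratio)  # 960 128 7.5
```

Sample count is only a rough proxy for wall-clock time, which is consistent with the measured gap (5-6x) being somewhat smaller than 7.5x.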

hkunzhe commented 3 years ago

Thanks for your quick reply! I see that the uratio parameter in your code represents $\mu$ in the paper. In Section B.5, "Ratio of Labeled to Unlabeled Data in Minibatch", do we find that setting $\mu$ to 8 is enough to achieve a small error rate?

LeeDoYup commented 3 years ago

@hkunzhe In Figure 3 (a) of the original paper, you can find the ablation study on the unlabeled data ratio. The authors report that $\mu$=8 gives the smallest error with or without learning rate scaling.

hkunzhe commented 3 years ago

@LeeDoYup Thanks for your patient reply! Following the ablation study on the unlabeled data ratio with different learning rate scaling strategies, I experimented with $\mu=1$ (64 labeled samples, 64 unlabeled samples). Since $\eta=0.03$ when $\mu=7$ (64 labeled samples and 64x7 unlabeled samples), I set $\eta=0.01$ when $\mu=1$ (it should be 0.0075 exactly. Is that correct?). I achieved a better result than MixMatch with a similar setting. The training time and GPU memory are also similar ($\mu=1$).
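One way to work out the scaled learning rate is sketched below. It assumes that $\eta$ is scaled linearly with the total batch size $B(1+\mu)$; the thread does not state which scaling rule was actually used, and `scaled_lr` is a hypothetical helper rather than a function from this repo.

```python
def scaled_lr(base_lr: float, base_uratio: int, new_uratio: int) -> float:
    """Linearly rescale the learning rate with total batch size (hypothetical helper).

    Assumption: total batch size is B * (1 + mu), with the labeled batch
    size B held fixed, so B cancels out of the ratio.
    """
    return base_lr * (1 + new_uratio) / (1 + base_uratio)


# Reference setting from the thread: eta = 0.03 at mu = 7.
lr_mu1 = scaled_lr(base_lr=0.03, base_uratio=7, new_uratio=1)
print(f"eta at mu=1: {lr_mu1:.4f}")
```

Under this assumption the value for $\mu=1$ works out to 0.0075, so $\eta=0.01$ is a slightly larger, rounded choice.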

LeeDoYup commented 3 years ago

Thanks for the experiments. I want to check: are the training time and GPU memory similar to the MixMatch setting or to FixMatch ($\mu=7$)?

LeeDoYup commented 3 years ago

Moreover, if you share the training setting (which dataset, the number of labeled samples), it would help other people interpret the ablation experiment!