I have some questions. Why is the initial value of the EMA parameter tau set so small (0.1) in the code, while the paper uses 0.9998 or 0.9999?
Furthermore, why does the code use the SGD optimizer instead of the Adam optimizer mentioned in the paper?
Are these differences due to the size of the training dataset?