Closed: Genie-Kim closed this issue 3 years ago
The "test-domain validation" metric is limited to one query per hyper-parameter (in fact, the last epoch) in order to avoid extremely optimistic results. Therefore, it could return worse-than-optimal results. If we allowed the "test-domain validation" metric to peek at all epochs for all hyper-parameters, it would almost surely report better results than "training-domain validation" in all cases.
Thanks for making a great library. However, while I was measuring the performance of the algorithm with sweep.py, I found that the test-domain validation set came out lower than the training-domain validation set. It's not just my algorithm, it's common in the existing sweep data. I think this is not normal, Any idea why this is happening?
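To make the asymmetry concrete, here is a minimal sketch of the two selection rules applied to one run's per-checkpoint records. The record structure and field names are hypothetical illustrations, not DomainBed's actual keys or selection code:

```python
# Minimal sketch, assuming each run logs a list of per-checkpoint records
# (hypothetical dicts with illustrative numbers, not real sweep results).
records = [
    {"step": 1000, "train_val_acc": 0.81, "test_val_acc": 0.74, "test_acc": 0.72},
    {"step": 2000, "train_val_acc": 0.84, "test_val_acc": 0.78, "test_acc": 0.76},
    {"step": 3000, "train_val_acc": 0.83, "test_val_acc": 0.75, "test_acc": 0.73},
]

def training_domain_selection(records):
    # May search over *all* checkpoints: pick the one with the best
    # training-domain validation accuracy, then report its test accuracy.
    best = max(records, key=lambda r: r["train_val_acc"])
    return best["test_acc"]

def test_domain_selection(records):
    # Queries the test domain only once, at the *last* checkpoint,
    # regardless of whether an earlier checkpoint was better.
    return records[-1]["test_acc"]

print(training_domain_selection(records))  # 0.76 (best checkpoint)
print(test_domain_selection(records))      # 0.73 (last checkpoint)
```

In this example the last checkpoint happens to be worse than the best one, so the "oracle" rule reports a lower number even though it sees test-domain labels, which matches the pattern observed in the sweep.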