fmcarlucci / JigenDG

Repository for the CVPR19 oral paper "Domain Generalization by Solving Jigsaw Puzzles"
GNU Affero General Public License v3.0

Comparison on PACS and VLCS #22

Closed: vihari closed this issue 4 years ago

vihari commented 4 years ago

In Table 1 and Table 2 of your paper, you show the performance of DeepAll alongside the performance of the related work. DeepAll is the baseline number that should have been the same across different methods had the dataset and implementation been standardized, is that correct? My question is: how are you comfortable comparing methods from different implementations when they have such diverging baseline numbers? That is, how can you be sure whether the improvements come from a better implementation or from better generalization? One could treat the improvement over DeepAll as indicative of domain generalization, but deltas over the baseline need not scale linearly; it might be harder to push DeepAll further when it is already doing well. I am at a loss trying to make sense of the PACS and VLCS evaluations. What am I missing?

Thanks

silvia1993 commented 4 years ago

"DeepAll is the baseline number that should have been the same across different methods had the dataset, implementation are standardized, is that correct?" Yes, it is correct.

About your question: there can be several differences in how each method is implemented that lead to a different DeepAll, even if all the methods use the same backbone as a starting point. A different learning rate, batch size, or data augmentation could be used. So, in an ideal world, every method that uses the same backbone should have the same DeepAll, but since that is not possible (also because not all the algorithms provide code, so it is not always possible to see the implementation choices in detail), we think it is fairer to report the DeepAll for each method.
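To make this concrete, here is a minimal, purely hypothetical sketch of the kind of implementation choices that can differ between two DeepAll baselines sharing the same backbone; every name and value below is invented for illustration and does not come from any specific paper:

```python
# Hypothetical sketch: two "DeepAll" baselines on the same backbone can still
# differ, because learning rate, batch size and data augmentation are all
# implementation choices. All values here are invented for illustration.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DeepAllConfig:
    backbone: str
    learning_rate: float
    batch_size: int
    augmentation: Tuple[str, ...]

# Two plausible but different implementations of the same nominal baseline.
paper_a = DeepAllConfig("alexnet", learning_rate=1e-3, batch_size=128,
                        augmentation=("horizontal_flip",))
paper_b = DeepAllConfig("alexnet", learning_rate=1e-2, batch_size=64,
                        augmentation=("horizontal_flip", "random_crop", "color_jitter"))

for cfg in (paper_a, paper_b):
    print(cfg)  # same backbone, different training recipe, different DeepAll
```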

Furthermore, from Table 1 and 2 of our work you can see that our DeepAll is, in almost all cases, higher than the others: we tried to compare our method against the most powerful version of DeepAll in order to see the actual gain. The methods in the literature with which we compare show you what happens in settings where DeepAll is quite low, but you do not know whether those methods would still work once DeepAll has been raised.

I hope I have answered your questions!

vihari commented 4 years ago

Thanks very much for the quick response. I really appreciate that you report DeepAll for all the methods. This brought much-needed clarity, since many other DG papers that use these datasets compare directly without any indication that some or much of the improvement comes from a better implementation. Just a quick follow-up question:

The methods in the literature with which we compare show you what happens in settings where DeepAll is quite low, but you do not know whether those methods would still work once DeepAll has been raised.

Although I see your point, I feel we cannot be sure of it. MLDG and DeepC (referring to Table 1 of your paper) improve over DeepAll by 65.27 -> 69.26 (+3.99) and 67.24 -> 70.01 (+2.77) respectively, compared to 71.52 -> 73.38 (+1.86) for JiGen. The improvements of MLDG and DeepC may shrink when using the better implementation of JiGen, but it is hard to say whether they would end up better or worse than JiGen. What are your comments?
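To make the comparison concrete, here is the arithmetic above as a small script; the accuracies are the ones quoted from your Table 1, and the script itself is only a quick illustration:

```python
# Worked arithmetic from the Table 1 numbers quoted above: absolute accuracy
# and gain over each method's own DeepAll baseline.
results = {
    # method: (its DeepAll baseline, its reported accuracy)
    "MLDG":  (65.27, 69.26),
    "DeepC": (67.24, 70.01),
    "JiGen": (71.52, 73.38),
}

for method, (deepall, accuracy) in results.items():
    gain = accuracy - deepall
    print(f"{method:5s}  DeepAll={deepall:.2f}  accuracy={accuracy:.2f}  gain=+{gain:.2f}")

# MLDG (+3.99) and DeepC (+2.77) gain more over their weaker baselines than
# JiGen (+1.86) gains over a stronger one, but those gains need not transfer
# linearly once the baseline is raised.
```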

I find this problem quite unsettling, so in our paper we declined to compare beyond the method whose implementation we used, JiGen: https://arxiv.org/abs/2003.12815.

silvia1993 commented 4 years ago

"The improvements of MLDG and DeepC may suffer when using better implementation of JiGen but it is hard to answer if it would be better or worse than JiGen."

Yes, I get your point, but it is infeasible to re-implement all the methods on top of our baseline implementation. So, I think reporting DeepAll for each method is enough for a fair comparison.

vihari commented 4 years ago

Thanks, that answers my questions.