divelab / GOOD

GOOD: A Graph Out-of-Distribution Benchmark [NeurIPS 2022 Datasets and Benchmarks]
https://good.readthedocs.io/
GNU General Public License v3.0

Question about the DIR performance discrepancy between paper Table 13 and the leaderboard #15

Closed TimeLovercc closed 1 year ago

TimeLovercc commented 1 year ago

It seems the DIR performance for concept-OODValidation-OODtest is quite different between Table 13 in the paper and the leaderboard (72.14 vs. 82.96).

CM-BF commented 1 year ago

Hi Zhimeng,

Thank you for your question! There are three reasons as follows.

  1. The OOD generalization problem has not been theoretically solved by DIR, i.e., the lack of a guarantee leads to relatively random results with large variance.
  2. The GOOD-Motif dataset is designed as a sanity check that exaggerates the OOD problem under structural shifts.
  3. Leaderboard 1.1.0, on the latest datasets, uses larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new but more statistically significant results. However, this cannot guarantee better results; e.g., you can notice that DIR's performance on the basis-covariate split also differs (39.99 on the leaderboard vs. 61.50 in the paper), which again reflects my first point.

Best, Shurui Gui

AGTSAAA commented 1 year ago

Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?

CM-BF commented 1 year ago

Hi,

Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?

The tuning process is automatic, without my interference. The broader range is only part of the reason, not the most important factor. The most significant problem is that the DIR strategy cannot guarantee successful subgraph discovery, which makes its results on this sanity check unspecified, i.e., it has high hyperparameter sensitivity in this scenario. If one runs the hyperparameter sweep, one may notice that the performance gap between its best and second-best results can be huge.
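For intuition only, a minimal sketch of such a sweep is below. It is not the GOOD sweep code; the hyperparameter names, value grid, and the stand-in training function are all hypothetical. The point is the last few lines: rank configurations by OOD-validation accuracy and compare the best and second-best test results.

```python
# Hypothetical sweep sketch: train DIR under each configuration, select by
# OOD-validation accuracy, and inspect the gap between the two best configs.
import itertools
import random

def train_and_eval_dir(causal_ratio, lr):
    """Stand-in for a real DIR training run; returns (ood_val_acc, ood_test_acc).
    Random numbers are used here only so the sketch executes end to end."""
    rng = random.Random(hash((causal_ratio, lr)))
    return rng.uniform(0.3, 0.9), rng.uniform(0.3, 0.9)

search_space = {
    "causal_ratio": [0.25, 0.5, 0.75],  # hypothetical grid
    "lr": [1e-3, 1e-4],
}

results = []
for causal_ratio, lr in itertools.product(*search_space.values()):
    val_acc, test_acc = train_and_eval_dir(causal_ratio, lr)
    results.append({"causal_ratio": causal_ratio, "lr": lr,
                    "val": val_acc, "test": test_acc})

# Rank by OOD-validation accuracy; a large test-accuracy gap between the best
# and second-best configurations signals high hyperparameter sensitivity.
ranked = sorted(results, key=lambda r: r["val"], reverse=True)
best, second = ranked[0], ranked[1]
print(f"best: {best['test']:.2f}  second-best: {second['test']:.2f}  "
      f"gap: {abs(best['test'] - second['test']):.2f}")
```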

Best, Shurui Gui

AGTSAAA commented 1 year ago

Hi, Thank you! Do you have some insights about why DIR is not stable compared to other methods in the leaderboard?

TimeLovercc commented 1 year ago

Leaderboard 1.1.0, on the latest datasets, uses larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new but more statistically significant results. However, this cannot guarantee better results; e.g., you can notice that DIR's performance on the basis-covariate split also differs (39.99 on the leaderboard vs. 61.50 in the paper), which again reflects my first point.

Hi Shurui,

Thank you for shedding light on the differences in the leaderboard results and the paper. My current understanding is:

  1. The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.
  2. You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.
  3. There have been no modifications or updates to the datasets between these two sets of results. Could you please confirm my understanding of these points?

Thank you! Zhimeng

CM-BF commented 1 year ago

Hi Zhimeng,

The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.

Yes, partially. It is not just across runs, but also across different hyperparameters (high sensitivity).

You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.

Yes. The leaderboard results are the latest results. We haven't updated the paper to reflect them.

There have been no modifications or updates to the datasets between these two sets of results. Could you please confirm my understanding of these points?

Yes. Both GOOD-Motif datasets are the same.

Best, Shurui

CM-BF commented 1 year ago

Hi,

Hi, Thank you! Do you have some insights about why DIR is not stable compared to other methods in the leaderboard?

Thank you for your question! Since you are interested in this insight, I'd like to redirect you to our work LECI. Specifically, you may find Figure 4 and Table 8 useful. In brief, the training of subgraph discovery networks adds one more degree of freedom (structure disentanglement), so without guarantees, the generalization results are unspecified.

In addition, it is critical to note that these synthetic datasets are sanity checks that exaggerate the OOD problem. You may test your initial theory and implementation on them; if your theory is right, you can obtain much better results. The easiest way to validate is to use test-domain validation, as shown in Table 10 of LECI. Generally, without appropriate theoretical guarantees, a method cannot pass the sanity check even with test-domain validation, as we observed.
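To make the two selection protocols concrete, here is a toy comparison. The numbers and field names are made up; it only shows the difference between selecting runs by OOD-validation accuracy and by test-domain ("oracle") validation accuracy.

```python
# Toy comparison of model-selection protocols over a hypothetical set of runs.
runs = [
    {"ood_val": 0.71, "test_domain_val": 0.55, "test": 0.52},
    {"ood_val": 0.66, "test_domain_val": 0.83, "test": 0.81},
    {"ood_val": 0.69, "test_domain_val": 0.60, "test": 0.58},
]

by_ood_val = max(runs, key=lambda r: r["ood_val"])
by_oracle = max(runs, key=lambda r: r["test_domain_val"])  # test-domain validation

print("test acc with OOD-validation selection:", by_ood_val["test"])
print("test acc with oracle (test-domain) selection:", by_oracle["test"])
# A method that stays low even under oracle selection fails the sanity check
# regardless of how the validation set is chosen.
```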

Best, Shurui

TimeLovercc commented 1 year ago

Hi Shurui,

I really appreciate your timely reply.

Thank you for providing clarity on my previous queries. I have a few more questions, particularly related to the design choices of the GOOD-Motif dataset, specifically in the get_basis_concept_shift_list function:

  1. What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?
  2. I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

Best, Zhimeng

CM-BF commented 1 year ago

Hi Zhimeng,

You are most welcome!

What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?

The original purpose is to simulate a real-world scenario where one can collect data from several environments. Although these environments contain data distributions with similar biases, the degrees of the biases differ. This information can help judge whether a strong correlation is spurious or not, under the assumption that the data-collection noise from different environments has the same intensity.
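As an illustration only, a simplified sketch of this idea is below. It is not the actual get_basis_concept_shift_list implementation; the helper names, class count, and spurious ratios are hypothetical. It only shows the structure: several training environments sharing the same spurious correlation, each at a different strength.

```python
# Hypothetical sketch: build training environments whose base-graph/motif
# correlation has a different strength in each environment.
import random

def make_graph(motif_label, base_matches_motif):
    """Stand-in for generating one motif graph attached to a base graph."""
    return {"motif": motif_label, "base_matches_motif": base_matches_motif}

def build_environment(num_graphs, spurious_ratio, seed):
    rng = random.Random(seed)
    env = []
    for _ in range(num_graphs):
        label = rng.randrange(3)                 # three motif classes
        biased = rng.random() < spurious_ratio   # bias strength of this environment
        env.append(make_graph(label, biased))
    return env

# Hypothetical per-environment bias strengths for the training split: the
# correlation is strong everywhere, but its exact degree varies, which is the
# signal an invariance-based method can exploit.
train_spurious_ratios = [0.9, 0.7, 0.5]
train_envs = [build_environment(1000, r, seed=i)
              for i, r in enumerate(train_spurious_ratios)]
```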

I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

This design also simulates a real-world scenario in which it is more practical to collect data similar to the test domain than to obtain data whose distribution is exactly the same as the test domain's. The validation set is a bridge between the training and test sets. Inspired by DomainBed, where oracle-domain validation can produce better results, we modify this principle by making the validation set more practical to obtain.
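Put together with the sketch above, the intended shape is roughly the following; apart from the 0.3 you asked about, the numbers are placeholders rather than the values hard-coded in GOOD.

```python
# Illustrative spurious-ratio layout: biased training environments, a mildly
# biased validation set that is easier to collect than true test-domain data,
# and a test set where the spurious correlation no longer helps.
spurious_ratios = {
    "train_envs": [0.9, 0.7, 0.5],  # placeholders, as in the sketch above
    "val": 0.3,                     # the value you asked about
    "test": 0.0,                    # placeholder: spurious correlation removed
}
```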

Please let me know if you have any questions. :smile:

Best, Shurui

TimeLovercc commented 1 year ago

Thank you for your timely and patient response. It's quite helpful!

Best, Zhimeng