Aalto-QuML / ClimODE

ClimODE: Climate and Weather Forecasting With Physics-informed Neural ODEs
https://yogeshverma1998.github.io/ClimODE/
MIT License

Comparison values (e.g. FCN) in the paper do not match up #1

Open marvingabler opened 3 months ago

marvingabler commented 3 months ago

Hey folks, first congrats to your paper & hard work!

I am wondering where the comparison values in the paper (e.g. the FourCastNet RMSE) are coming from. They seem to be very different from what the authors describe in their papers and from what we could experimentally verify last year.

These numbers would obviously change the conclusions of the paper. Before I go deeper into also checking other variables and ClimaX, I wanted to reach out to check whether I am missing something.

To compare your scores easily with other open AI weather models (you can choose target resolutions), I can highly recommend WeatherBench's web UI

One more comment: GraphCast and Pangu are open & can be used for comparison (in contrast to your statement in the paper).

yogeshverma1998 commented 3 months ago

Hi,

Our work doesn't use the full set of variables used in the FourCastNet paper (for training the model), primarily due to academic computational constraints. FCN uses a resolution of 0.25°, whereas we use a resolution of 5.625°, leading to a 32x64 grid. A direct comparison is therefore unfair.
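(For reference, a minimal sketch of the grid arithmetic behind the 32x64 figure and one simple way to downsample, assuming xarray-style ERA5 data; the variable name `t2m` and the bilinear-interpolation choice are illustrative, and WeatherBench itself ships data already regridded this way.)

```python
import numpy as np
import xarray as xr

# A 5.625-degree grid tiles the globe with 180 / 5.625 = 32 latitudes
# and 360 / 5.625 = 64 longitudes, hence the 32x64 grid.
assert 180 / 5.625 == 32 and 360 / 5.625 == 64

# Hypothetical ERA5-style field on the native 0.25-degree grid (720x1440
# here; the real ERA5 grid has 721 latitude points including both poles).
ds = xr.Dataset(
    {"t2m": (("lat", "lon"), np.random.rand(720, 1440))},
    coords={
        "lat": np.arange(-89.875, 90, 0.25),
        "lon": np.arange(0, 360, 0.25),
    },
)

# Linear interpolation onto the 5.625-degree target grid (a sketch only;
# WeatherBench distributes pre-regridded data).
target = ds.interp(
    lat=np.arange(-87.1875, 90, 5.625),  # 32 points
    lon=np.arange(0, 360, 5.625),        # 64 points
)
assert target["t2m"].shape == (32, 64)
```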

We have described the variables and the resolution of the data we used in Appendix B. For a fair comparison, we re-ran FCN and ClimaX (without pretraining) with the same hyper-parameters (provided in their official repos), with those variables, at 32x64 resolution. This also means the scores are not directly comparable to those in the WeatherBench web UI.
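(For context, the headline metric in these comparisons is the latitude-weighted RMSE used by WeatherBench. A minimal NumPy sketch, with illustrative array shapes; note that WeatherBench averages the per-forecast RMSE over time, which coincides with the pooled version below for a single field:)

```python
import numpy as np

def lat_weighted_rmse(pred: np.ndarray, truth: np.ndarray, lat: np.ndarray) -> float:
    """Latitude-weighted RMSE in the style of WeatherBench.

    pred, truth: forecast and ground truth, shape (..., n_lat, n_lon)
    lat: grid latitudes in degrees, shape (n_lat,)
    """
    weights = np.cos(np.deg2rad(lat))
    weights = weights / weights.mean()               # weights average to 1
    sq_err = (pred - truth) ** 2 * weights[:, None]  # weight latitude bands
    return float(np.sqrt(sq_err.mean()))

# Example on the 32x64 grid corresponding to 5.625-degree resolution.
lat = np.arange(-87.1875, 90, 5.625)
pred, truth = np.random.rand(32, 64), np.random.rand(32, 64)
print(lat_weighted_rmse(pred, truth, lat))
```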

We acknowledge that we use a small number of variables in the spatial domain, primarily due to academic computational constraints. Our main goal was to demonstrate that "proper" continuous-time approaches can be viable for weather forecasting. We are still working on expanding the method to incorporate more variables, so that the scores can be compared easily with the WeatherBench web UI.

> GraphCast and Pangu are open & can be used for comparison (in contrast to your statement in the paper)

I think the training code for Pangu has still not been released (as stated here: https://github.com/198808xc/Pangu-Weather/issues/58), which makes a fair comparison impossible. I think the initial GraphCast release came after the ICLR submission deadline (and was later updated with some usage instructions, etc.). We are still working towards a fair one-to-one comparison with GraphCast by adapting their code; however, it is somewhat challenging due to the lack of proper documentation on training, pre-processing, inference, etc.

marvingabler commented 3 months ago

Hey @yogeshverma1998, thanks for the prompt response & detailed explanation, makes total sense to me!

I think your approach is quite unique & I want to support further evaluation. I can offer you a few H100 nodes (8 H100s each) for your research experiments; drop me a mail at marvin@jua.ai if that would help!