esmf-org / esmf

The Earth System Modeling Framework (ESMF) is a suite of software tools for developing high-performance, multi-component Earth science modeling applications.
https://earthsystemmodeling.org/

For nearest-neighbor remapping, ensure results are independent of processor count if there are equidistant source points #276

Open billsacks opened 1 month ago

billsacks commented 1 month ago

For nearest-neighbor remapping, there is currently some logic that, when there are equidistant source points, arbitrarily uses the point with the smallest ID. But, according to @oehmke, that logic isn't applied in the multi-processor case, because the IDs currently aren't sent between processors. As a result, nearest-neighbor mapping gives different results with different processor counts when there are equidistant source points. @oehmke proposes also sending the IDs, so that the multi-processor case can break ties by ID, just as the single-processor case does.
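A minimal Python sketch of why sending the IDs fixes this. It is not ESMF code. The function names are hypothetical, distances are planar rather than spherical, and a list of per-processor partitions stands in for the real distributed search. Comparing `(distance, id)` pairs in the cross-processor reduction makes the winner independent of how the source points are split. Comparing distances alone lets the processor ordering decide exact ties.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a source point is (id, x, y) in planar
# coordinates. ESMF measures distances on the sphere; planar squared
# distance is used here only to keep the sketch short.
Point = Tuple[int, float, float]
Partition = List[Point]


def sq_dist(p: Point, dst: Tuple[float, float]) -> float:
    """Squared planar distance from source point p to destination dst."""
    return (p[1] - dst[0]) ** 2 + (p[2] - dst[1]) ** 2


def local_best(points: Partition, dst: Tuple[float, float]) -> Tuple[float, int]:
    """Best candidate on one processor: smallest distance, ties go to smallest ID."""
    return min((sq_dist(p, dst), p[0]) for p in points)


def nearest_with_ids(partitions: List[Partition], dst: Tuple[float, float]) -> int:
    """Cross-processor reduction that compares (distance, id) pairs.

    Because the IDs travel with the distances, ties are broken the same way
    no matter how the source points are partitioned.
    """
    candidates = [local_best(part, dst) for part in partitions if part]
    return min(candidates)[1]


def nearest_without_ids(partitions: List[Partition], dst: Tuple[float, float]) -> int:
    """Cross-processor reduction that compares distances only.

    On an exact tie it keeps whichever processor's candidate it saw first,
    so the answer depends on the partitioning.
    """
    best: Optional[Tuple[float, int]] = None
    for part in partitions:
        if not part:
            continue
        d, pid = local_best(part, dst)
        if best is None or d < best[0]:
            best = (d, pid)
    assert best is not None, "no source points"
    return best[1]


if __name__ == "__main__":
    # Four source points equidistant from a destination at the origin,
    # like four pixels meeting at one corner.
    pts = [(1, 1.0, 1.0), (2, -1.0, 1.0), (3, -1.0, -1.0), (4, 1.0, -1.0)]
    dst = (0.0, 0.0)
    layouts = {
        "1 processor": [pts],
        "2 processors": [pts[:2], pts[2:]],
        "2 processors, other split": [[pts[3]], pts[:3]],
    }
    for name, parts in layouts.items():
        print(name, nearest_with_ids(parts, dst), nearest_without_ids(parts, dst))
```

With the IDs, every layout picks ID 1. Without them, the last layout picks ID 4. The tie rule is similar in spirit to MPI's `MPI_MINLOC` reduction, which resolves equal values in favor of the smallest index. Whether ESMF would implement the fix that way is not stated here.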

Discussed in https://github.com/orgs/esmf-org/discussions/261

Originally posted by **samsrabin** July 10, 2024

### Requirements

- [X] Reviewed [ESMF Reference Manual](https://earthsystemmodeling.org/doc/)
- [X] Searched [GitHub Discussions](https://github.com/orgs/esmf-org/discussions?discussions_q=)

### Affiliation(s)

NSF-NCAR

### ESMF Version

_No response_

### Issue

In [CTSM](https://github.com/ESCOMP/CTSM), we use ESMF to read some input files. One particular pair of input files, specifying crop sowing window start and end dates, is at half-degree resolution. We tell ESMF to do nearest-neighbor[^1] spatial interpolation as necessary to match the simulation grid.

[^1]: It needs to be nearest-neighbor because dates are modular: interpolating between Jan. 2 (day 2) and Dec. 31 (day 365) should give Jan. 1 (day 1), not July 3-4 (day [2+365]/2 = 183.5). That's not something ESMF can do, to my knowledge.

When I do a run at 10°x15° resolution, some of the simulation gridcell centers are located exactly at the "corners" of four half-degree input pixels, meaning that those four neighbors are equally near. It doesn't matter to me which of those ESMF chooses as the "nearest neighbor," as long as the choice is consistent. Unfortunately, it's not: at least one gridcell has a different "nearest neighbor" chosen _depending on how many processors the job is split across_.

As an example, I've made a figure based on two cases that are identical in setup except that Case 1 used 128 processors and Case 2 used 64. Due to this issue, a certain crop in the gridcell centered at latitude 0, longitude 30°E[^2] gets a sowing window of days 7-82 in Case 1 but days 336-46 in Case 2.

[^2]: Other crops in this gridcell also get different sowing windows. No crops in any other gridcell get different sowing windows, but that doesn't necessarily mean different "nearest" neighbors aren't being chosen elsewhere. That might be happening, just with input pixels whose values don't differ.

The white/gray/black in this figure represents the half-degree sowing window files. Gray pixels match the values in Case 1, black pixels match Case 2, and white pixels match neither. The red lines intersect at the center of the 10x15 CTSM gridcell.

![screenshot_1104](https://github.com/esmf-org/esmf/assets/10454527/74af317f-2eaa-4c0d-bffe-78e46c4a08c8)

It looks like Case 1 reads from the pixel to the southwest, whereas Case 2 reads from the pixel to the northwest.

Some notes:

- I'm not 100% certain this is an ESMF issue as opposed to something weird that CTSM is doing, but I've done all the troubleshooting I can within CTSM.
- This reproduces every time, over dozens of tests.

Tagging @ekluzek, @billsacks, and @briandobbins, who have expressed interest in this. By the way, I think I mentioned to y'all that I was having an ERP test pass but the equivalent PEM test fail. This is why! The read of sowing windows only happens at the very beginning of the test, so changing the processor count halfway through makes no difference.

### Autotag

@oehmke
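The first footnote in the post explains why ordinary interpolation is wrong for day-of-year values. A small illustration of that arithmetic follows. It is not ESMF functionality. The helper names are hypothetical, and the sketch assumes a fixed 365-day year with 1-based day numbers. It computes the midpoint the short way around the year and compares it with the naive arithmetic mean.

```python
def circular_midpoint(day_a: int, day_b: int, year_len: int = 365) -> float:
    """Midpoint of two 1-based days of year, taken the short way around the year.

    Hypothetical helper for illustration; assumes a fixed year length.
    """
    # Shift to 0-based days so modular arithmetic wraps cleanly.
    a, b = day_a - 1, day_b - 1
    # Signed shortest offset from a to b, in the range (-year_len/2, year_len/2].
    diff = (b - a) % year_len
    if diff > year_len / 2:
        diff -= year_len
    # Step halfway from a toward b, wrap, and shift back to 1-based.
    return (a + diff / 2) % year_len + 1


def naive_midpoint(day_a: int, day_b: int) -> float:
    """Plain arithmetic mean, which ignores the wrap at the end of the year."""
    return (day_a + day_b) / 2


if __name__ == "__main__":
    # Jan. 2 (day 2) and Dec. 31 (day 365), as in the footnote.
    print(naive_midpoint(2, 365))
    print(circular_midpoint(2, 365))
```

The naive mean gives 183.5 (early July). The circular midpoint gives day 1 (Jan. 1), matching the footnote.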
samsrabin commented 3 weeks ago

Following up: Is this something that's on the roadmap to be in the ESMF version used in the CESM3 release? No worries if not, but in that case I'll need to make some of my tooling more robust and official.

oehmke commented 3 weeks ago

Yep, it's on the roadmap for ESMF 8.8.0, which is what we're targeting for CESM3. I'm hoping to get it done soon-ish, so that we can make sure it works a while before the release.

samsrabin commented 3 weeks ago

Excellent, thanks!