MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0
23 stars 18 forks source link

Use approximation for great circle distance? #62

Open mathause opened 3 years ago

mathause commented 3 years ago

Currently we use geopy.distance.distance to calculate distances between gridpoints. This uses an exact algorithm, assuming an ellipsoid. For an example dataset of 25 * 53 gridpoints it takes 5min 30 sec to calculate all distances (on my laptop). I think we could reduce this to about 2 min.

However, if we assume a sphere and use the Haversine formula we can get it down to << 1 sec. The error should generally be below 0.5 % (to be confirmed). An additional advantage is that we no longer need to save the distance and ghi_phi matrices.

To be checked and cleaned:

import numpy as np

def distance(s_lat, s_lng, e_lat, e_lng):

    # approximate radius of earth in km
    R = 6371.009

    s_lat = np.deg2rad(s_lat)
    s_lng = np.deg2rad(s_lng)
    e_lat = np.deg2rad(e_lat)
    e_lng = np.deg2rad(e_lng)

    d = np.sin((e_lat - s_lat)/2)**2 + np.cos(s_lat)*np.cos(e_lat) * np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

def dist_all(LON, LAT):
     dist = np.empty((len(LON), len(LAT)))
     for i, (lon, lat) in enumerate(zip(LON, LAT)):
         dist[i, :] = distance(lat, lon, LAT, LON)
     return dist

Current implementation:

import geopy.distance

def calc_geodist(lon, lat):

# from geographiclib.geodesic import Geodesic
# g = Geodesic(6378.137, 1 / 298.257223563)
# g.Inverse(47, 8, 0.589, 0.2593, Geodesic.DISTANCE)["s12"]

    n_points = len(lon)

    geodist = np.zeros([n_points, n_points])

    # could be sped up by only computing upper or lower half of matrix
    # since only needs to be done 1x for every land threshold, not implemented
    for i in np.arange(n_points):
        for j in np.arange(n_points):
            geodist[i, j] = geopy.distance.distance((lat[i], lon[i]), (lat[j], lon[j])).km

        if i % 200 == 0:
            print("done with gp", i)

    return geodist

Example:


import xarray as xr

air = xr.tutorial.open_dataset("air_temperature")

lon, lat = air.lon[:10], air.lat[:10]

# uncomment for full example
# lon, lat = air.lon, air.lat
LON, LAT = np.meshgrid(lon, lat)

%time dist_exact = calc_geodist(LON.ravel(), LAT.ravel())
%time dist_approx = dist_all(LON.ravel(), LAT.ravel())

dist_exact = xr.DataArray(dist_exact)
dist_approx = xr.DataArray(dist_approx)

abs_diff = np.abs(dist_approx - dist_exact)
rel_diff = abs_diff / dist_exact * 100
print("Max error (in example domain):")
print(f"- {abs_diff.max().item():0.2f} km")
print(f"- {rel_diff.max().item():0.2f} %")

rel_diff.plot(robust=True)
mathause commented 3 years ago

And here is the fast and exact solution:

import pyproj

def geodist_exact_fast(lon, lat):

    lon, lat = np.asarray(lon), np.asarray(lon)
    if lon.shape != lat.shape:
        raise ValueError("lon and lat need to have the same shape")

    geod = pyproj.Geod(ellps="WGS84")
    n_points = len(lon)

    geodist = np.zeros([n_points, n_points])

    # calculate only the lower half of the triangle
    for i in range(n_points):

        # need to duplicate gridpoint (required by geod.inv)
        lt = np.tile(lat[i], n_points - (i + 1))
        ln = np.tile(lon[i], n_points - (i + 1))

        geodist[i, i + 1:] = geod.inv(ln, lt, lon[i+1:], lat[i+1:])[2]

    # convert m to km
    geodist /= 1000
    # fill the upper half of the triangle (in-place)
    geodist += np.transpose(geodist)

    return geodist

Uses proj (pyproj) under the hood. Takes 1 s for the 25 53 grid points above and yields allclose as mesmer.io.load_constant_files.calc_geodist_exact. The current solution takes 2.5 min, so this is certainly worth it. For the 2.5° x 2.5° land-only grid (360 / 2.5 180 / 2.5 * 0.3 grid points) it takes about 5 s (in contrast to about 13 min).

So it's still too slow to remove saving & loading the data & my question still stands - could we switch to the approximation?