JuliaGeo / Proj.jl

Julia wrapper for the PROJ cartographic projections library
MIT License
48 stars 9 forks source link

very slow geoid calculation #75

Open alex-s-gardner opened 1 year ago

alex-s-gardner commented 1 year ago

I recently issued a pull request to add geoid function to Geodesy.jl because the geoid conversion in Proj.jl is very slow for large numbers of points [~>500] relative to reading the geoid from a local file. I'm not sure if Geodesy.jl is the right home for such a function or if we just need to improve Proj.jl.

Here is a figure and the corresponding code that demonstrates just how slow using Proj is for geoid to ellipsoid conversions:

geoid_benchmark


# retrieve EGM2008 geoid heights ["EPSG:3855"] above WGS84 ellipsoid ["EPSG:4979"] 
# for `n random points

using GeoArrays, Proj

# file can be downloaded here: 
# https://s3-eu-west-1.amazonaws.com/download.agisoft.com/gtg/us_nga_egm2008_1.tif
geoid_file = "/Users/gardnera/data/geoids/us_nga_egm2008_1.tif"
n = [2, 10, 100, 5000, 10000];
time_local = zeros(size(n));
time_proj = zeros(size(n));

height1 = [];
height2 = [];
for i = eachindex(n)
    lat = (rand(n[i]) .- 0.5) .* 180;
    lon = (rand(n[i]) .- 0.5) .* 360;

    ## Approach 1: read directly from 1/60 degree geotiff
    begin
        t = time();
        # read [lazzily] in EGM2008 geoid as a GeoArray with 1 minute resolution

        egm2008 = GeoArrays.read(geoid_file);

        # find indices of points into cropped dem 
        xy = hcat(lon,lat);
        xy = [copy(r) for r in eachrow(xy)];
        ind = indices.(Ref(egm2008), xy);
        ind = [CartesianIndex((ij[1], ij[2])) for ij in ind];

        # extract values
        height1 = egm2008[ind];
        time_local[i] = time()-t;
    end;

    ## Approach 2: use Proj.jl
    begin
        t = time();
        # enamble network for geoid conversions
        Proj.enable_network!();

        # build transformation 
        trans = Proj.Transformation("EPSG:4326", "EPSG:3855", always_xy=true);

        # project points
        data = trans.(lon,lat,0);

        height2 = -getindex.(data, 3);
        time_proj[i] = time()-t;
    end;
end

using CairoMakie
# Check that both return same resolution
f = Figure()
ax = Axis(f[1, 1],
    title = "validation",
    xlabel = "local",
    ylabel = "Proj.jl"
)
scatter!(ax, height1, height2)

# plot time between appraoches
ax = Axis(f[1, 2],
    title = "timing",
    xlabel = "number of points",
    ylabel = "time [s]"
)
lines!(ax, n, time_local, label = "local")
lines!(ax, n, time_proj, label = "Proj.jl")
axislegend(ax)
f
alex-s-gardner commented 1 year ago

Do we have any maintainers for Proj?

visr commented 1 year ago

Creating the Proj.Transformation is expensive compared to projecting a point. There is no need to reconstruct it every time.

alex-s-gardner commented 1 year ago

Creating the Proj.Transformation is expensive compared to projecting a point. There is no need to reconstruct it every time.

I'm not sure I follow the comment, Proj.Transformation is only generated once in the above example

visr commented 1 year ago

But it's in for i = eachindex(n) right?

visr commented 1 year ago

Oh nevermond I misunderstood

yeesian commented 1 year ago

I guess it isn't a fair comparison to read from a file of precomputed values versus transforming it from points in a different coordinate system haha. One possibility is to reach out to the developers of the underlying PROJ library to see if they might consider adding to https://proj.org/resource_files.html? That way it'll be useful to PROJ users beyond JuliaGeo/Julialang. Update: @visr has a much better response in https://github.com/JuliaGeo/Proj.jl/issues/75#issuecomment-1380041465 and https://github.com/JuliaGeo/Proj.jl/issues/75#issuecomment-1380549902

visr commented 1 year ago

This is the pipeline you are using, with the proj definition:

Transformation pipeline
    description: axis order change (2D) + WGS 84 to EGM2008 height (1)
    definition: proj=pipeline step proj=unitconvert xy_in=deg xy_out=rad step inv proj=vgridshift grids=us_nga_egm08_25.tif multiplier=1 step proj=unitconvert xy_in=rad xy_out=deg step proj=axisswap order=2,1
    direction: forward

Note the grids=us_nga_egm08_25.tif part. If I'm not mistaken this grid is accessed remotely over the PROJ CDN. This is probably what is slow. The docs have some functions related to caching that could be worth looking into. But indeed a better comparison would be to let Proj also use the same local grid that you downloaded, and adjust the pipeline to have it use that. There are also configuration settings and grid cache functions that may help here. The defaults in proj.ini are:

cache_enabled = on
cache_size_MB = 300
cache_ttl_sec = 86400

Also broadcasting trans.(x,y) is probably not ideal, since I imagine it will open and close the grid once per point. It would probably help a lot to use the proj_trans_generic or proj_trans_array for efficiency.

None of those functions are part of the high level interface, but all are available in libproj.jl and can be used. I put some references below. I've never really done this so I'm not sure what the best way to do this is. Though figuring it out and documenting would be very helpful.

network https://proj.org/resource_files.html https://proj.org/community/rfc/rfc-4.html https://proj.org/development/reference/functions.html#network-related-functionality

array https://proj.org/development/reference/functions.html#c.proj_trans_generic https://proj.org/development/reference/functions.html#c.proj_trans_array https://github.com/JuliaGeo/Proj.jl/blob/86b8b09512abee9e6ef8c6d8ab532c3491c65ce6/test/libproj.jl#L215-L268

visr commented 1 year ago

Switching from using remote to local grid makes it 30x faster for me. Exactly how much will also depend on your network.

using Proj, PROJ_jll

# use projsync to get a local copy of the grid we need
run(`$(projsync()) --file us_nga_egm08_25.tif --user-writable-directory`);
# add the directory to the PROJ search path
user_writable_directory = Proj.proj_context_get_user_writable_directory(false)
Proj.proj_context_set_search_paths(2, [Proj.PROJ_DATA[], user_writable_directory])

Proj.enable_network!()
trans = Proj.Transformation("EPSG:4326", "EPSG:3855", always_xy=true)

n = 2000
n = 10000
lat = (rand(n) .- 0.5) .* 180;
lon = (rand(n) .- 0.5) .* 360;

# ensure the z changes, i.e. it found the grid
@assert trans(lon[1], lat[1], 0)[3] != 0

@time trans.(lon, lat, 0);
# 3.535632 seconds; n = 2k
# 17.051302 seconds; n = 10k

I haven't looked at the PROJ code, but I'm wondering where the remaining difference with you GeoArrays code is coming from. I thought that maybe it would help to make fewer ccalls by using proj_trans_array, but this did not result in a speedup.

A = [Proj.Coord(x, y) for (x, y) in zip(lon, lat)]
# proj_trans_array mutates A
Proj.proj_trans_array(trans.pj, Proj.PJ_FWD, length(A), A);
visr commented 1 year ago
# read [lazzily] in EGM2008 geoid as a GeoArray with 1 minute resolution
egm2008 = GeoArrays.read(geoid_file)

It's worth noting that this is not a lazy read, it read the entire file into memory. Probably PROJ doesn't do that, which is a memory use vs speed tradeoff.

alex-s-gardner commented 1 year ago

Closing as it looks like

Switching from using remote to local grid makes it 30x faster

I wonder if @visr example above should be added to the ReadMe?

visr commented 1 year ago

Yes I think it would be good to add it to the readme, it also took me some time to find the right steps. Would you want to add it?

In the future maybe we can add some convenience functions around it, and store the resource files in a scratch space, for which I opened #76.

alex-s-gardner commented 1 year ago

No problem, I'll add this to the ReadMe

alex-s-gardner commented 1 year ago

@visr it seems that on both my Mac and my linux server I have issues:

Downloading from https://cdn.proj.org into /home/gardnera/.local/share/proj
Cannot open https://cdn.proj.org/files.geojson: Cert verify failed: BADCERT_NOT_TRUSTED
Cannot download files.geojson

I tested access from my browser and I can access the file so this does not appear to be a firewall issue. I see that you dealt with a similar issue in GDAL. Any idea what needs to be done here?

visr commented 1 year ago

Hmm that's odd. Something similar to GDAL is done on loading Proj.jl: https://github.com/JuliaGeo/Proj.jl/blob/v1.4.0/src/Proj.jl#L129-L133

Did you load Proj first? Because if you only loaded PROJ_jll, this initialization code is not run.

alex-s-gardner commented 1 year ago

Not too sure what the issue is. Both packages were loaded before run(). I'm using:

Proj v1.4.0
PROJ_jll v900.100.0+0

the run() call looks like this:

julia> `$(projsync()) --file us_nga_egm08_25.tif --user-writable-directory`
setenv(`/home/gardnera/.julia/artifacts/8a643038d2adde781829b4467a32afa307e23b51/bin/projsync --file us_nga_egm08_25.tif --user-writable-directory`,["GDAL_DATA=/home/gardnera/anaconda2/envs/GDAL/share/gdal", "PATH=/home/gardnera/.julia/artifacts/1da0aa3d7598c95b9d7b3a96367b6a8501b6eae4/bin:/home/gardnera/.julia/artifacts/8793267ae1f4b96f626caa27147aa0218389c30d/bin:/home/gardnera/.julia/artifacts/d22cde7583df1d5f71160a8e4676955a66a91f33/bin:/home/gardnera/.julia/artifacts/8a643038d2adde781829b4467a32afa307e23b51/bin:/home/gardnera/anaconda2/envs/GDAL/bin:/home/gardnera/anaconda2/envs/GDAL/bin/aws:/home/gardnera/anaconda2/bin:/u/bylot0/gardnera/bin:/net/devon/mnt/devon2-r1/devon0/gardnera/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin", "CONDA_PROMPT_MODIFIER=(GDAL) ", "PWD=/home/gardnera/Documents/GitHub/ItsLivePlayground", "XDG_SESSION_CLASS=user", "DISPLAY=localhost:10.0", "TERM_PROGRAM=vscode", "VSCODE_GIT_ASKPASS_NODE=/net/devon/mnt/devon2-r1/devon0/gardnera/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/node", "XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop", "CPL_ZIP_ENCODING=UTF-8"  …  "_=/usr/local/bin/julia", "CONDA_DEFAULT_ENV=GDAL", "USER=gardnera", "CONDA_SHLVL=1", "PROJ_LIB=/home/gardnera/anaconda2/envs/GDAL/share/proj", "CONDA_EXE=/home/gardnera/anaconda2/bin/conda", "HOME=/home/gardnera", "TERM=xterm-256color", "TERM_PROGRAM_VERSION=1.74.3", "COLORTERM=truecolor"])
visr commented 1 year ago

Oh wait setting the path to the certificates in libproj only affects libproj, not applications like projsync. For those you have to set the PROJ_CURL_CA_BUNDLE environment variable, which you can do in julia like this:

https://github.com/JuliaGeo/Proj.jl/blob/86b8b09512abee9e6ef8c6d8ab532c3491c65ce6/test/applications.jl#L23

The PROJ_NETWORK doesn't need to be set, since projsync always uses the network.