OSGeo / PROJ

PROJ - Cartographic Projections and Coordinate Transformations Library
https://proj.org

Limitation of bounding box approach in PROJ 6 (seen on Italian Monte Mario to WGS84 transformations) #1461

Open rouault opened 5 years ago

rouault commented 5 years ago

cs2cs / proj_create_crs_to_crs()+proj_trans() use a bounding-box-based approach to determine which candidate coordinate operation to use. They take the list of coordinate operations returned by proj_create_operations() in its sorted order, where the most relevant operation is sorted first, and proj_trans() iterates over that list until it finds an operation whose bounding box contains the point to transform. This is a limitation when the bounding box of one area of use includes other ones. This is typically the case when looking at the output of projinfo -s EPSG:4326 -t EPSG:3003 --spatial-test intersects, which returns 3 transformations, whose areas of use are Italy mainland, Sardinia and Sicily.

One can notice that the Sardinia bbox is strictly contained in the mainland one, and that the Sicily bbox partly intersects it as well. So there is no chance that the Sardinia-specific transformation will ever be selected, and little chance for the Sicily one.

A bullet-proof approach would be to use the precise contours of the areas of use, which are provided by EPSG as GML files, but that would make the database larger and would involve point-in-polygon testing. Another approach would be to use them temporarily and do offline testing, where one would record in the database that a given BBOX should exclude other BBOXes. In the above case, Italy mainland would exclude Sardinia and Sicily. Not sure if that would make things simpler implementation-wise.
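To make the limitation concrete, here is a minimal sketch of that first-bbox-match selection logic in Python; the extents below are rough illustrative values, not the actual EPSG area-of-use bboxes.

# Illustrative model of the bbox-first selection done by proj_trans();
# the extents are rough approximations, not the EPSG values.
candidates = [
    # (operation name, (west, south, east, north)), sorted most relevant first
    ("Monte Mario to WGS 84 - Italy mainland", (6.6, 36.6, 18.6, 47.1)),
    ("Monte Mario to WGS 84 - Sicily",         (12.0, 36.6, 15.7, 38.4)),
    ("Monte Mario to WGS 84 - Sardinia",       (8.1, 38.8, 9.9, 41.3)),
]

def select_operation(lon, lat):
    """Return the first candidate whose bounding box contains the point."""
    for name, (w, s, e, n) in candidates:
        if w <= lon <= e and s <= lat <= n:
            return name
    return None

# A point on Sardinia is claimed by the mainland bbox first, so the
# Sardinia-specific transformation is never reached:
print(select_operation(9.0, 40.0))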

kbevers commented 5 years ago

A bullet-proof approach would be to use the precise contours of the areas of use, which are provided by EPSG as GML files, but that would make the database larger and would involve point-in-polygon testing.

This one is my dream setup. It obviously adds complexity, but it would be interesting to get an estimate on how much the larger DB and the polygon lookups affect performance.

Another approach would be to use them temporarily and do offline testing, where one would record in the database that a given BBOX should exclude other BBOXes. In the above case, Italy mainland would exclude Sardinia and Sicily. Not sure if that would make things simpler implementation-wise.

I can imagine this approach will require quite a bit of trial and error to dial in the exclusion/inclusion rules. There will likely also be areas that are impossible to handle (at least I can't imagine how right now), for instance the overlapping areas of use for Denmark and Sweden. The greater Copenhagen area and a large chunk of southern Sweden are contained in the neighbouring country's AoU. Which AoU should be used depends entirely on the input. The polygon approach can obviously handle this (assuming the polygons are correct).

rouault commented 5 years ago

to get an estimate on how much the larger DB

Relatively easy to estimate with GDAL. After downloading the GML files from epsg-registry.org and unzipping them, run the following script:

import glob
from osgeo import ogr

# Alternative: an uncompressed GeoPackage output
#ds = ogr.GetDriverByName('GPKG').CreateDataSource('epsg_polygons.gpkg')
ds = ogr.GetDriverByName('SQLite').CreateDataSource('epsg_polygons.db', options = ['SPATIALITE=YES', 'INIT_WITH_EPSG=NO'])
lyr = ds.CreateLayer('polygons', options = ['SPATIAL_INDEX=NO', 'COMPRESS_GEOM=YES'])
lyr.CreateField(ogr.FieldDefn('area_of_use_code', ogr.OFTInteger))
ds.StartTransaction()
for filename in glob.glob("*.xml"):
    xml = open(filename, 'rt').read()
    geom = ogr.CreateGeometryFromGML(xml)
    geom.SwapXY()  # GML coordinates are lat/lon ordered; store as lon/lat
    # filenames look like EPSGPolygon_<area_of_use_code>_<date>.xml
    code = int(filename.split('_')[1])
    f = ogr.Feature(lyr.GetLayerDefn())
    f['area_of_use_code'] = code
    f.SetGeometry(geom)
    lyr.CreateFeature(f)
ds.CommitTransaction()

This creates a 34 MB Spatialite file with compressed geometries, or a 67 MB GeoPackage with uncompressed geometries (Spatialite's compressed geometries use Float32 delta-encoding for the 2nd and following vertices). We may also decide to simplify the contours to reduce the number of vertices where they are very close together. For example, for a precision of 1e-2 degree (~ 1 km):

ogr2ogr -simplify 1e-2 epsg_polygons_simplified.gpkg epsg_polygons.gpkg

we get a 9 MB GeoPackage (or a 5.2 MB compressed Spatialite DB). For a precision of 1e-1 degree (~ 10 km), which is perhaps a bit extreme, we get a 2.7 MB GeoPackage (or a 1.9 MB compressed Spatialite DB).

polygon lookups affect performance.

Point-in-polygon tests would only be done once the BBOX criterion has passed. With the GEOS implementation (GEOSContains()), when testing whether a point is inside the France mainland contour (4628 points, unsimplified), I get ~1.2 ms per test with:

import time
from osgeo import ogr

# France mainland area-of-use polygon
xml = open('EPSGPolygon_2421_2013-11-05.xml', 'rt').read()
geom = ogr.CreateGeometryFromGML(xml)
geom.SwapXY()  # GML coordinates are lat/lon ordered

pt = ogr.CreateGeometryFromWkt('POINT(2 49)')
niters = 5000
start = time.time()
for i in range(niters):
    geom.Contains(pt)
end = time.time()
# average time per point-in-polygon test, in seconds
print((end - start) / niters)

(the real cost might be a bit less, since each test involves an OGRGeometry -> GEOSGeometry conversion)

That said, we should reimplement our own point-in-polygon code, since we probably don't want to depend on GEOS, or find an existing implementation compatible with X/MIT (that said, JTS, of which GEOS is a port, uses http://www.eclipse.org/org/documents/edl-v10.php, which is BSD). Boost Geometry (https://www.boost.org/doc/libs/1_65_1/libs/geometry/doc/html/index.html) could also be a potential candidate. Another option would be for PROJ to ship a basic point-in-polygon implementation, with pluggable callbacks that GDAL could fill with the GEOS implementation.
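For reference, a dependency-free point-in-polygon test is short to write; below is a minimal even-odd ray-casting sketch in Python (assuming a single ring without holes and ignoring edge cases such as points exactly on a boundary), just to illustrate what such code would look like.

def point_in_ring(lon, lat, ring):
    """Even-odd ray casting: ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Unit square around the origin: (0.5, 0.5) is inside, (2, 0) is not
square = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
print(point_in_ring(0.5, 0.5, square), point_in_ring(2, 0, square))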

GEOS/JTS also has the concept of prepared geometries, which could be useful: when proj_create_crs_to_crs() determines that there are several candidate coordinate operations, it could create a prepared geometry for each area of use (to be experimented with, in case the preparation cost makes it inappropriate for single-point transformations). Prepared geometries can significantly speed up point-in-polygon testing, but they would be even more complicated to reimplement from scratch.
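As an illustration of the prepared-geometry pattern (using Shapely here, which wraps GEOS, purely for demonstration; this is not something PROJ itself would depend on):

from shapely.geometry import Point, Polygon
from shapely.prepared import prep

# Toy area-of-use polygon; preparing it builds internal indexes once,
# so repeated contains() tests against many points become much cheaper.
area_of_use = Polygon([(-1, -1), (1, -1), (1, 1), (-1, 1)])
prepared = prep(area_of_use)

points = [Point(0.5, 0.5), Point(2.0, 0.0)]
print([prepared.contains(p) for p in points])  # [True, False]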

mwtoews commented 5 years ago

The dataset is also available as a shapefile, ~60 MB uncompressed. Here is the world, subdivided into nine regions defined by IOGP:

[image: world]

Potentially these geographic coverages could be split into the regional datasets for proj-datumgrid, but this could be more trouble than it's worth.

As for potential licensing, the metadata states:

The user assumes the entire risk as to the accuracy and the use of this data. The data may be used, copied and distributed subject to the conditions in the terms of use in http://info.ogp.org.uk/geomatics/CurrentDB.html

(I'll assume the correct link is http://www.epsg.org/TermsOfUse.aspx)

rouault commented 5 years ago

Suggestion by Sandro Furieri: store the coordinate values as decimal degrees * 100, truncated to integer, so as to get a 1e-2 deg precision (~ 1 km). Each X/Y value could then be stored in a signed short which, based on my above 9 MB GeoPackage that used -simplify 1e-2 and doubles, could bring the size of the contours down to about 9 / 4 ≈ 2.3 MB.
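A minimal sketch of that encoding (illustrative helper names, not an actual API):

import struct

def encode_deg(value_deg):
    """Decimal degrees * 100, truncated to integer (fits in a signed
    16-bit value only while |value_deg| * 100 <= 32767)."""
    return int(value_deg * 100)

def decode_deg(value_i16):
    return value_i16 / 100.0

coded = encode_deg(12.345678)
print(coded, decode_deg(coded))        # 1234 12.34, i.e. ~1 km precision
print(len(struct.pack('<h', coded)))   # 2 bytes instead of 8 for a double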

Potentially these geographic coverages could be split into the regional datasets for proj-datumgrid, but this could be more trouble than it's worth.

Seems like more complication to me, indeed.

busstoptaktik commented 5 years ago

So we could store each X/Y value on a signed short

Makes sense, but why do decimal scaling when going for a binary representation? Binary scaling would offer almost double the precision for the same storage (and computational) cost:

Scaling the numbers by 2^16/360 results in a unit ("binary degree" - "bigree", perhaps?) of 1/182 degree, i.e. 0.6 km worst case.

rouault commented 5 years ago

Scaling the numbers by 2^16/360

Caution: the range of longitudes in the EPSG polygons is [-270, 330]. 330 is reached for urn:ogc:def:area:EPSG::4523 (World centred on 150°E) and -270 for urn:ogc:def:area:EPSG::4520 (World centred on 90°W). That still fits in a 16-bit range with 1e-2 deg decimal precision, since 330 - (-270) = 600 < 65536 / 100, but an offset would be needed because 330 * 100 > INT16_MAX. Or, if using a binary representation, 2^16 / (330 - (-270)) gives 1 unit = 1/109 degree, i.e. ~1 km.
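A sketch of the two offset-based encodings over that [-270, 330] range (illustrative helper names; both map the 600-degree span onto 16 bits):

# Longitudes in the EPSG polygons span [-270, 330] degrees.
LON_MIN, LON_MAX = -270.0, 330.0

def encode_decimal(lon):
    """Decimal scaling with offset: 1e-2 deg resolution; fits in 16 bits
    since (330 - -270) * 100 = 60000 < 65536."""
    return int(round((lon - LON_MIN) * 100))

def encode_binary(lon):
    """Binary scaling: map the full range onto 0..65535,
    i.e. about 1/109 degree per unit, ~1 km."""
    return int(round((lon - LON_MIN) * 65535 / (LON_MAX - LON_MIN)))

def decode_binary(code):
    return LON_MIN + code * (LON_MAX - LON_MIN) / 65535

print(encode_decimal(330.0))                  # 60000
print(encode_binary(330.0))                   # 65535
print(decode_binary(encode_binary(12.345)))   # ~12.345, within half a unit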

busstoptaktik commented 5 years ago

Or if using binary representation, 2^16/(330 - -270) => 1 unit = 1. / 109 degree ~ 1km.

So you would need the binary scaling to reach 1 km resolution anyway. If using offsetting, the point-in-polygon test should probably work directly on the offset values, by offsetting the point under test as well.

I think it will become a mess no matter what, which leads me to the following alternative solution, which I suspect will be only slightly less compact, and faster to look up - given that we are satisfied with 1 km resolution:

  1. Rasterize each relevant polygon on a 20000 rows by 40000 columns grid
  2. Run length encode each row (most will be all zeros)
  3. Run length encode the run length encoded rows
  4. Create an index giving the starting address of each row
  5. Run length encode the index

(steps 3 to 5 are really one: you RLE identical rows by giving them the same starting address in the index - note that, since "all zero" rows will be the common case, these can be given start address 0 and handled up front to speed things up. Also note that identical rows don't even need to be consecutive to share the same representation)

When instantiating a transformation needing polygon lookup, decode only the row index: I suppose that in most cases a row will consist of zeros disturbed only by a few runs of ones, so decoding on the fly will be fast.

Also, an additional potential speed up and compression could be obtained by adding a bounding box entry to the data structure, checking that first, and only storing the portion of the global grid inside the BBOX.

Note that the resolution of this representation is a parameter, so if at some point we are not satisfied with 1 km resolution, we could increase it by making the grid larger: the RLE will keep the growth in representation size somewhat under control.

Example:

This rasterized polygon:

000000000000
000000000000
111000001100
001100011000
000111110000
000011100000
001100011000
000000000000
000000000000

would be RLEd as

0 12  (read: start with 0, repeat 12 times)
0 12
1 3 5 2 2 (start with 1, repeat 3, shift to 0, repeat 5, shift, etc.)
0 2 2 3 2 3
0 3 5 4
0 4 3 5
0 2 2 3 2 3 (note: identical to row 4, so it shares its representation)
0 12
0 12

(actually, we don't even need to store the last run length of each row, since we know it will last until the end of the row)

Special-casing the zero rows to index 0, the row index would then read:

 0
 0
 1
 6
12
16
 6   (shared representation with row 4)
 0
 0

Where the top and bottom zero rows would again be RLEd in the disk representation, but not(?) in the PJ (in many cases they may be compressed away entirely by the BBOX-trick, although some zero rows may remain in cases of disjoint parts)
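A minimal sketch of the per-row encoding and on-the-fly lookup described above (Python for illustration; the function names are mine, not an actual PROJ API):

def rle_row(row):
    """Encode one binary raster row as [start_value, run1, run2, ...]."""
    runs = [row[0], 1]
    for prev, cur in zip(row, row[1:]):
        if cur == prev:
            runs[-1] += 1
        else:
            runs.append(1)
    return runs

def value_at(runs, col):
    """Decode a single cell of an RLE'd row on the fly."""
    value, pos = runs[0], 0
    for length in runs[1:]:
        if col < pos + length:
            return value
        pos += length
        value = 1 - value  # binary raster: runs alternate between 0 and 1
    return value  # value of the implicit final run, if its length is omitted

row = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]   # fourth example row above
print(rle_row(row))                  # [0, 2, 2, 3, 2, 3]
print(value_at(rle_row(row), 3))     # 1: that column falls inside the polygon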

The rasterization of the polygons could be part of the datumgrid projects, hence leaving it up to them to decide which selection of resolutions to offer (as in the case of GSHHG, which offers several levels of resolution).

NOTE: While I evidently like this suggestion (otherwise, I would have remained silent), I am not, for the moment, able to back it up with an actual development effort - although it would be fun to do...

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.