Factual / geo

Clojure library for working with geohashes, polygons, and other world geometry
Eclipse Public License 1.0
303 stars 17 forks source link

Support for H3 #22

Closed willcohen closed 6 years ago

willcohen commented 6 years ago
willcohen commented 6 years ago

@worace This PR is ready for review -- I think it now covers a sufficiently full subset of functions available in the H3 library. There's a few remaining helpers that we could later consider but the differences between them what's currently included here are, I think, pretty minor.

worace commented 6 years ago

Looks promising @willcohen. I probably won't be able to give this a super thorough look until after the holiday but a few things stand out that I'd like to understand better:

BTW did you uncover any javadocs for H3 Java while doing this? Or are you just figuring it out from looking at their source?

willcohen commented 6 years ago

Thanks for giving this a first pass.

H3 doesn't operate on geometries at all – only indices of cells, GeoCoords, and nested arrays of GeoCoords. The only thing we need Polygonal for at the moment is the polyfill function, so the protocol could definitely be eliminated. That said, neither Point nor Shapelike fit the bill. H3's polyfill works on an input of n rings of coordinates, with optional internal rings acting as holes, similar to geo's existing linear-ring and polygon. I could instead make polyfill work on a combination of something like GeoHash, LinearRing, RectangleImpl, and so on, but it'd be a conditional function using instance? similar to what ultimately got replaced with the GeoJSON protocols in #12. Part of what suggested to me that it might deserve be a standalone protocol is the fact that a Polygonal interface exists in JTS already.

As best as I can tell, the string versus long question is an implemention detail, sort of. The overall spec only refers to a 64-bit value for the representation of the index. On the one hand, the provided unix CLI commands use strings. On the other hand, java's Long is actually probably closer to the intent of H3Index than an unbounded string, since it actually is 64-bit.

On a personal level, if I had to pick only one implementation between strings and longs, I'd probably go with strings, if only to be more like geohashes. That said, pretty much every single method is implemented for strings as well as longs (though sometimes the string ones are often just wrappers around the Long versions), and the test suite covers both pretty evenly. That makes it hard for me to see how one is favored over the other. The distinction between the two might just be that strings are better though slightly slower if humans are going to see the index values, and longs might be more performant when it all stays internal.

Lastly, no javadocs that I can see – only the docs in H3Core.java itself.

willcohen commented 6 years ago

Re #18 and #24, the difference between String and Long does appear to be one of efficiency. The "with reflection" version of the (extremely rough) benchmarks below is as of 5c0db47, and the "with protocol" version is as of 7913eb1, with the (into []) also temporarily removed on each for edges, to remove that overhead. Removing reflection increases speed by about 5x overall. Operating on Long is consistently faster than String with and without reflection, with ~20% gain on simple operations and a fair bit more on more intensive ones.

This leads me to think it's probably the right move to support both String and Long – it's a little more complicated to implement, but means that users can transparently prioritize either readability or speed depending on their needs. We might think of the Long as more like the GeoHash object itself, and the String as the human-readable representation, with the difference that the H3 library is happy to perform relatively efficient operations on either natively. @worace

With h3->pt:

;; With reflection 
(time
  (dotimes [_ 100000]
    (h3->pt "871f24ac5ffffff")))
"Elapsed time: 551.70143 msecs"

(time
  (dotimes [_ 100000]
    (h3->pt 608533827635118079)))
"Elapsed time: 513.701249 msecs"

;; With protocol
(time
  (dotimes [_ 100000]
    (h3->pt "871f24ac5ffffff")))
"Elapsed time: 104.001963 msecs"

(time
  (dotimes [_ 100000]
    (h3->pt 608533827635118079)))
"Elapsed time: 85.89182 msecs"

With edges:


;; With reflection
(time
  (dotimes [_ 100000]
    (edges "871f24ac5ffffff")))
"Elapsed time: 418.895697 msecs"

(time
  (dotimes [_ 100000]
    (edges 608533827635118079)))
"Elapsed time: 381.749192 msecs"

;; With protocol
(time
  (dotimes [_ 100000]
    (edges "871f24ac5ffffff")))
"Elapsed time: 96.562609 msecs"

(time
  (dotimes [_ 100000]
    (edges 608533827635118079)))
"Elapsed time: 55.966168 msecs"
willcohen commented 6 years ago

@worace have you given any more thought to how you’d like to proceed here? After taking some time to think about it, I still think doing both string and long makes sense — since one’s more human readable and one’s closer to the 64-bit index they both seem to have their uses.

Finally, if we do keep the new protocol, it’s probably true that Polygonal isn’t the right name. Per JTS, a linestring isn’t actually polygonal. Maybe it gets called Polygonlike, with the idea that the first function it even implements is what actually makes something Polygonal pop out?

worace commented 6 years ago

Yeah i think keeping both string and long is fine. I think Polygonal is fine; I'd kind of prefer to move it into the h3 namespace since it's mostly a concern for that module. What do you think?

willcohen commented 6 years ago

Sounds good -- rebased to master and moved Polygonal to h3.

willcohen commented 6 years ago

After putting this through its paces on real projects a little more over the past week, I also am realizing that geo.io has a blind spot when it comes to multiple geometries. Mapping H3's to-jts to the output from H3's hex-range means that sequences of Geometries get generated. Since QGIS and other tools don't operate nicely on plain GeometryCollection geojsons, it's important to be able to turn a GeometryCollection into sets of feature maps, so that they can be written out using to-geojson-feature-collection. Accordingly, I extended the (to-features) in geo.io to also work with JTS's Geometry and GeometryCollection, since it's a pretty handy normalizing function we've already got.

The other big change since last week was to have h3's to-jts return polygons instead of linear rings, to better match the geohash library's functionality.

@worace

worace commented 6 years ago

have h3's to-jts return polygons instead of linear rings, to better match the geohash library's functionality

Seems good, polygons are probably more intuitive.

Accordingly, I extended the (to-features) in geo.io to also work with JTS's Geometry and GeometryCollection

I guess this is ok but it does make me wonder how or if we should handle the actual GeometryCollection GeoJSON type, or the Multi{Point,Polygon,LineString} types.

Does QGIS handle MultiPolygon any better?

worace commented 6 years ago

Otherwise I think this is looking pretty good. The main reason I have held off on merging is to figure out how I'm going to handle the next (2.0) release. I think there's already quite a lot of big changes for that release, so I'd like to hold this for a subsequent 2.1 release. More info in https://github.com/Factual/geo/pull/22

worace commented 6 years ago

On the Geom/Feature-Collection thing...I'm not really opposed to just defaulting to FeatureCollection even if GeometryCollection is technically more correct since I think it's what people tend to use more often. But I think it at least warrants considering whether there are any common/legit GeometryCollection use-cases that we aren't thinking of.

willcohen commented 6 years ago

Okay. If you want to hold off on this until 2.1, do you think 2.1 would proceed relatively soon after 2.0, or would you want to let this bake for a more extended period of time first?

I also think it's totally fine to default to FeatureCollection for most situations -- the question I have is just around what to actually DO with the GeometryCollections. We don't really have ways of running to-geojson and so on on sequences of other Geometries (or Geometries that should become Features), so it currently seems to make some sense to have some GeometryCollection support if only to help combine those sequences into single objects that can be themselves manipulated.

Current state of affairs with this PR:

A plain MultiPolygon works well in QGIS, since it's just a geometry:

(spit "/Users/foo/Desktop/multipoly.geojson"
      (geo.io/to-geojson
        (geo.jts/multi-polygon [(geo.jts/polygon-wkt [[-1 -1 11 -1 11 11 -1 -1]])
                                (geo.jts/polygon-wkt [[0 0 10 0 10 10 0 0]])])))

A GeometryCollection of that same thing works poorly:

(spit "/Users/foo/Desktop/geocoll.geojson"
      (geo.io/to-geojson
        (geo.jts/geometry-collection 
          [(geo.jts/multi-polygon [(geo.jts/polygon-wkt [[-1 -1 11 -1 11 11 -1 -1]])
                                   (geo.jts/polygon-wkt [[0 0 10 0 10 10 0 0]])])])))

That GeometryCollection converted into a set of feature maps and FeatureCollection works well again.

(spit "/Users/foo/Desktop/featurecoll.geojson"
  (geo.io/to-geojson-feature-collection
    (geo.io/to-features
      (geo.jts/geometry-collection 
        [(geo.jts/multi-polygon [(geo.jts/polygon-wkt [[-1 -1 11 -1 11 11 -1 -1]])
                                 (geo.jts/polygon-wkt [[0 0 10 0 10 10 0 0]])])]))))
worace commented 6 years ago

If you want to hold off on this until 2.1, do you think 2.1 would proceed relatively soon after 2.0

That would be my plan. I don't see any reason not to do a release shortly after.

willcohen commented 6 years ago

@worace How would you like modifications to this PR to go, now that you've merged the branch? My last commit happened after you merged to release/2.1.0. Should I change the PR target to release/2.1.0, fix the last commit to apply to 2.1.0-rc-0, and keep adding additional changes to oush to that branch?

I've started putting 2.1.0-rc-0 into use and trying it out and in polyfills with edge-case sized geometries and fine resolutions the JVM crashes when it calls to C. They're working on some fixes but we should definitely try to make sure that we aren't crashing the JVM here, I think.

Regardless of the eventual upstream solution, I think the workaround solution for here should be first checking whether the polyfill will produce too many hexagons and if so, sufficiently subdivide the JTS geometry as needed and run the polyfill on the smaller shapes. Adjacent polyfills should return non-overlapping hexagonal indices so the only thing the user will see is a slower polyfill, but it means that the h3 namespace now has need for something like hex-area and a few other JTS helpers to do this (like centroids and envelopes) so the subdivision can proceed efficiently.

willcohen commented 6 years ago

@worace Polyfill is now much more cautious when it runs. There's two magic numbers defined at the top of h3. When safe-polyfill-hexagon-maximum is exceeded by polyfill's estimate (based on how h3's maxPolyfillSize works), the polygon is subdivided into pieces and polyfill is run individually on each of those shapes recursively. The polyfill tries to uncompact at the end, but only if safe-uncompact-hexagon-maximum isn't exceeded. This generally should only be an issue when people are polyfilling huge shapes at a very high resolution -- the max checks are still always run, but heavy recursion or compacted polyfill returns generally only come into play when the operations are getting large.

That said, my choices for these magic numbers is based only on what my laptop's stock-ish openjdk JVM seems okay with. If the current settings still lead to crashes, we can always lower them.

Many of these new functions in the h3 namespace are internal, because they duplicate current internal functionality of h3's C core. If h3 changes its polyfill functionality to be more efficient, we'll want to be able to do the same without breaking the API.

worace commented 6 years ago

My last commit happened after you merged to release/2.1.0. Should I change the PR target to release/2.1.0, fix the last commit to apply to 2.1.0-rc-0

I think leave it pointing to master and we can merge from master to the 2.1.0 release branch as needed.

I got to try this out a bit over the last few days for a project and it seems to be working well. I did not run into any of the large-geometry/fine-resolution issues you are describing but glad that you caught them. Bummer that we have to deal with that but but oh well.

I'm not sure we've fully addressed the GeomCollection / GeoJSON issues you raised, but as I understand it, it's not any better or worse off after this PR, so I think we can address it as a separate issue when we figure out what to do.

willcohen commented 6 years ago

Okay. Switched it back to master. Note that it's got release/2.1.0 merged in there one time because this branch was starting to diverge and i wanted to be synced up to 2.1.0-rc-0 as a baseline. If you'd like me to roll that merge back, let me know. I figure I'll push pause on this for right now until 2.0 gets released so there's not too many moving pieces, but as a discrete PR I think this is basically done.

I agree that it might make sense to deal with the geomcollection stuff in a separate request, since that should be a smaller change. As long as 2.1 waits until we resolve it, it's fine. This PR does have the geometries function, which does affect geometrycollections on some level, but I suspect that'll be useful no matter what the final decision is. @worace