Open asinghvi17 opened 3 months ago
We also know when it contains a small union, and a function with a conditional like this should allow both optimisations:
function wrap_wkb(::Type{Traits}, ::Val{HasZ}, ::Val{HasM}, ::Val{BigEndian}, wkb) where {Traits, HasZ, HasM, BigEndian}
if PolygonTrait <: Traits && _get_wkb_typenum(wkb) == POLYGON_CODE
return WKBPolygon{HasZ,HasM,BigEndian}(wkb)
elseif MultiPolygonTrait <: Traits && _get_wkb_typenum(wkb) == MULTIPOLYGON_CODE
return WKBMultiPolygon{HasZ,HasM,BigEndian}(wkb)
elseif MultiPoint <: Traits && _get_wkb_typenum(wkb) == MULTIPOINT_CODE
return WKBMultiPoint{HasZ,HasM,BigEndian}(wkb)
... # Repeat for all GeoInterface traits
else
error("type code not recognised")
end
end
The compiler should be able to delete the code for branches not in the Traits
type union.
The idea is this would be used on a vector/column of geometries after a function barrier where the types were worked out for the whole column before the function barrier. In GeoPacakage this can be done with union
on the geometry type column as Integers, then a conversion to types. So one point of instability.
Nested geoms can use views into the wkb, maybe with some type of runtime information to deal with the fact some data is missing from child objects (like endianness) that is present if they're at the top layer.
This should be able to mean all the cost is at the top layer and wrap_wkb
is essentially free wrappers over the wkb data. The cost will increase as the type union gets larger and we lose first stability and then small union optimisations.
(The trait types should probably go in a wrapper rather than a raw type union, as in GeometryOps, but I left it simple for clarity)
As you said @asinghvi17 we would just error in the nighmare datasets where e.g. endianness or HasZ changed down the column.
Then the type parameter BigEndian
would decide if we need to lazily apply bswap
as we return point coordinates. So it would have no cost if the conversion isn't needed.
That sounds good! Another thing - I am not sure if this is the case, but a quick rematerializer for linestrings using reinterpret
may also be useful, since it would batch the work of point
.
For example, this works:
julia> reinterpret(Tuple{Float64, Float64}, rand(UInt8, 32))
2-element reinterpret(Tuple{Float64, Float64}, ::Vector{UInt8}):
(3.2263176797028075e170, 1.0680129991772126e-122)
(-8.397436769436745e-141, 6.732599484892397e-141)
and we could do the same to get a quick reinterpretation of an entire linestring, rather than going point-by-point. I haven't benchmarked this for an entire linestring, but it's 13% slower (1.7ns vs 1.5ns) than indexing into a pre-constructed vector of tuples - which is not bad!
Oh cool. Absolutely. Is it already stored like that in wkb?
Things like this is where knowing HasZ
and HasM
from the start will really help.
You need to parse some stuff for the dimensions and the number of coordinates, but sure. I use reinterpret all over (in serializing mostly), so this would work fine.
I think we want to do that outside the function barrier and then maybe check at each geometry that it matches, and either error or reinterpret
At least for values of "maybe has M" or "maybe has Z" it would have to be dynamic, but when it's defined that the geometry definitely has M or Z, that approach should be fine AFAICT.
We would have to allow the user to choose whether to use a strict or dynamic parser in this case, since it's possible that people made flawed datasets.
Totally. We could have parser=:strict
as the default and error with a messge "Dataset has inconsistent foo, try theparser=:dynamic
keyword"
ls_as_wkb
is a WKB of 300,000 points
# Parsing approach 2 - reduce to reinterpret
@benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb))))
# BenchmarkTools.Trial: 10000 samples with 334 evaluations.
# Range (min … max): 258.111 ns … 3.375 μs ┊ GC (min … max): 0.00% … 89.28%
# Time (median): 274.826 ns ┊ GC (median): 0.00%
# Time (mean ± σ): 280.685 ns ± 97.172 ns ┊ GC (mean ± σ): 1.11% ± 2.94%
# ▁▄█▇▅▂ ▁ ▂▁
# ▁▁▂▃▃▃▄▃▅███████▇████▆▆▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
# 258 ns Histogram: frequency by time 326 ns <
# Memory estimate: 160 bytes, allocs estimate: 4.
which is not bad! GFT.WellKnownBinary(GFT.Geom(), ls_as_wkb) |> GI.getpoint |> collect
is still running as I write this.
ls_as_wkb
is a WKB of 300,000 points
Nice! Can we create a repo, or attach a few of these binary test/benchmark files to a release, so we can add this to CI, and test ourselves as well?
In this case it's as simple as:
import GeoFormatTypes as GFT, WellKnownGeometry as WKG, GeoInterface as GI
tups = tuple.(rand(300_000), rand(300_000))
geoms_as_points = GFT.val.(WKG.getwkb.(GI.Point.(tups)))
geoms_as_linestring = GFT.val(WKG.getwkb(GI.LineString(tups)))
and try the different parsing approaches from there.
# parse geoms as points, each WKB individually
@benchmark begin
map($(geoms_as_points)) do geom
only(reinterpret(Tuple{Float64, Float64}, view(geom, (5+1):length(geom)))) # this has to be parsed differently in EWKB and GeoPkg wkb
end
end # median time 3ms v/s regular parsing 300ms
# parse the whole linestring
@benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb))))
# median time 290ns vs unknown (still running)
# we could even drop collect here
(there was an actual test geopkg file for reproducible coordinates, but I figure that doesn't matter so much for parsing).
Looks like GI.getpoint
hangs because it's calling ncoord
every point...
[2] getgeom(T::GeoInterface.LineStringTrait, geom::GeoFormatTypes.WellKnownBinary{GeoFormatTypes.Geom, Vector{UInt8}}, i::Int64)
@ WellKnownGeometry ~/.julia/packages/WellKnownGeometry/t1n94/src/wkb.jl:190
We could optimize that pretty easily no?
If you know it's a line string, or know the dimension, yes, absolutely.
import GeoFormatTypes as GFT, WellKnownGeometry as WKG, GeoInterface as GI tups = tuple.(rand(300_000), rand(300_000)) geoms_as_points = GFT.val.(WKG.getwkb.(GI.Point.(tups))) geoms_as_linestring = GFT.val(WKG.getwkb(GI.LineString(tups)))
It seems we haven't defined ncoord on things other than Point in GeoInterface? Did you fix this locally?
julia> WKG.getwkb(GI.LineString(tups))
ERROR: MethodError: no method matching ncoord(::GeoInterface.LineStringTrait, ::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})
Closest candidates are:
ncoord(::Any)
@ GeoInterface ~/.julia/packages/GeoInterface/tuF7K/src/interface.jl:104
ncoord(::Union{GeoInterface.GeometryCollectionTrait, GeoInterface.LineStringTrait, GeoInterface.LinearRingTrait, GeoInterface.MultiLineStringTrait, GeoInterface.MultiPointTrait, GeoInterface.MultiPolygonTrait, GeoInterface.PointTrait, GeoInterface.PolygonTrait}, ::GeoFormatTypes.WellKnownBinary{GeoFormatTypes.Geom, <:AbstractVector{UInt8}})
@ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:164
ncoord(::Union{GeoInterface.GeometryCollectionTrait, GeoInterface.LineStringTrait, GeoInterface.LinearRingTrait, GeoInterface.MultiLineStringTrait, GeoInterface.MultiPointTrait, GeoInterface.MultiPolygonTrait, GeoInterface.PointTrait, GeoInterface.PolygonTrait}, ::GeoFormatTypes.WellKnownText{GeoFormatTypes.Geom})
@ WellKnownGeometry ~/code/WellKnownGeometry/src/wkt.jl:136
Stacktrace:
[1] ncoord(geom::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})
@ GeoInterface ~/.julia/packages/GeoInterface/tuF7K/src/interface.jl:104
[2] _getwkb!(data::Vector{…}, type::GeoInterface.LineStringTrait, geom::GeoInterface.Wrappers.LineString{…}, first::Bool, repeat::Bool)
@ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:113
[3] getwkb!
@ ~/code/WellKnownGeometry/src/wkb.jl:129 [inlined]
[4] getwkb(geom::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})
@ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:56
Some type information was truncated. Use `show(err)` to see complete types.
Huh, no this just worked for me...let me try it in a fresh session!
No, I'm on latest released GeoInterface and WKG v0.2.2 (one patch version behind)
Edit: trying this on WKG v0.2.3 fails as you posted.
~Ouch, my bad then it seems.~ In 0.2.3 I call ncoord for the first time on non-point geoms it seems, to indeed check is3d/ismeasured stuff. GeoInterface.wrappers don't implement that yet, but it should (and we need to update our tests there too?).
Monkeypatching
GI.ncoord(::GI.Wrappers.WrapperGeometry{false, false}) = 2
GI.ncoord(::GI.Wrappers.WrapperGeometry{false, true}) = 3
GI.ncoord(::GI.Wrappers.WrapperGeometry{true, false}) = 3
GI.ncoord(::GI.Wrappers.WrapperGeometry{true, true}) = 4
fixes it for now, but this probably needs a better check. Maybe GI.is3d
and GI.ismeasured
?
ls_as_wkb
is a WKB of 300,000 points# Parsing approach 2 - reduce to reinterpret @benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb)))) # BenchmarkTools.Trial: 10000 samples with 334 evaluations. # Range (min … max): 258.111 ns … 3.375 μs ┊ GC (min … max): 0.00% … 89.28% # Time (median): 274.826 ns ┊ GC (median): 0.00% # Time (mean ± σ): 280.685 ns ± 97.172 ns ┊ GC (mean ± σ): 1.11% ± 2.94% # ▁▄█▇▅▂ ▁ ▂▁ # ▁▁▂▃▃▃▄▃▅███████▇████▆▆▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃ # 258 ns Histogram: frequency by time 326 ns < # Memory estimate: 160 bytes, allocs estimate: 4.
which is not bad!
GFT.WellKnownBinary(GFT.Geom(), ls_as_wkb) |> GI.getpoint |> collect
is still running as I write this.
For what it's worth, getpoint
, gets you a point, as in creating a new WKB Point. You probably want coordinates
instead, if it is written roughly like this:
function GI.coordinates(
T::GI.LineStringTrait,
geom::WKBtype,
)
size = headersize + numsize
ncoord = GI.ncoord(geom)
offset = typesize(wkbsubtype(T), geom[size+1:end], ncoord)
reinterpret(NTuple{ncoord,Float64}, @view geom.val[size+1:size+offset*GI.ngeom(geom)])
end
Which while still a bit unstable with ncoord, is already doable with 8.958 μs (15 allocations: 469.34 KiB).
@asinghvi17 do you want to PR your patch to GI? I must have forgotten to add it ( or maybe we never had that, but we should)
@evetion does getpoint
have to return a WKB Point? Why not a Tuple
? coordinates
is not generic because it often allocates.
I thin we need getpoint(geom)
to always be super fast or there is no generic fast way to get points.
Yeah, it probably needs to be a tuple. And if the type has noop ncoord it should be pretty fast, even without refactoring it everywhere.
It would be great to have a parameterized parser object for speed. Things we could specialize on are:
bswap
on the result Float64 or Int)