Closed tumluliu closed 6 months ago
let crs = r#"THAT EXTREMELY LONG PROJJSON STRING FOR EPSG:4326"#; // is it really necessary to provide such a long string instead of only "EPSG:4326"?
The spec currently defines PROJJSON as the way to materialize CRS. In particular, PROJJSON has some nice benefits around being able to interpret the CRS without access to PROJ.
The CRS is stored only once per column in a table, so it's very little overhead.
- is that really true that the created MixedGeometryArray variable geom_arr only occupies 52 bytes of contiguous memory? The number makes more or less sense to me: 6 float64 numbers + some metadata is roughly 52 bytes, which is really condensed.
Yes.
- are the geotype, i.e. LINESTRING in this example, and the srid, i.e. EPSG:4326 in that super long PROJJSON format, also included in these 52 bytes?
No. Though here the API is mixed, that metadata is stored on the Arrow field, separately from the array buffers themselves.
- is there some way to store these 52 bytes into a Vec<u8> so that I can see the binary content? with std::io::Write maybe?
Well, you can export to an Arrow Arc<dyn Array> (here a UnionArray) and traverse its buffers. But those 52 bytes aren't guaranteed to be a single Vec<u8>. You can refer to the Arrow Dense Union docs, which define how it's laid out in memory.
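To make the "not a single Vec<u8>" point concrete, here is a minimal sketch of how a dense union resolves an element. It is a toy model, not the arrow crate's UnionArray: plain Vecs stand in for Arrow buffers, the names DenseUnion, Geom and example_union are hypothetical, and the linestring child is nested for brevity (a real Arrow list child would use its own offsets buffer plus a coordinate buffer). The part taken from the Arrow columnar format is that a dense union keeps one 8-bit type id and one 32-bit offset per element, each in its own buffer, next to the child arrays.

```rust
/// Toy model of an Arrow dense union with two children.
/// Each field is a separate allocation, just as each Arrow buffer is.
struct DenseUnion {
    type_ids: Vec<i8>,               // one 8-bit type id per element
    offsets: Vec<i32>,               // one 32-bit index into the selected child
    points: Vec<[f64; 2]>,           // child 0 (hypothetical): points
    linestrings: Vec<Vec<[f64; 2]>>, // child 1 (hypothetical): linestrings, nested for brevity
}

#[derive(Debug, PartialEq)]
enum Geom<'a> {
    Point(&'a [f64; 2]),
    LineString(&'a [[f64; 2]]),
}

impl DenseUnion {
    /// Look up element `i`: the type id picks the child, the offset picks the slot in it.
    fn get(&self, i: usize) -> Option<Geom<'_>> {
        let child_index = *self.offsets.get(i)? as usize;
        match *self.type_ids.get(i)? {
            0 => self.points.get(child_index).map(Geom::Point),
            1 => self
                .linestrings
                .get(child_index)
                .map(|l| Geom::LineString(l.as_slice())),
            _ => None,
        }
    }
}

/// Two elements: a three-point linestring followed by a point.
fn example_union() -> DenseUnion {
    DenseUnion {
        type_ids: vec![1, 0],
        offsets: vec![0, 0],
        points: vec![[5.0, 5.0]],
        linestrings: vec![vec![[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]],
    }
}
```

Calling example_union().get(0) walks three separate allocations (type ids, offsets, the linestring child), which is why there is no single start address that covers the whole array.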
- where can I get the starting address of these 52 bytes?
There are a few different ways, but I really don't think you want this.
- if I want to "share" this memory block to another language, e.g. C or Python, and the other language would probably need to persist this binary chunk onto a disk so that in the future it can restore it back into the memory, what would be the best way for transferring/sharing these 52 bytes?
If you're working with another language that can access the same memory space, e.g. C or Python, Arrow FFI is the publicly recommended approach. I don't think you should need to write any unsafe blocks yourself; the upstream Arrow crate should manage the unsafe parts for you.
Otherwise you want to use Arrow IPC, which will maintain the field metadata as well. See arrow-ipc or write_ipc_stream.
Thanks very much @kylebarron for your very detailed reply! I am kinda looking for a way to bundle the array buffer and metadata together (geometry + srid), have them stored in a contiguous memory chunk, and share it with C (or Python). The separation of FFI_ArrowArray and FFI_ArrowSchema bothered me a bit from the C side. I actually only need a persistable and restorable memory block (as condensed as possible) without the necessity of interpreting its internal details in C. Anyway, I will continue to try with arrow::ffi. Meanwhile, I will also try geoarrow's write_ipc_stream, which requires creating the GeoTable with a RecordBatch, if my understanding is correct. I hope the GeoTable way won't bring too much extra overhead. I will post some updates here if I have any success. Thanks again!
have them stored in a contiguous memory chunk and share it with C
If you don't want to use FFI, use Arrow IPC. This is what it's there for. You can serialize to a single Vec<u8> which stores both the schema and all array chunks.
The separation of FFI_ArrowArray and FFI_ArrowSchema bothered me a bit from the C side
Why did it bother you? You can describe many chunks of an array with only a single schema object, which is much more efficient than copying the schema with every array chunk.
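The "one schema, many chunks" idea, and the idea of putting both into a single Vec<u8>, can be sketched with a toy length-prefixed framing. This is explicitly not the Arrow IPC wire format and not the geoarrow API: write_frame, read_frame, write_stream and read_stream are hypothetical helpers, the schema is just a string, and each chunk is a flat list of f64 values. It only illustrates why describing the data once and then appending chunks is cheap, and how one contiguous byte vector can be persisted and restored without the reader interpreting anything beyond the framing.

```rust
use std::convert::TryInto;

/// Append a little-endian u32 length prefix followed by the payload bytes.
fn write_frame(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Read one frame starting at `pos`; return the payload and the position after it.
fn read_frame(buf: &[u8], pos: usize) -> Option<(&[u8], usize)> {
    let len_bytes: [u8; 4] = buf.get(pos..pos + 4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    let start = pos + 4;
    let payload = buf.get(start..start + len)?;
    Some((payload, start + len))
}

/// Write the schema exactly once, then one frame per chunk of f64 values.
fn write_stream(schema: &str, chunks: &[Vec<f64>]) -> Vec<u8> {
    let mut out = Vec::new();
    write_frame(&mut out, schema.as_bytes());
    for chunk in chunks {
        let mut bytes = Vec::with_capacity(chunk.len() * 8);
        for v in chunk {
            bytes.extend_from_slice(&v.to_le_bytes());
        }
        write_frame(&mut out, &bytes);
    }
    out
}

/// Restore the schema string and every chunk from the contiguous byte vector.
fn read_stream(buf: &[u8]) -> Option<(String, Vec<Vec<f64>>)> {
    let (schema, mut pos) = read_frame(buf, 0)?;
    let schema = String::from_utf8(schema.to_vec()).ok()?;
    let mut chunks = Vec::new();
    while pos < buf.len() {
        let (payload, next) = read_frame(buf, pos)?;
        let chunk: Vec<f64> = payload
            .chunks_exact(8)
            .map(|b| f64::from_le_bytes(b.try_into().unwrap()))
            .collect();
        chunks.push(chunk);
        pos = next;
    }
    Some((schema, chunks))
}
```

With this framing, adding another chunk costs 4 bytes of prefix plus the data; the schema bytes are never repeated. For real data, Arrow IPC plays this role and also carries the field metadata.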
the GeoTable way won't bring too much extra overhead.
The overhead is... on the order of two vec allocations?
Well, regarding the overhead concern I mentioned, in most of my cases there will be only one single simple OGC geometry object without any extra column or attribute to share with C. I've considered simply using ewkb/wkb bytes, but gave up because it's too sparse (122 bytes for the above example linestring) and started trying with geoarrow. In the future, there might be some use cases containing a table- or subset-like data package, but I'm unsure at the moment. Anyway, I think I will go with Arrow IPC and GeoTable.
I've considered simply using ewkb/wkb bytes, but gave up because it's too sparse
I'm surprised that EWKB would be too sparse... it's designed to be as compact as possible. It's not even aligned on 8 bytes.
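For reference, the EWKB size of a small linestring can be worked out by hand. The sketch below assumes a 2D linestring with three points (the "6 float64 numbers" mentioned earlier) and SRID 4326; the coordinates themselves are made up, and linestring_ewkb and to_hex are hypothetical helpers rather than any library's API. The binary form is 1 (byte order) + 4 (type with SRID flag) + 4 (SRID) + 4 (point count) + 3 × 16 (coordinates) = 61 bytes. The hex text form is twice that, 122 characters, which happens to match the figure quoted above; whether the 122 bytes were measured on hex text is an assumption the thread does not confirm.

```rust
/// Encode a 2D LineString as little-endian EWKB with an embedded SRID.
fn linestring_ewkb(srid: u32, points: &[[f64; 2]]) -> Vec<u8> {
    const WKB_LINESTRING: u32 = 2;
    const EWKB_SRID_FLAG: u32 = 0x2000_0000;

    let mut out = Vec::with_capacity(13 + 16 * points.len());
    out.push(1u8); // byte order marker: 1 = little-endian
    out.extend_from_slice(&(WKB_LINESTRING | EWKB_SRID_FLAG).to_le_bytes());
    out.extend_from_slice(&srid.to_le_bytes());
    out.extend_from_slice(&(points.len() as u32).to_le_bytes());
    for p in points {
        out.extend_from_slice(&p[0].to_le_bytes()); // x
        out.extend_from_slice(&p[1].to_le_bytes()); // y
    }
    out
}

/// Render bytes as uppercase hex text, two characters per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}
```

For example, linestring_ewkb(4326, &[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]) yields 61 bytes, and to_hex of that result yields a 122-character string.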
Anyways, I think the question has been answered and I'll close this.
Dear geoarrow and arrow experts, I am trying to understand the memory usage and layout of a geoarrow object, e.g. a MixedGeometryArray object, especially when passing its memory block between different languages. I created a minimal example in Rust:
And I got this output:
Then my questions are basically around these 52 bytes:
1. is that really true that the created MixedGeometryArray variable geom_arr only occupies 52 bytes of contiguous memory? The number makes more or less sense to me: 6 float64 numbers + some metadata is roughly 52 bytes, which is really condensed.
2. are the geotype, i.e. LINESTRING in this example, and the srid, i.e. EPSG:4326 in that super long PROJJSON format, also included in these 52 bytes?
3. is there some way to store these 52 bytes into a Vec<u8> so that I can see the binary content? with std::io::Write maybe?
4. where can I get the starting address of these 52 bytes?
5. if I want to "share" this memory block to another language, e.g. C or Python, and the other language would probably need to persist this binary chunk onto a disk so that in the future it can restore it back into the memory, what would be the best way for transferring/sharing these 52 bytes?
I have some potential solutions to 5:
a) via Arrow FFI (as Kyle kindly suggested in another ticket, and sorry that I'm still somehow stuck here...) - this is hard :/ and I try to avoid unsafe code blocks
b) write the raw 52-byte memory chunk into a Vec<u8> and send it to the other language
c) via Arrow's Buffer
d) via geoarrow's RecordBatch and GeoTable
Any suggestions or hints will be very appreciated. Thanks a lot!!