geoarrow / geoarrow-rs

GeoArrow in Rust, Python, and JavaScript (WebAssembly) with vectorized geometry operations
http://geoarrow.org/geoarrow-rs/
Apache License 2.0

Question about geoarrow memory layout and inter-language communication #623

Closed. tumluliu closed this issue 6 months ago.

tumluliu commented 6 months ago

Dear geoarrow and arrow experts, I am trying to understand the memory usage and layout of a geoarrow object, e.g. a MixedGeometryArray object, especially when passing its memory block between different languages. I created a minimal example in Rust:

// Imports for this snippet (the exact geoarrow module paths may differ
// between versions of the crate):
use std::sync::Arc;
use arrow_array::builder::StringBuilder;
use geoarrow::array::{metadata::{ArrayMetadata, Edges}, MixedGeometryArray};
use serde_json::Value;

let wkt_geoms = ["LINESTRING(-71.150281 42.248729,-71.150837 42.249113,-71.151144 42.24932)"];
let mut builder = StringBuilder::new();
wkt_geoms.iter().for_each(|s| builder.append_value(s));
let arr = builder.finish();
let crs = r#"THAT EXTREMELY LONG PROJJSON STRING FOR EPSG:4326"#; // is it really necessary to provide such a long string instead of only "EPSG:4326"?
let meta = Arc::new(ArrayMetadata {crs: Some(Value::String(crs.to_string())), edges: Some(Edges::Spherical)});
let geom_arr = MixedGeometryArray::<i32>::from_wkt(&arr, Default::default(), meta, false).unwrap();
let s = geom_arr.num_bytes();
println!("geom_arr.num_bytes(): {s}");

And I got this output:

Finished dev [unoptimized + debuginfo] target(s) in 1.58s
  Running `target/debug/geoarrow-rs-test`
geom_arr.num_bytes(): 52

Then my questions are basically around these 52 bytes:

  1. is it really true that the created MixedGeometryArray variable geom_arr only occupies 52 bytes of contiguous memory? The number more or less makes sense to me: 6 float64 numbers plus some metadata is roughly 52 bytes, which is really compact.
  2. are the geotype, i.e. LINESTRING in this example, and the srid, i.e. EPSG:4326 in that super long PROJJSON format, also included in these 52 bytes?
  3. is there some way to store these 52 bytes in a Vec<u8> so that I can see the raw binary? With std::io::Write maybe?
  4. where can I get the starting address of these 52 bytes?
  5. if I want to "share" this memory block with another language, e.g. C or Python, and the other language would probably need to persist this binary chunk onto a disk so that it can later restore it back into memory, what would be the best way for transferring/sharing these 52 bytes?

I have some potential solutions to 5:

a) via Arrow FFI (as Kyle kindly suggested in another ticket, and sorry that I'm still somehow stuck here...) - this is hard :/ and I try to avoid unsafe code blocks
b) write the raw 52-byte memory chunk into a Vec<u8> and send it to the other language
c) via Arrow's Buffer
d) via geoarrow's RecordBatch and GeoTable

Any suggestions or hints would be very much appreciated. Thanks a lot!!

kylebarron commented 6 months ago

let crs = r#"THAT EXTREMELY LONG PROJJSON STRING FOR EPSG:4326"#; // is it really necessary to provide such a long string instead of only "EPSG:4326"?

The spec currently defines PROJJSON as the way to materialize the CRS. In particular, PROJJSON has the nice benefit that the CRS can be interpreted without access to PROJ.

The CRS is stored only once per column in a table, so it's very little overhead.

  1. is it really true that the created MixedGeometryArray variable geom_arr only occupies 52 bytes of contiguous memory? The number more or less makes sense to me: 6 float64 numbers plus some metadata is roughly 52 bytes, which is really compact.

Yes.

  2. are the geotype, i.e. LINESTRING in this example, and the srid, i.e. EPSG:4326 in that super long PROJJSON format, also included in these 52 bytes?

No. Though the API is admittedly a bit mixed here, that metadata is stored on the Arrow field, separately from the array buffers themselves.
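As a rough sketch of where that metadata lives (the extension name, metadata keys, and JSON shape follow the GeoArrow spec for a LineString column; the storage DataType below is just a placeholder):

use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // Stand-in for the full PROJJSON document.
    let projjson = r#"{"type":"GeographicCRS","name":"WGS 84"}"#;

    // GeoArrow puts the geometry type and CRS into Arrow *field* metadata,
    // under the standard Arrow extension-type keys, once per column.
    let metadata = HashMap::from([
        ("ARROW:extension:name".to_string(), "geoarrow.linestring".to_string()),
        (
            "ARROW:extension:metadata".to_string(),
            format!(r#"{{"crs":{},"edges":"spherical"}}"#, projjson),
        ),
    ]);

    // DataType::Binary is only a placeholder; the real GeoArrow LineString
    // storage type is a list of interleaved coordinates.
    let field = Field::new("geometry", DataType::Binary, false).with_metadata(metadata);
    println!("{:#?}", field.metadata());
}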

  3. is there some way to store these 52 bytes in a Vec<u8> so that I can see the raw binary? With std::io::Write maybe?

Well, you can export to an Arrow Arc<dyn Array> (here a UnionArray) and traverse its buffers. But those 52 bytes aren't guaranteed to be contiguous in a single Vec<u8>. You can refer to the Arrow Dense Union docs, which define how it's laid out in memory.
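For illustration, a small sketch of walking those buffers through the arrow-rs ArrayData API; the Int32Array here is only a stand-in for the exported geometry array, and the geoarrow-to-arrow conversion itself is left out:

use arrow::array::{Array, ArrayData, Int32Array};

// Recursively print every Arrow buffer of an array (type ids, offsets,
// child coordinate buffers, ...). For a MixedGeometryArray exported to a
// dense UnionArray, the ~52 bytes are spread across several such buffers
// rather than sitting in one contiguous block.
fn dump_buffers(data: &ArrayData, indent: usize) {
    for buf in data.buffers() {
        println!("{}buffer: {} bytes at {:p}", " ".repeat(indent), buf.len(), buf.as_ptr());
    }
    for child in data.child_data() {
        dump_buffers(child, indent + 2);
    }
}

fn main() {
    // Stand-in array; in this issue's example this would be the Arc<dyn Array>
    // obtained from the MixedGeometryArray.
    let arr = Int32Array::from(vec![1, 2, 3]);
    dump_buffers(&arr.to_data(), 0);
}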

  4. where can I get the starting address of these 52 bytes?

There are a few different ways, but I really don't think you want this.

  5. if I want to "share" this memory block with another language, e.g. C or Python, and the other language would probably need to persist this binary chunk onto a disk so that it can later restore it back into memory, what would be the best way for transferring/sharing these 52 bytes?

If you're working with another language that can access the same memory space, e.g. C or Python, Arrow FFI is the generally recommended approach. I don't think you should need to write any unsafe blocks yourself; the upstream Arrow crate should manage the unsafe parts for you.
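A minimal sketch of the export side using the safe helper in arrow-rs (Int32Array again stands in for the exported geometry array; the consumer rebuilds the array from the two structs with the matching import function on its side):

use arrow::array::{Array, Int32Array};
use arrow::ffi::to_ffi;

fn main() {
    // Stand-in for the Arc<dyn Array> you'd get from the geoarrow array.
    let arr = Int32Array::from(vec![1, 2, 3]);

    // to_ffi is safe to call: it builds the two Arrow C Data Interface structs
    // (ArrowArray + ArrowSchema) and wires up their release callbacks for you.
    let (ffi_array, ffi_schema) = to_ffi(&arr.to_data()).unwrap();

    // A consumer (C, Python via pyarrow, ...) is handed pointers to these two
    // structs and can rebuild the array zero-copy.
    println!(
        "ArrowArray struct at {:p}, ArrowSchema struct at {:p}",
        &ffi_array as *const _,
        &ffi_schema as *const _
    );
}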

Otherwise you want to use Arrow IPC, which will maintain the field metadata as well. See the arrow-ipc crate or geoarrow's write_ipc_stream.

tumluliu commented 6 months ago

Thanks very much @kylebarron for your very detailed reply! I am kinda looking for a way to bundle the array buffer and metadata together (geometry + srid), have them stored in a contiguous memory chunk, and share it with C (or Python). The separation of FFI_ArrowArray and FFI_ArrowSchema bothered me a bit from the C side. I actually only need a persistable and restorable memory block (as condensed as possible) without the necessity of interpreting its internal details in C. Anyway, I will continue to try with arrow::ffi. Meanwhile, I will also try geoarrow's write_ipc_stream, which requires creating a GeoTable from RecordBatches if my understanding is correct. I hope the GeoTable way won't bring too much extra overhead. Will post some updates here if I have any success. Thanks again!

kylebarron commented 6 months ago

have them stored in a contiguous memory chunk, and share it with C

If you don't want to use FFI, use Arrow IPC. This is what it's there for. You can serialize to a single Vec<u8> which stores both the schema and all array chunks.
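A small sketch of that round trip, with an Int32 column standing in for the geometry column (a real GeoArrow field would carry its extension name and CRS metadata along in the stream):

use std::io::Cursor;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() {
    // Stand-in batch; in this issue's case the single column would be the
    // geometry array carrying its GeoArrow field metadata.
    let schema = Arc::new(Schema::new(vec![Field::new("geometry", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )
    .unwrap();

    // Serialize schema + batches into one Vec<u8> (Arrow IPC stream format).
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buf, &schema).unwrap();
        writer.write(&batch).unwrap();
        writer.finish().unwrap();
    }

    // That Vec<u8> can go to disk or to another language and be restored later.
    let reader = StreamReader::try_new(Cursor::new(buf), None).unwrap();
    for maybe_batch in reader {
        println!("restored {} rows", maybe_batch.unwrap().num_rows());
    }
}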

The separation of FFI_ArrowArray and FFI_ArrowSchema bothered me a bit from the C side

Why did it bother you? You can describe many chunks of an array with only a single schema object, which is much more efficient than copying the schema with every array chunk.

the GeoTable way won't bring too much extra overhead.

The overhead is... on the order of two vec allocations?

tumluliu commented 6 months ago

well, regarding the overhead concern I mentioned: in most of my cases there will be only a single simple OGC geometry object, without any extra column or attribute, to share with C. I considered simply using EWKB/WKB bytes, but gave up because it's too sparse (122 bytes for the above example linestring) and started trying geoarrow instead. In the future there might be some use cases involving a table- or subset-like data package, but that's unclear at the moment. Anyway, I think I will go with Arrow IPC and GeoTable.

kylebarron commented 6 months ago

I considered simply using EWKB/WKB bytes, but gave up because it's too sparse

I'm surprised that EWKB would be too sparse... it's designed to be as compact as possible. It's not even aligned on 8 bytes.
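For scale, a rough byte count for the example linestring (a sketch assuming 2D coordinates and a 4-byte EWKB SRID prefix; note that 122 matches the hex-encoded length of such a value rather than its raw byte count):

fn main() {
    // EWKB for a 3-point 2D linestring with an SRID, assuming:
    // 1-byte byte-order flag + 4-byte geometry type + 4-byte SRID + 4-byte
    // point count, then 2 f64 coordinates per point.
    let header = 1 + 4 + 4 + 4;
    let coords = 3 * 2 * 8;
    let raw = header + coords; // 61 bytes of raw EWKB
    let hex_chars = raw * 2;   // 122 characters if hex-encoded
    println!("raw: {raw} bytes, hex-encoded: {hex_chars} chars");
}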

Anyways, I think the question has been answered and I'll close this.