dabreegster / odjitter

Disaggregate zone-based origin/destination data to specific points
Apache License 2.0

Revisit library API and consider streaming #50

Closed dabreegster closed 1 year ago

dabreegster commented 1 year ago

https://github.com/dabreegster/odjitter/issues/49#issuecomment-1645620084 started a conversation about providing tighter R integration. That requires revisiting the current library API, which right now manually writes GeoJSON output to avoid collecting one massive FeatureCollection in memory.
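For readers unfamiliar with the incremental-writing idea, here is a minimal sketch of it, not the actual odjitter code. It writes the FeatureCollection wrapper by hand and emits one feature at a time, so only a single feature is ever held in memory. The function name `write_feature_collection`, the `Point` alias, and the choice to write each pair as a `LineString` with empty properties are all assumptions made for illustration.

```rust
use std::io::{self, Write};

/// Hypothetical point type: (x, y). A stand-in for whatever the library uses.
type Point = (f64, f64);

/// Writes OD pairs as a GeoJSON FeatureCollection, one feature at a time.
/// Nothing is buffered beyond the feature currently being written.
fn write_feature_collection<W, I>(mut out: W, pairs: I) -> io::Result<()>
where
    W: Write,
    I: IntoIterator<Item = (Point, Point)>,
{
    // Opening of the collection, written by hand.
    write!(out, r#"{{"type":"FeatureCollection","features":["#)?;
    for (idx, (from, to)) in pairs.into_iter().enumerate() {
        // Features are comma-separated, with no trailing comma.
        if idx > 0 {
            write!(out, ",")?;
        }
        write!(
            out,
            r#"{{"type":"Feature","properties":{{}},"geometry":{{"type":"LineString","coordinates":[[{},{}],[{},{}]]}}}}"#,
            from.0, from.1, to.0, to.1
        )?;
    }
    // Close the features array and the collection object.
    write!(out, "]}}")?;
    Ok(())
}
```

Because `pairs` is any `IntoIterator`, the same function works whether the caller passes a `Vec` or a lazy stream, and `out` can be a file, a buffer, or stdout.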

I've also been exploring large-scale use of odjitter in https://github.com/dabreegster/routing-engines recently, and I think it's worth stopping and considering the larger pipeline happening around odjitter. If the overhead of using GeoJSON files as interchange is acceptable, then we don't need bindings for any language; someone can just run the CLI tool, then parse the GeoJSON output.

Say this becomes prohibitive for very large outputs, and we change the library to return Vec<(Point, Point)>. Forcing everything into memory will not work at some point. The output of odjitter is probably then fed into a routing engine as requests. The end-to-end pipeline I'm starting to mock up would want to do something like get the next batch of 500 OD pairs, feed them into the router, and then ask for the next batch. So actually, a streaming iterator makes more sense to me.

If writing a GeoJSON file is still the goal (for the CLI tool, for instance), it should be straightforward to continue the incremental writing that's happening now. If rsgeo conversions like this are the goal, then whoever's writing the bindings can just map over this stream of requests, and decide to return to the other language in one batch, or plumb through some kind of streaming or batching approach there too.
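To make the batching idea concrete, here is a rough sketch under stated assumptions; it is not the odjitter API. `jittered_pairs` is a hypothetical stand-in for the jittering step that yields OD pairs lazily, `route_in_batches` is a hypothetical consumer that pulls a fixed number of pairs at a time, and the coordinates are made up.

```rust
/// Hypothetical point type: (x, y). A stand-in for whatever the library uses.
type Point = (f64, f64);

/// Hypothetical stand-in for the jittering step. It yields one OD pair per
/// trip lazily, instead of collecting everything into a Vec first.
fn jittered_pairs(num_trips: usize) -> impl Iterator<Item = (Point, Point)> {
    (0..num_trips).map(|i| {
        // Made-up coordinates; real code would sample subpoints inside zones.
        let offset = i as f64 * 0.001;
        ((offset, 51.5), (offset + 0.01, 51.6))
    })
}

/// Pulls up to `batch_size` pairs at a time from `pairs` and hands each batch
/// to `route` (for example, a call into a routing engine). Only one batch is
/// in memory at once. Returns the number of batches processed.
fn route_in_batches<I, F>(pairs: I, batch_size: usize, mut route: F) -> usize
where
    I: Iterator<Item = (Point, Point)>,
    F: FnMut(&[(Point, Point)]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut pairs = pairs.peekable();
    let mut batches = 0;
    // Keep going while the stream still has at least one pair left.
    while pairs.peek().is_some() {
        let batch: Vec<(Point, Point)> = pairs.by_ref().take(batch_size).collect();
        route(&batch);
        batches += 1;
    }
    batches
}
```

A caller could write `route_in_batches(jittered_pairs(1200), 500, |batch| send_to_router(batch))`, where `send_to_router` is whatever the pipeline uses to issue requests. A bindings author could instead `collect()` the same iterator and return everything to R or Python in one go, which is the flexibility the iterator approach is meant to leave open.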

I'm also possibly starting to hit scaling limitations for large subpoint inputs, but not going to worry about that quite yet.

So to summarize next steps:

Robinlovelace commented 1 year ago

> If the overhead of using GeoJSON files as interchange is acceptable, then we don't need bindings for any language; someone can just run the CLI tool, then parse the GeoJSON output.

Acceptable for all use cases I've seen including on national OD dataset processing in the CRUSE project.

Feeding the output of odjitter into a routing engine sounds reasonable, and if the output is lists of ordered points rather than full route datasets, that could further speed things up, especially if the output isn't routes but route networks. In some cases, keeping and being able to view individual routes can be useful, e.g. if you want to see the origins and destinations associated with a particular segment. Jittering is currently less computationally intensive than routing, but if the routing stage massively speeds up, and for use cases where input subpoint datasets become large, my first impression is: sounds good. Supplementary benefit: making it easier to wrap the code 'properly' with maturin/rextendr, without calling the CLI, if I understand correctly. Quickfire thoughts from the train, hope they're useful. My interest is an opportunity to learn how to build R/Python packages that wrap Rust libraries, like polars and r-polars, out of curiosity, with any performance benefits being a bonus.

Robinlovelace commented 1 year ago

Additional potential benefit I just thought of: tight integration could help address #48. If pip install odjitter or equivalent code in other languages auto-installs the binary that would be amazing.