conveyal / r5

Developed to power Conveyal's web-based interface for scenario planning and land-use/transport accessibility analysis, R5 is our routing engine for multimodal (transit/bike/walk/car) networks, with a particular focus on public transit.
https://conveyal.com/learn
MIT License

Return single-point results as JSON #825

Open trevorgerhardt opened 2 years ago

trevorgerhardt commented 2 years ago

While experimenting with the returned pathSummaries I was looking at the appended JSON block (https://github.com/conveyal/r5/blob/dev/src/main/java/com/conveyal/r5/analyst/cluster/AnalysisWorker.java#L513) and became curious about the size/speed difference of returning the entire result set as JSON instead of as a partly binary, partly JSON response.

My initial findings appeared to show no significant difference in size or speed when returning the times data as JSON instead of our custom delta-coded binary response.
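Roughly, the comparison looked like this (a sketch rather than the exact code, with made-up names; it assumes Jackson for the JSON side and approximates our binary format as gzipped, delta-coded 32-bit ints):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class ResultSizeComparison {

    // Gzipped size of the travel times serialized as a JSON array of arrays.
    static int gzippedJsonSize (int[][] timesPerPercentile) throws IOException {
        byte[] json = new ObjectMapper().writeValueAsBytes(timesPerPercentile);
        return gzippedLength(json);
    }

    // Gzipped size of the same values written as delta-coded 32-bit ints.
    // This only approximates the real single-point binary format.
    static int gzippedBinarySize (int[][] timesPerPercentile) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(buffer);
        for (int[] times : timesPerPercentile) {
            int previous = 0;
            for (int time : times) {
                data.writeInt(time - previous);
                previous = time;
            }
        }
        data.flush();
        return gzippedLength(buffer.toByteArray());
    }

    static int gzippedLength (byte[] bytes) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(bytes);
        }
        return compressed.size();
    }
}
```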

I may be forgetting a specific reason that requires us to return single-point results as binary data. Additionally, this would be a "worker incompatible" change. But if the reason for the custom binary transmission is size and speed, this would be a much simpler long-term solution for transmission and storage of this data. It would certainly be easier to inspect and debug.

We already have an option for returning a GeoTIFF. Adding a GeoJSON-only response would allow us to add this experimentally. Long term, it's likely we'd want to keep the binary format for larger results that need to be optimized, but in most cases a GeoJSON response would probably be preferable for ease of use.

abyrd commented 2 years ago

The backstory is that we started off thinking of the output as images. Image formats already have well-designed filtering to reduce entropy and improve compression, and the browser side should handle filtering and decompression in native code, making it efficient. I don't recall the reason we didn't just use PNG (or stopped using PNG), but it probably had something to do with the available bit-depth options (and the way filtering is applied across channels) not being quite the right fit for our temporal data. So this binary format started out as a stripped-down borrowing of the useful bits of PNG, and I think this was before we added any of the metadata, errors, or other responses.

The underlying reason was probably just efficiency: avoiding parsing and converting between textual renderings of numbers and the actual binary numbers we'd be operating on, with delta coding to improve compression, transforming all data in-place instead of copying.
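The delta coding itself is trivial, something like the following done in place over each row of travel times (a sketch of the idea, not the exact code in the worker):

```java
// Encode each value as the difference from the previous one. Adjacent
// destinations tend to have similar travel times, so the deltas are small
// and repetitive, which is what helps the gzipped result compress well.
static void deltaEncodeInPlace (int[] times) {
    int previous = 0;
    for (int i = 0; i < times.length; i++) {
        int current = times[i];
        times[i] = current - previous;
        previous = current;
    }
}

// Reverse the transform on the client by accumulating a running sum.
static void deltaDecodeInPlace (int[] deltas) {
    int accumulated = 0;
    for (int i = 0; i < deltas.length; i++) {
        accumulated += deltas[i];
        deltas[i] = accumulated;
    }
}
```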

I don't know if any serious testing was ever done to compare this against a JSON or text representation; once the binary format was in place, the value of consistency with existing workers probably won out over experimentation.

It's worth noting that for numbers of fewer than five digits, the text representation is no bigger than a 32-bit binary representation (e.g. 4321 is four ASCII bytes, five with a comma separator, versus four bytes as an int). Without delta coding it's expected not to compress as well, though I don't know how large that effect is for our specific data. Parsing the JSON numbers into actual numbers does take more work, but the bottleneck is probably memory rather than these computations. The main consideration is that the JSON approach will involve somewhat more copying.

Basically I agree that just listing out comma-separated numbers is a viable option, and that's why we did a lot of the other result types as JSON. But I think it deserves serious testing since we've got a system in place that has some potential to be more efficient. I'd want to check not just whether it has a user-perceptible impact, but whether it's contributing to overall load behind the scenes, where many small inefficiencies could add up.
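On the decode side, the kind of check I have in mind is roughly the following (a naive sketch with made-up names; a real measurement would use the actual response payloads and a proper benchmark harness rather than hand-rolled timing, and Jackson is just one plausible way to parse the JSON):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class DecodeTiming {

    // Time parsing a JSON array of ints back into an int[]. Returns elapsed
    // nanos; the checksum keeps the parsed values from being dead code.
    static long jsonDecodeNanos (byte[] json, int reps) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            int[] times = mapper.readValue(json, int[].class);
            checksum += times[times.length - 1];
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("checksum " + checksum);
        return elapsed;
    }

    // Time reading the same number of values as raw 32-bit ints.
    static long binaryDecodeNanos (byte[] binary, int nValues, int reps) throws IOException {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(binary));
            int last = 0;
            for (int j = 0; j < nValues; j++) last = in.readInt();
            checksum += last;
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("checksum " + checksum);
        return elapsed;
    }
}
```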

trevorgerhardt commented 2 years ago

I agree; thorough testing would be necessary to determine whether a breaking change as significant as this would be worth it. As of today this is definitely not a priority. I created this issue to capture thoughts and conversation about the topic.

I would suggest that because this change would only affect the transfer of results (we don't store single-point results), the user-perceptible impact is the most important factor, followed by the simplicity and maintainability of the code. While we shouldn't ignore the load behind the scenes, it's less likely to be a factor in these one-off responses.

trevorgerhardt commented 2 years ago

Additional thought: it's likely that if the custom delta-coded binary format isn't significantly beneficial versus gzipped JSON here, the same may be true in other places where we use a custom binary format (like opportunity dataset grids).

While I would consider "freeform JSON" a simpler format, we would probably want to implement a structured schema using something like OpenAPI / JSON Schema, so there would still be some complexity in setting it up.
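As a strawman, the schema might describe something along these lines (field names entirely hypothetical, sketched here as a Jackson-serializable class rather than the schema document itself):

```java
import java.util.List;

// Hypothetical shape for a structured single-point JSON response. The real
// field names and nesting would come out of whatever schema we agree on.
public class SinglePointResult {
    public ResultMetadata metadata;          // request settings, scenario, origin
    public List<String> warnings;            // the errors/warnings currently appended as JSON
    public int[][] travelTimesSeconds;       // one row per percentile, one value per destination
    public List<PathSummary> pathSummaries;  // the path detail we already return as JSON

    public static class ResultMetadata { }
    public static class PathSummary { }
}
```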

abyrd commented 1 year ago

> While we shouldn't ignore the load behind the scenes, it's less likely to be a factor in these one-off responses.

By "load behind the scenes" I was (unclearly) referring to load on the client machine, and its effect even in rendering one-off responses. If there are two ways of doing something, option A and B may both feel equally snappy when the machine is otherwise idle, but option B might be performing significantly more computation. Ignoring battery depletion it's easy to conclude that the extra work doesn't matter when the user doesn't perceive it. But when the client machine is handling other loads (indexing files in the background, intermittent background operations, stray processes or other buggy webpages chewing up some of the cores) it will get sluggish. This kind of "death by a thousand cuts", where apps or tasks are written to have tolerable performance alone but combine to exhaust available resources with no single cause, is often cited as a main source of contemporary problems in compute performance. For this reason I like the idea of keeping things efficient even when there's compute capacity to spare. Practically I just mean we should measure and consider the effects of CPU load and latency even if they're not perceptible under standard testing conditions.