conveyal / r5

Developed to power Conveyal's web-based interface for scenario planning and land-use/transport accessibility analysis, R5 is our routing engine for multimodal (transit/bike/walk/car) networks, with a particular focus on public transit.
https://conveyal.com/learn
MIT License

Make human-friendly names available and allow multi-grid downloads #932

Closed. abyrd closed this pull request 3 months ago.

abyrd commented 4 months ago

The backend returns download URLs as small JSON objects with a url property. In this PR, these JSON responses are extended to also have a humanName property, which is an alternate filename that is more human readable but not guaranteed to be unique. This enables #673 (as discussed in https://github.com/conveyal/r5/issues/673#issuecomment-1912000319) but should be backward compatible with existing UI as responses still have the same old url property. The new humanName property can be used in various ways such as the download attribute of an HTML link or as the attachment name in a content-disposition header.
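
For illustration, here is a minimal sketch of the wrapped response described above; the class and field names are stand-ins rather than the backend's actual types:

```java
// Illustrative only: a plain value object with the two properties the JSON responses carry.
public class DownloadResponse {
    /** The download URL, unchanged from the previous behavior. */
    public final String url;
    /** A more legible alternate filename; human-friendly but not guaranteed to be unique. */
    public final String humanName;

    public DownloadResponse (String url, String humanName) {
        this.url = url;
        this.humanName = humanName;
    }
}
```

The same humanName value can then feed either the download attribute of a link on the UI side or a content-disposition attachment name on the serving side, as described above.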

Regional result CSV download links seem to be the odd one out: they return text/plain instead of a JSON wrapped URL. Switching them over to use JSON wrapped URLs like everything else would require an accompanying change on the UI side but would also enable the CSV downloads to use human readable names. The updated code to return the JSON wrapped URL with humanName is already present in RegionalAnalysisController.getCsvResults but commented out for the time being.

This PR also adds an endpoint at /api/regional/:_id/all that will generate all available GeoTIFF rasters for a given regional analysis and store them in a persistent ZIP file, returning one of these new-style JSON wrapped download links including a human-readable name. An attempt is also made to replace any UUIDs in the names of files inside this ZIP with relevant human-readable strings.
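
A rough sketch of the bulk ZIP step follows; it assumes a helper that hands back a local raster file plus a human-readable name per grid, and all names and types here are illustrative rather than the PR's actual code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative sketch: the real endpoint works through R5's file storage and model classes.
public class BulkGridZipSketch {

    /** Stand-in for a (local raster file, human-readable name) pair produced per grid. */
    record GridResult (File localFile, String humanName) {}

    /** Write one ZIP entry per already-extracted single-cutoff raster. */
    static void writeZip (File zipFile, Iterable<GridResult> grids) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipFile))) {
            for (GridResult grid : grids) {
                // Human-readable names (with UUIDs replaced) become the entry names inside the ZIP.
                zip.putNextEntry(new ZipEntry(grid.humanName() + ".tiff"));
                Files.copy(grid.localFile().toPath(), zip);
                zip.closeEntry();
            }
        }
    }
}
```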

Many of the changes here are just small changes to return types, MIME types, etc. There is one big add/remove block in RegionalAnalysisController, which mostly factors the action of extracting a single grid out of the HTTP controller so it can be called once by the existing grid-retrieval HTTP controller and multiple times in a loop by the new bulk-zip-creating HTTP controller. The smaller targeted changes there are somewhat camouflaged/exaggerated by the indentation and movement of that big block.

abyrd commented 3 months ago

Added one commit to address some of the review comments. Further changes will require some decisions about where it's appropriate to make database calls.

abyrd commented 3 months ago

Recent commits factor out a humanNameForEntity method that creates names in the format User_assigned_name_a1b2c3 for any Model instance, including RegionalAnalysis and OpportunityDataset. This is used by a revised getSingleCutoffGrid method, which returns a new record type (FileStorageKey storageKey, String humanName). The getSingleCutoffGrid method now also receives the cutoff in minutes and looks up the index into the cutoffs array as needed. The same approach to generating human-readable names for grid files is used everywhere, in both the bulk zip download and the retrieval of individual files.
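
A minimal sketch of that naming scheme, assuming each Model carries a user-assigned name and a unique _id; the exact sanitization and suffix length may differ from the real humanNameForEntity:

```java
// Illustrative sketch of the User_assigned_name_a1b2c3 naming format described above.
public class HumanNameSketch {
    static String humanNameForEntity (String userAssignedName, String _id) {
        // Replace anything awkward in a filename with underscores.
        String sanitized = userAssignedName.trim().replaceAll("[^A-Za-z0-9]+", "_");
        // Append a short fragment of the unique _id so similarly named entities stay distinguishable.
        String suffix = _id.substring(Math.max(0, _id.length() - 6));
        return sanitized + "_" + suffix;
    }
    // e.g. humanNameForEntity("Jobs within 45 min", "f00ba4a1b2c3") returns "Jobs_within_45_min_a1b2c3"
}
```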

abyrd commented 3 months ago

Although this has been merged, I'm adding follow-up observations here to keep them together with the original discussion. On today's call we identified a problem with the multi-file ZIP download. In a regional analysis with millions of origin points and, say, 40 different cutoff/destination/percentile combinations, creating the zip file can be quite slow, taking over 2 minutes. It's not appropriate to stall an HTTP request for that long with no indication of progress. In the absence of feedback, users are likely to re-initiate the creation/download process multiple times from a command line, script, or UI. This could lead to multiple stalled HTTP handler threads all creating the same grid and zip files and overwriting each other's results.

In initial testing I had not noticed this slowness, so the first step was to identify which part of the process is slow. This could be extracting the single-cutoff values from the multi-cutoff regional results file, encoding the geotiff file (or grid file, which is similarly slow), or transferring these resulting files to S3. We don't have enough log messages to immediately see the timing of each step. But by comparing file modification times (with ls --full-time for high time resolution) and log messages, about 13 seconds elapse between the log message about starting to extract the single-cutoff grid and the geotiff file appearing on the filesystem. The file is timestamped one second later on S3. So apparently S3 transfer is not the culprit, and it's either extracting the single-cutoff data or encoding/compressing the geotiff and grid file that is slow. These files are only about 3-6MB so this is not expected to be slow at all - we need to investigate this. Maybe we're using some un-buffered file IO?

Assuming creating these files remains "slow" (more than a few seconds), the solution will probably involve a mechanism like the one we use in network building, where an HTTP response is always immediate. If the object exists, it's immediately returned. Otherwise the background process is started and we respond with a 202 ACCEPTED (see AnalysisWorkerController.handleSinglePoint L68). But the AsyncLoader mechanism used for networks is not very appropriate for this situation: the results of the multi-file ZIP operation are not held as an object in memory, and though AsyncLoader aims to be reusable/generic, it is a caching mechanism incorporating some network-build-specific ideas.
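
For reference, a generic sketch of that immediate-response pattern, using only standard-library classes and illustrative names rather than the actual AsyncLoader or controller code:

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Respond immediately and build the file at most once, even under repeated requests.
public class ImmediateResponseSketch {
    static final ExecutorService background = Executors.newFixedThreadPool(2);
    // Track in-progress builds so duplicate requests don't start duplicate work.
    static final ConcurrentHashMap<String, Boolean> inProgress = new ConcurrentHashMap<>();

    /** Return the file if it already exists, else schedule its creation and return null (caller sends 202). */
    static File getOrStartBuilding (String key, File target, Runnable build) {
        if (target.exists()) return target;
        // putIfAbsent ensures only the first request actually schedules the build.
        if (inProgress.putIfAbsent(key, Boolean.TRUE) == null) {
            background.submit(() -> {
                try { build.run(); } finally { inProgress.remove(key); }
            });
        }
        return null; // the HTTP layer responds 202 ACCEPTED and the client retries later
    }
}
```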

Aside: as noted in AnalysisWorker#handleAndSerializeOneSinglePointTask, "Ideally we'd stall briefly using something like Future.get(timeout) in case loading finishes quickly" so not all requests for newly created entities result in a 202 and retry, but that's another topic.
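
That "stall briefly" idea could look roughly like this (again with assumed names, not code from this PR):

```java
import java.io.File;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Give the background build a short grace period before answering 202, so fast builds
// are served directly instead of forcing the client into a retry loop.
public class BriefStallSketch {
    /** Return the built file if it finishes within the grace period, else null (caller sends 202). */
    static File waitBriefly (Future<File> build) throws InterruptedException, ExecutionException {
        try {
            return build.get(5, TimeUnit.SECONDS);
        } catch (TimeoutException stillBuilding) {
            return null;
        }
    }
}
```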

abyrd commented 3 months ago

Looking into the speed of individual multi-cutoff to single-cutoff conversions. I added some log statements around the lines that produce the single-cutoff files and ran a regional analysis for Portland (57k origins). It takes around 200 ms to derive a single-cutoff file. Portland is about 52x smaller than the original culprit regions with around 3M origins, and 52 * 200 ms gives roughly 10 s, aligning with experience on those larger regions.

Wrapping input and output with buffered streams makes little to no difference, probably because input and output are both passing through gzip streams, which contain their own buffers. Sampling CPU in VisualVM shows InflaterInputStream.read as the most time-consuming call. Reading time is about double writing time.
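
Roughly the kind of wrapping that was tried, sketched with placeholder file and method names:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// The GZIP streams keep their own internal buffers, which is presumably why adding the
// extra Buffered* wrappers around them changed little.
public class BufferedGzipSketch {
    static DataInputStream openAccessInput (File accessFile) throws IOException {
        return new DataInputStream(
                new BufferedInputStream(new GZIPInputStream(new FileInputStream(accessFile))));
    }

    static DataOutputStream openGridOutput (File gridFile) throws IOException {
        return new DataOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(gridFile))));
    }
}
```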

Unzipping the source .access file and removing the gunzip input stream halves the time to 100 ms instead of 200 ms. Proportions are reversed, with 2/3 of time now in writing instead of 2/3 in reading. Removing the gzip output stream (in addition to the gunzip input stream) still produces files usable by the UI. This drops the conversion time to around 30 ms, so an 85% decrease relative to the original 200 ms. The input and output files are observed to be about 3x bigger without compression.

Using command line tools, gunzipping and re-gzipping the entire multi-cutoff .access source file takes 10 and 50 ms respectively. This is much faster than the roughly 170 ms that compression seems to contribute in the Java case.

Initial conclusion is that streaming compression/decompression is the main culprit, perhaps in how it interacts with the byte-by-byte little-endian DataInputStream. It would be interesting to see how much faster this would be if performed with NIO byte buffers.
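
Purely as an illustration of that NIO idea: read the already-decompressed data into a little-endian ByteBuffer and pull values out in bulk, rather than going through a byte-by-byte stream wrapper.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only; not how R5 currently reads grid data. Assumes the file holds at
// least `count` little-endian 32-bit ints.
public class LittleEndianNioSketch {
    static int[] readInts (Path file, int count) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(count * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            while (buffer.hasRemaining() && channel.read(buffer) >= 0) {} // fill to capacity or EOF
            buffer.flip();
            int[] values = new int[count];
            buffer.asIntBuffer().get(values); // bulk get honors the buffer's little-endian order
            return values;
        }
    }
}
```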

It's also worth considering whether these files need to be compressed at rest, and whether we should just derive all the grid and tif files as part of the regional analysis process instead of extracting them one by one from the .access file. Compression is expected to be fast and cheap, though, and we don't want to incur it on every HTTP download instead of once when the file is created simply to work around the fact that the streamed gzip compression is slow.

I also tried increasing the read buffer size with new GZIPInputStream(rawInput, 65536) and it didn't seem to make any difference.