use aiohttp to fetch multiple state files from S3 in parallel; update format of cached state CSV file

youngj commented 2 years ago

This PR increases the speed of fetching vehicle position state files from S3 by using the aiohttp and asyncio libraries to fetch multiple state files from S3 in parallel.

On my development computer, fetching 24 hours of state files (approximately 5760 files) previously took 422 seconds to complete. With this PR the time was reduced to 82 seconds, a speedup of more than 5x.

This PR configures aiohttp to use 8 parallel requests to the S3 API. In practice, using more than 8 parallel requests didn't seem to make fetching much faster.
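As a rough illustration of capping the number of requests in flight, here is a minimal sketch using only asyncio. It is not the code in this PR: `fetch_all`, `fake_fetch` and the key names are hypothetical, and `fake_fetch` stands in for an aiohttp GET against S3. With aiohttp, a similar cap can also be set on the connector via `aiohttp.TCPConnector(limit=8)`.

```python
import asyncio

MAX_PARALLEL = 8  # same limit as described above


async def fetch_all(keys, fetch_one, max_parallel=MAX_PARALLEL):
    """Run fetch_one(key) for every key, with at most max_parallel in flight.

    fetch_one is any coroutine function taking a single key. Results come
    back in the same order as keys, even if requests finish out of order.
    """
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(key):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch_one(key)

    return await asyncio.gather(*(bounded(key) for key in keys))


async def fake_fetch(key):
    # Hypothetical placeholder for an HTTP GET of one state file.
    await asyncio.sleep(0.01)
    return f"contents of {key}"


if __name__ == "__main__":
    keys = [f"state_{i}.json" for i in range(20)]  # made-up key names
    results = asyncio.run(fetch_all(keys, fake_fetch))
    print(len(results))
```

Note that `asyncio.gather` preserves input order, whereas code that appends results as each request completes (as when writing CSV lines) sees them in completion order.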

The per-route cached state CSV files were also simplified. Previously the CSV contained timestamp and secsSinceReport columns, and when the state file was loaded, the "real" timestamp was calculated by subtracting secsSinceReport from timestamp. This PR does the subtraction before writing and stores the result in the timestamp column of the CSV. The CSV also previously contained a directionId column. The reported direction ID for each vehicle was not actually used anywhere, so it was removed from the CSV file and eclipses.py was updated accordingly.
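A minimal sketch of the new row format, assuming both values are in seconds. The function name `to_cached_row` and the other columns (vehicle ID, lat, lon) are hypothetical and only for illustration; the actual column layout is defined in the PR's code.

```python
def to_cached_row(vehicle_id, lat, lon, timestamp, secs_since_report):
    """Build one cached CSV row with the report time already resolved.

    Assumes timestamp and secs_since_report are both in seconds.
    """
    # Previously done at load time; now done once before writing.
    report_time = timestamp - secs_since_report
    return [vehicle_id, lat, lon, report_time]


if __name__ == "__main__":
    print(to_cached_row("1234", 45.5, -122.6, 1646121615, 15))
```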

Because the parallel requests don't necessarily complete in chronological order, the lines in each CSV file are not necessarily sorted by timestamp. The lines are therefore sorted by timestamp after reading the CSV file from disk.
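Sorting on load can be sketched as follows, using the standard csv module. This assumes the file has no header row and that the timestamp is an integer in the last of four columns; `read_sorted_rows` and the column layout are hypothetical, not taken from the PR.

```python
import csv
import io


def read_sorted_rows(csv_file, timestamp_column=3):
    """Read all rows from an open CSV file and sort them by timestamp.

    Assumes no header row and an integer timestamp in timestamp_column.
    """
    rows = list(csv.reader(csv_file))
    # Rows were appended in request-completion order, so sort here.
    rows.sort(key=lambda row: int(row[timestamp_column]))
    return rows


if __name__ == "__main__":
    sample = io.StringIO(
        "1234,45.5,-122.6,1646121630\n"
        "5678,45.6,-122.7,1646121600\n"
        "1234,45.5,-122.6,1646121615\n"
    )
    for row in read_sorted_rows(sample):
        print(row)
```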

This PR still fetches files from S3 in chunks of 1 hour, constructs a route_csv_lines dict in memory and then appends those CSV lines to each route cache file on disk at the end of each hour. Grouping the CSV lines by route first and then writing to the cache files in batches appears to be marginally faster than writing each CSV line directly to a cache file one at a time.
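The batching idea can be sketched like this. Only the name route_csv_lines comes from the description above; the function `append_lines_by_route`, the `<route_id>.csv` file naming and the input shape are assumptions for illustration.

```python
from collections import defaultdict
from pathlib import Path


def append_lines_by_route(cache_dir, route_lines):
    """Group (route_id, csv_line) pairs by route, then append each group
    to that route's cache file with a single write.

    Returns the number of lines written per route.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Group in memory first so each file is opened once per batch.
    route_csv_lines = defaultdict(list)
    for route_id, line in route_lines:
        route_csv_lines[route_id].append(line)

    for route_id, lines in route_csv_lines.items():
        # Hypothetical file naming: one CSV per route.
        with open(cache_dir / f"{route_id}.csv", "a", newline="") as f:
            f.write("".join(line + "\n" for line in lines))

    return {route_id: len(lines) for route_id, lines in route_csv_lines.items()}
```

Calling this once per hour-long chunk means each route file is opened once per hour rather than once per CSV line.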

To find opportunities for performance improvements, I used cProfile and pstats as described in https://docs.python.org/3/library/profile.html#profile.Profile . This motivated using the @functools.lru_cache() decorator to memoize the get_state_cache_dir function.
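For anyone who wants to repeat the profiling, here is a minimal sketch of the workflow. The signature and return value of `get_state_cache_dir` below are hypothetical stand-ins (the real function lives in this repo), and `build_paths` and `profile_call` are made-up helpers.

```python
import cProfile
import functools
import io
import pstats


@functools.lru_cache()
def get_state_cache_dir(agency_id):
    # Hypothetical body: the point is that the result depends only on
    # the argument, so repeated calls can be served from the cache.
    return f"cache/{agency_id}/state"


def build_paths(n):
    # Calls the memoized function n times, as a hot loop might.
    return [f"{get_state_cache_dir('demo')}/{i}.csv" for i in range(n)]


def profile_call(func, *args):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_call(build_paths, 1000))
    print(get_state_cache_dir.cache_info())
```

`cache_info()` reports hits and misses, which is a quick way to confirm that the memoized function is only computed once per distinct argument.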

The requests library was also upgraded to avoid a warning when importing the library.

Since requirements.txt has changed, it is necessary to run docker-compose build in order to run the code in this PR.

sidetrackedmind commented 2 years ago

Looks great - I tested vehicle_positions.get_state and python compute_new.py --start-date '2022-03-01'. Both ran smoothly and quickly.

sidetrackedmind commented 2 years ago

I also agree with getting rid of directionId and combining timestamps into one.

codeforpdx / opentransit-metrics

use aiohttp to fetch multiple state files from S3 in parallel; update format of cached state CSV file #7