Closed youngj closed 2 years ago
Looks great - I tested vehicle_positions.get_state and python compute_new.py --start-date '2022-03-01'. Both ran smoothly and quickly.
I also agree with getting rid of directionId and combining timestamps into one.
This PR increases the speed of fetching vehicle position state files from S3 by using the aiohttp and asyncio libraries to fetch multiple state files from S3 in parallel.
On my development computer, fetching 24 hours of state files (approximately 5760 files) previously took 422 seconds to complete. With this PR the time was reduced to 82 seconds, a speedup of more than 5x.
This PR configures aiohttp to use 8 parallel requests to the S3 API. In practice, adding more than 8 parallel requests didn't seem to make it much faster.
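The bounded parallelism described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it caps concurrency with asyncio.Semaphore (an assumption, since the description does not say how the limit of 8 is enforced), and fake_fetch, fetch_all, and the file_N keys are hypothetical stand-ins so the sketch runs without aiohttp or S3 access.

```python
import asyncio

MAX_PARALLEL = 8  # the limit described above

# Counters used only to show that the cap is respected.
in_flight = 0
peak = 0


async def fake_fetch(key):
    """Hypothetical stand-in for an aiohttp GET of one state file from S3."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate network latency
    in_flight -= 1
    return f"state for {key}"


async def fetch_all(keys, fetch_one, max_parallel=MAX_PARALLEL):
    """Fetch every key, with at most max_parallel requests in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(key):
        async with semaphore:
            return await fetch_one(key)

    # gather returns results in the order of `keys`, regardless of
    # the order in which the individual requests finish.
    return await asyncio.gather(*(bounded(key) for key in keys))


results = asyncio.run(fetch_all([f"file_{i}" for i in range(24)], fake_fetch))
print(len(results), "files fetched, peak concurrency:", peak)
```

In a real fetcher, fake_fetch would be replaced by a coroutine that issues the request through a shared aiohttp session.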
The per-route cached state CSV files were also simplified. Previously the CSV contained timestamp and secsSinceReport columns. When the state file was loaded, the "real" timestamp was calculated by subtracting secsSinceReport from timestamp. This PR just performs that subtraction first and stores the result in the timestamp column of the CSV. Also, the CSV previously contained a directionId column. However, the reported direction ID for each vehicle was not actually used anywhere, so it was removed from the CSV file and eclipses.py was updated accordingly.
The lines in each CSV file are not necessarily sorted by timestamp, since the parallel requests don't necessarily complete in chronological order, so the lines are sorted by timestamp after reading the CSV file from disk.
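To illustrate the simplified CSV handling, here is a small sketch under stated assumptions: the column order (timestamp, vehicle ID, latitude, longitude) and the helper names to_csv_row and load_sorted_rows are hypothetical and not taken from the PR. It stores the already-adjusted timestamp when writing and sorts by that column when reading.

```python
import csv
import io


def to_csv_row(vehicle_id, reported_timestamp, secs_since_report, lat, lon):
    """Build one CSV row, storing the adjusted timestamp directly."""
    return [reported_timestamp - secs_since_report, vehicle_id, lat, lon]


def load_sorted_rows(csv_text):
    """Read CSV rows and sort them by the timestamp in the first column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: int(row[0]))
    return rows


# Rows written out of chronological order, as can happen when
# parallel requests finish in a different order than they started.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(to_csv_row("1234", 1646121630, 12, 37.77, -122.41))
writer.writerow(to_csv_row("5678", 1646121615, 5, 37.78, -122.42))

rows = load_sorted_rows(buffer.getvalue())
for row in rows:
    print(row)
```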
This PR still fetches files from S3 in chunks of 1 hour, constructs a route_csv_lines dict in memory, and then appends those CSV lines to each route cache file on disk at the end of each hour. Writing to the cache files in batches (after first grouping CSV lines by route) appears to be marginally faster than writing each CSV line directly to a cache file one at a time.
To find opportunities for performance improvements, I used cProfile and pstats as described in the Python documentation (https://docs.python.org/3/library/profile.html#profile.Profile). This motivated using the @functools.lru_cache() decorator to memoize the get_state_cache_dir function.
The requests library was also upgraded to avoid a warning when importing the library.
Since requirements.txt has changed, it is necessary to run docker-compose build in order to run the code in this PR.
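The batched cache writes and the memoized directory lookup described above might look something like the following sketch. The names route_csv_lines, get_state_cache_dir, and functools.lru_cache come from the description; everything else (the agency_id argument, CACHE_ROOT, append_hour, the per-route file naming, and the sample records) is an assumption made for illustration.

```python
import functools
import os
import tempfile
from collections import defaultdict

CACHE_ROOT = tempfile.mkdtemp()  # hypothetical cache location


@functools.lru_cache()
def get_state_cache_dir(agency_id):
    """Return the cache directory, creating it only on the first call."""
    path = os.path.join(CACHE_ROOT, agency_id)
    os.makedirs(path, exist_ok=True)
    return path


def append_hour(agency_id, records):
    """Group one hour of (route_id, csv_line) pairs, then append per route."""
    route_csv_lines = defaultdict(list)
    for route_id, line in records:
        route_csv_lines[route_id].append(line)

    # One open/write per route instead of one per CSV line.
    for route_id, lines in route_csv_lines.items():
        path = os.path.join(get_state_cache_dir(agency_id), f"{route_id}.csv")
        with open(path, "a") as f:
            f.write("\n".join(lines) + "\n")


append_hour("demo", [
    ("14", "1646121610,5678"),
    ("49", "1646121612,9012"),
    ("14", "1646121618,1234"),
])
print(get_state_cache_dir.cache_info())
```

The second lookup of the cache directory is served from the lru_cache rather than repeating the filesystem work, which is the kind of saving a profiler run can surface.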