department-for-transport-BODS / bods-data-extractor

A python client for downloading and extracting data from the UK Bus Open Data Service

CM - Data pull optimizations and refactoring #82

Closed mullinscr closed 1 year ago

mullinscr commented 1 year ago

I've tried to keep the same logical flow but have optimized some aspects of the data pull. I have not looked at the timetable generation, as I am aware that changes are already being made to it. Generally, this has meant refactoring the code for logic or clarity, using more Pythonic code styles, reducing repetition, removing redundant methods, etc.

As an analyst for a local authority, I'm particularly keen (as I believe many consumers would be) on using the package for area-based queries (in my case, the Leicester and Leicestershire admin areas). As such, I have improved the "pull_timetable_data" method to incorporate the atco_code at the Timetable API request level, as opposed to filtering out the wanted services after downloading the whole dataset. This change saves a very large amount of time for my use case.
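To illustrate the idea of filtering at the request level, here is a minimal sketch of building a query that carries the area codes, so the server returns only matching datasets. It is not the code from this PR: the helper name build_timetable_query is hypothetical, and the endpoint URL and the api_key, limit and adminArea parameter names are assumptions that should be checked against the BODS API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the BODS API documentation.
BASE_URL = "https://data.bus-data.dft.gov.uk/api/v1/dataset/"


def build_timetable_query(api_key, atco_codes=None, limit=100):
    """Build a dataset query URL (hypothetical helper, not part of the package).

    When atco_codes is given, the codes are sent with the request so that
    filtering happens server-side rather than after a full download.
    """
    params = {"api_key": api_key, "limit": limit}
    if atco_codes:
        # Assumed parameter name: a comma-separated list of ATCO area codes.
        params["adminArea"] = ",".join(atco_codes)
    return f"{BASE_URL}?{urlencode(params)}"


# Area-restricted request versus a request for everything.
filtered_url = build_timetable_query("YOUR_KEY", atco_codes=["260", "269"])
full_url = build_timetable_query("YOUR_KEY")
```

With the filter in the request, the client only ever downloads datasets for the requested areas, which is where the time saving comes from.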

The other change is to add a threaded parameter to the TimetableExtractor __init__(). If set to "True", multithreading is used to significantly reduce the dataset download time. Please see the two examples below for timings on my (not high-performance) laptop:

# Executes in roughly 25 seconds, detailing just Leicester and Leicestershire services
TimetableExtractor(key, service_line_level=True, atco_codes=['260', '269'], threaded=True)

# Executes in ~7 minutes, fetching all available datasets (963 timetable datasets at the time of writing).
TimetableExtractor(key, service_line_level=True, threaded=True)
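
For readers unfamiliar with why threading helps here, the following is a minimal sketch of the general pattern, not the implementation in this PR. Downloads are I/O-bound, so a thread pool can overlap the waiting time of many requests. The names fetch_dataset and download_all are hypothetical, and the download is simulated with a short sleep so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_dataset(url):
    """Stand-in for a real download (hypothetical); sleeps to mimic network latency."""
    time.sleep(0.05)
    return f"contents of {url}"


def download_all(urls, threaded=False, max_workers=8):
    """Fetch every URL, optionally using a thread pool.

    The result order matches the input order in both modes.
    """
    if threaded:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map yields results in the order of the input iterable.
            return list(pool.map(fetch_dataset, urls))
    return [fetch_dataset(url) for url in urls]


urls = [f"https://example.invalid/dataset/{i}" for i in range(16)]
sequential = download_all(urls)
parallel = download_all(urls, threaded=True)
```

Both calls return the same list; the threaded call spends less wall-clock time because the simulated waits overlap. The actual speed-up for real downloads depends on network conditions and any rate limits on the server side.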