Closed EOstridge closed 1 year ago
Thanks Emily!
Can I please add a suggestion 3 - use the existence (or not) of the 'today folder' (and if necessary the file name in that folder) to only run the download code if haven't already downloaded it today (will save a lot of time!)
Suggestion 4, remove fetch db and have a method with get_otc_db(save=False) - for example:
def get_otc_db(save=False):
otc_regions = []
# loop through regions as per OTC DB page
for region in otc_db_files:
# print(f'Downloading region: {region}...')
# get the raw text from the link, should be a csv file
text_out = requests.get(region).content
# convert to a dataframe
df = pd.read_csv(io.StringIO(text_out.decode('utf-8-sig')))
# append current region df to regions list
otc_regions.append(df)
# combine to a single dataframe
print('Merging files...')
otc_db = pd.concat(otc_regions)
# otc_db.drop(otc_db.columns[0], axis = 1, inplace = True)
otc_db['service_code'] = otc_db['Reg_No'].str.replace('/', ':')
# postgresql does not like uppercase or spaces - removing from column titles
otc_db.columns = [c.lower() for c in otc_db.columns]
otc_db.columns = [c.replace(" ", "_") for c in otc_db.columns]
# remove duplicate rows
otc_db = otc_db.drop_duplicates()
if save:
create_today_folder()
downloads_folder = get_user_downloads_folder()
save_loc = downloads_folder + '/' + today + f'/otc_db_{today}.csv'
otc_db.to_csv(save_loc, index=False)
return otc_db
Suggestion 5
consider more succinct ways of getting user download folder and making today folder - for example, there's no need to list out the contents of the directory. How about if get_otc_db is true, the user can choose where to save it to? especially for the error for non-recognised OS - rather than interrupting programme flow completely, just change this to allow user to select location.
@EOstridge please feel free to separate into different issues if so required
Changes have been addressed in the Data pull CM - Data pull optimizations and refactoring #82
As a data consumer, it would be useful to refactor the otc_db_download file in the following ways to improve accessibility.
Suggestion 1
Suggested order for refactoring:
Suggestion 2
west_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_H.csv' west_midlands = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_D.csv' london_south_east = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_K.csv' north_west_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_C.csv' north_east_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_B.csv' east_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_F.csv'
If the above links will not change, for readability, please save a base URL (up to the word export) and then have a base URL + the csv part for each area.