department-for-transport-BODS / bods-data-extractor

A python client for downloading and extracting data from the UK Bus Open Data Service

Refactoring header information and order otc_db_download file #78

Closed · EOstridge closed this 1 year ago

EOstridge commented 1 year ago

As a data consumer, it would be useful to refactor the otc_db_download file in the following ways to improve accessibility.

Suggestion 1

  1. Include a standard header with version, update date, author etc.
  2. Also in the header, a brief overview of the order of execution and explanation of functions would be useful.
  3. Ensure the functions are written in a logical order, e.g. main function that calls others, then the other functions to follow either in the order they are called or most functional to least functional (helper) functions.

Suggested order for refactoring:

  1. get_otc_db(save=True) (see explanation in ticket #)
  2. Then a much simpler save() function which combines the today/download folder/save code.

Suggestion 2

```python
west_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_H.csv'
west_midlands = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_D.csv'
london_south_east = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_K.csv'
north_west_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_C.csv'
north_east_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_B.csv'
east_england = 'https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/data-gov-uk-export/Bus_RegisteredOnly_F.csv'
```

If the above links will not change, then for readability please define a base URL (up to the word "export") and build each area's URL as the base URL plus the CSV filename.
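One way that refactor could look (a sketch; the `BASE_URL` and `OTC_REGION_FILES` names are suggestions, not existing code):

```python
# shared prefix for all OTC region exports
BASE_URL = ('https://content.mgmt.dvsacloud.uk/olcs.prod.dvsa.aws/'
            'data-gov-uk-export/')

# region name -> CSV filename for that region's traffic area
OTC_REGION_FILES = {
    'west_england': 'Bus_RegisteredOnly_H.csv',
    'west_midlands': 'Bus_RegisteredOnly_D.csv',
    'london_south_east': 'Bus_RegisteredOnly_K.csv',
    'north_west_england': 'Bus_RegisteredOnly_C.csv',
    'north_east_england': 'Bus_RegisteredOnly_B.csv',
    'east_england': 'Bus_RegisteredOnly_F.csv',
}

# full download URL for each region
otc_db_files = [BASE_URL + filename for filename in OTC_REGION_FILES.values()]
```

If a filename ever changes, only the dictionary entry needs updating.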

lldwork commented 1 year ago

Thanks Emily!

Can I please add a suggestion 3: use the existence (or not) of the 'today' folder (and, if necessary, the file name in that folder) to run the download code only if the data hasn't already been downloaded today. This will save a lot of time!

lldwork commented 1 year ago

Suggestion 4: remove fetch db and have a single method, `get_otc_db(save=False)`. For example:

```python
import io

import pandas as pd
import requests

# otc_db_files, create_today_folder, get_user_downloads_folder and today
# are assumed to be defined elsewhere in this module
def get_otc_db(save=False):
    # instantiate a list in which each region's dataframe will reside
    otc_regions = []

    # loop through regions as per OTC DB page
    for region in otc_db_files:
        # get the raw text from the link, should be a csv file
        text_out = requests.get(region).content

        # convert to a dataframe
        df = pd.read_csv(io.StringIO(text_out.decode('utf-8-sig')))

        # append current region df to regions list
        otc_regions.append(df)

    # combine to a single dataframe
    print('Merging files...')
    otc_db = pd.concat(otc_regions)
    otc_db['service_code'] = otc_db['Reg_No'].str.replace('/', ':')

    # postgresql does not like uppercase or spaces - removing from column titles
    otc_db.columns = [c.lower() for c in otc_db.columns]
    otc_db.columns = [c.replace(' ', '_') for c in otc_db.columns]

    # remove duplicate rows
    otc_db = otc_db.drop_duplicates()

    if save:
        create_today_folder()
        downloads_folder = get_user_downloads_folder()
        save_loc = downloads_folder + '/' + today + f'/otc_db_{today}.csv'
        otc_db.to_csv(save_loc, index=False)

    return otc_db
```
lldwork commented 1 year ago

Suggestion 5

Consider more succinct ways of getting the user's downloads folder and making the 'today' folder; for example, there is no need to list out the contents of the directory. How about, when the save flag is true, letting the user choose where to save the file? In particular, for the error raised on a non-recognised OS: rather than interrupting programme flow completely, change this to allow the user to select a location.
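One succinct option, sketched with pathlib. `Path.home()` works across Windows, macOS, and Linux, so no per-OS branching or directory listing is needed; the `ask` parameter is a suggested way to fall back to a user prompt instead of raising:

```python
from pathlib import Path

def get_user_downloads_folder(ask=input):
    """Return the user's Downloads folder, asking the user if it can't be found."""
    candidate = Path.home() / 'Downloads'
    if candidate.is_dir():
        return candidate
    # instead of raising for an unrecognised OS/layout, let the user choose
    return Path(ask('Downloads folder not found; enter a save location: '))
```

Passing `ask` as a parameter also makes the fallback easy to test without patching builtins.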

lldwork commented 1 year ago

@EOstridge please feel free to separate into different issues if so required

adamakram1 commented 1 year ago

Changes have been addressed in the Data pull CM - Data pull optimizations and refactoring #82