department-for-transport-BODS / bods-data-extractor

A python client for downloading and extracting data from the UK Bus Open Data Service

Mixed Dtypes warning when generating timetables #22

Closed spencer-b-318 closed 1 year ago

spencer-b-318 commented 1 year ago

Console output warns columns have mixed types after "Calling Naptan API to get lat/lon for each stop... " and before "Generating timetable"

Steps to reproduce the behavior:

  1. set stop_level = True
  2. run for any number of datasets
  3. See the mixed-types warning in the console
  4. code example:
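# Import path assumed; adjust to match your installed version of the package
from BODSDataExtractor.extractor import TimetableExtractor

api = "YOUR_API_KEY"  # placeholder: your API key here

timetable_ = TimetableExtractor(
    api_key=api,              # your API key
    limit=1,                  # how many datasets to view
    status='published',       # only view published datasets
    service_line_level=True,  # True if you require service line data
    stop_level=True,          # True if you require stop level data
)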

Expect to see no warnings/errors on timetable generation

Screenshots

image

EOstridge commented 1 year ago

We discussed the suggested fix of passing low_memory=False and the alternative of setting each column's data type on import. We decided to go with low_memory=False: the performance difference was not noticeable in testing, and it suppresses the dtype warning.
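
For reference, low_memory is a keyword argument of pandas.read_csv. The following is a minimal sketch of the option under discussion, assuming a small in-memory CSV with a hypothetical stop_code column; it is not the package's actual loading code, and the real stops file and its column names may differ.

```python
import io

import pandas as pd

# Hypothetical sample data; the real stops file is far larger.
csv_text = "stop_code,lat\n1001,51.5\n1002,51.6\nA103,51.7\n"

# low_memory=False asks the parser to infer each column's type from the
# whole file in one pass instead of chunk by chunk.
stops = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# The column that mixes digits and letters comes back as text throughout.
all_text = all(isinstance(value, str) for value in stops["stop_code"])
print(all_text)
```

Note that this only makes the inference consistent; it does not say what type each column was meant to have.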

lldwork commented 1 year ago

There is a third alternative solution; low_memory=False is not necessarily best practice. Was testing done with the full dataset? Setting dtype does take longer, but it allows you to easily spot incorrect data. Just be aware of why the dtype warning occurs: when low_memory=True, the file is split into chunks for processing, and if one chunk has all integers for a column while another has strings mixing numbers and letters, you will get the mixed-types warning.
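
The chunking behaviour described in this comment can be illustrated with a rough sketch. It uses an explicit chunksize as a stand-in for the parser's internal chunking (an assumption made so the effect shows up on a tiny sample), and the stop_code column name is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample: the first two rows look numeric, the last two do not.
csv_text = "stop_code\n1001\n1002\nA103\nB104\n"

# Read two rows at a time; each chunk has its types inferred independently.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

first_is_int = pd.api.types.is_integer_dtype(chunks[0]["stop_code"])
second_is_int = pd.api.types.is_integer_dtype(chunks[1]["stop_code"])
print(first_is_int, second_is_int)
```

The same column ends up with a different inferred type in each chunk, which is the kind of inconsistency the warning reports.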

spencer-b-318 commented 1 year ago

The recommendation from @lldwork is not to suppress the warning, which just kicks the problem down the line, but instead to type the affected columns correctly (only 4 of them) using the dtype argument. Should be low effort.
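
A minimal sketch of the dtype approach, assuming hypothetical column names (stop_code, locality_code); the four columns actually affected are not named in this thread, so the mapping below is illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample data and column names.
csv_text = "stop_code,locality_code\n1001,E001\n1002,E002\nA103,E003\n"

# Declare the intended type of each affected column up front.
column_types = {"stop_code": str, "locality_code": str}
stops = pd.read_csv(io.StringIO(csv_text), dtype=column_types)
codes_are_text = all(isinstance(value, str) for value in stops["stop_code"])

# Declaring a numeric type instead makes unexpected values fail loudly,
# which is one way incorrect data gets spotted early.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"stop_code": "int64"})
    bad_data_found = False
except ValueError:
    bad_data_found = True

print(codes_are_text, bad_data_found)
```

With explicit types the parser no longer has to guess, so the result does not depend on how the file happens to be chunked.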