UDST / urbanaccess

A tool for GTFS transit and OSM pedestrian network accessibility analysis by UrbanSim
https://udst.github.io/urbanaccess/index.html
GNU Affero General Public License v3.0
236 stars 56 forks source link

enable utils_format.py and load.gtfsfeed_to_df to complete successfully #13

Closed kuanb closed 7 years ago

kuanb commented 7 years ago

As I attempted to resolve the issue I noted in this PR, I "unravelled" a number of components within utils_format.py that were broken. This PR fixes the errors insofar as it assumes that resolving the errors does not negatively affect the quality or accuracy of the output dataframes.

I noticed there is a Travis file. Also, looking through the tests, it looks like this segment my PR covers is missing tests. I'll make an effort to produce some tests to cover this component as well so that, in the future, modifications can be made on these components with confidence.

That said, I wanted to open the PR right now to "get the discussion going" and also ensure that my modifications were not in anyway creating observable negative downstream side effects (i.e. rows being left in the dataframes, etc.).

Issues being resolved in this PR:

  1. In calendar_dates_agencyid, the merge with tmp2 requires service_id, yet it is not provided in the prior dataframes used to create that variable.

  2. Resolving that unveils another issue: The subsitution step in calendar_dates_agencyid includes the conversion of spaces to underscores: sub(r'\s+', '_', col). This errors when performed on a dataframe column. I note that there are examples showing this is doable, perhaps it is not supported on some versions of Pandas. Because we don't specify exactly which version of each dependency we use (the greater than or equal option is used in the requires.txt), we should opt for a more conservative method in this step.

  3. Resolving that step reveals a number of attempted merges on Series instead of dataframes. For example, this line in calendar_agencyid: merged_df = pd.merge(calendar_df['service_id'], tmp2, how='left', on='service_id', sort=False, copy=False). calendar_df['service_id'] returns a Pandas Series. It needs to be nested once more to retain the dataframe.

  4. Line 426-5 has the same issue of their not being a service_id included in the initial few dataframes, as well.

  5. stop_times_agencyid function attempts to merge a Series (trips_df['trip_id']) with a DataFrame, as well.

  6. Also, issue here (with same code from Item 5) is that you can't left join on route_id if it is not included at all in one of the dataframes.

  7. The elif len(agency_df['agency_name']) > 1: logic segment never udpates the overall df_list global, resulting in a KeyError when looking for unique_agency_id in the next step.

  8. General TODOs identifying brittle points in the logic.

kuanb commented 7 years ago

Thanks for the review, https://github.com/UDST/urbanaccess/pull/13/commits/5379f4716e2915da75540563f6bb23e095965a86 introduces the requested modifications (remove TODO, implement suggestion) and also reduces subsetting operations on dataframes to make the code more DRY in that section.

Ready for a squash/merge if nothing else remains!