Closed kuanb closed 7 years ago
Thanks for the review, https://github.com/UDST/urbanaccess/pull/13/commits/5379f4716e2915da75540563f6bb23e095965a86 introduces the requested modifications (remove TODO, implement suggestion) and also reduces subsetting operations on dataframes to make the code more DRY in that section.
Ready for a squash/merge if nothing else remains!
As I attempted to resolve the issue I noted in this PR, I "unravelled" a number of components within
utils_format.py
that were broken. This PR fixes the errors insofar as it assumes that resolving the errors does not negatively affect the quality or accuracy of the output dataframes.I noticed there is a Travis file. Also, looking through the tests, it looks like this segment my PR covers is missing tests. I'll make an effort to produce some tests to cover this component as well so that, in the future, modifications can be made on these components with confidence.
That said, I wanted to open the PR right now to "get the discussion going" and also ensure that my modifications were not in anyway creating observable negative downstream side effects (i.e. rows being left in the dataframes, etc.).
Issues being resolved in this PR:
In
calendar_dates_agencyid
, the merge withtmp2
requiresservice_id
, yet it is not provided in the prior dataframes used to create that variable.Resolving that unveils another issue: The subsitution step in
calendar_dates_agencyid
includes the conversion of spaces to underscores:sub(r'\s+', '_', col)
. This errors when performed on a dataframe column. I note that there are examples showing this is doable, perhaps it is not supported on some versions of Pandas. Because we don't specify exactly which version of each dependency we use (the greater than or equal option is used in therequires.txt
), we should opt for a more conservative method in this step.Resolving that step reveals a number of attempted merges on Series instead of dataframes. For example, this line in
calendar_agencyid
:merged_df = pd.merge(calendar_df['service_id'], tmp2, how='left', on='service_id', sort=False, copy=False)
.calendar_df['service_id']
returns a Pandas Series. It needs to be nested once more to retain the dataframe.Line 426-5 has the same issue of their not being a
service_id
included in the initial few dataframes, as well.stop_times_agencyid
function attempts to merge a Series (trips_df['trip_id']
) with a DataFrame, as well.Also, issue here (with same code from Item 5) is that you can't left join on
route_id
if it is not included at all in one of the dataframes.The
elif len(agency_df['agency_name']) > 1:
logic segment never udpates the overalldf_list
global, resulting in aKeyError
when looking forunique_agency_id
in the next step.General
TODO
s identifying brittle points in the logic.