As I attempted to resolve the issue I noted in this PR, I "unravelled" a number of components within utils_format.py that were broken. This PR fixes the errors insofar as it assumes that resolving the errors does not negatively affect the quality or accuracy of the output dataframes.
I noticed there is a Travis file. Also, looking through the tests, it looks like this segment my PR covers is missing tests. I'll make an effort to produce some tests to cover this component as well so that, in the future, modifications can be made on these components with confidence.
That said, I wanted to open the PR right now to "get the discussion going" and also ensure that my modifications were not in anyway creating observable negative downstream side effects (i.e. rows being left in the dataframes, etc.).
Issues being resolved in this PR:
In calendar_dates_agencyid, the merge with tmp2 requires service_id, yet it is not provided in the prior dataframes used to create that variable.
Resolving that unveils another issue: The subsitution step in calendar_dates_agencyid includes the conversion of spaces to underscores: sub(r'\s+', '_', col). This errors when performed on a dataframe column. I note that there are examples showing this is doable, perhaps it is not supported on some versions of Pandas. Because we don't specify exactly which version of each dependency we use (the greater than or equal option is used in the requires.txt), we should opt for a more conservative method in this step.
Resolving that step reveals a number of attempted merges on Series instead of dataframes. For example, this line in calendar_agencyid: merged_df = pd.merge(calendar_df['service_id'], tmp2, how='left', on='service_id', sort=False, copy=False). calendar_df['service_id'] returns a Pandas Series. It needs to be nested once more to retain the dataframe.
Line 426-5 has the same issue of their not being a service_id included in the initial few dataframes, as well.
stop_times_agencyid function attempts to merge a Series (trips_df['trip_id']) with a DataFrame, as well.
Also, issue here (with same code from Item 5) is that you can't left join on route_id if it is not included at all in one of the dataframes.
The elif len(agency_df['agency_name']) > 1: logic segment never udpates the overall df_list global, resulting in a KeyError when looking for unique_agency_id in the next step.
As I attempted to resolve the issue I noted in this PR, I "unravelled" a number of components within
utils_format.py
that were broken. This PR fixes the errors insofar as it assumes that resolving the errors does not negatively affect the quality or accuracy of the output dataframes.I noticed there is a Travis file. Also, looking through the tests, it looks like this segment my PR covers is missing tests. I'll make an effort to produce some tests to cover this component as well so that, in the future, modifications can be made on these components with confidence.
That said, I wanted to open the PR right now to "get the discussion going" and also ensure that my modifications were not in anyway creating observable negative downstream side effects (i.e. rows being left in the dataframes, etc.).
Issues being resolved in this PR:
In
calendar_dates_agencyid
, the merge withtmp2
requiresservice_id
, yet it is not provided in the prior dataframes used to create that variable.Resolving that unveils another issue: The subsitution step in
calendar_dates_agencyid
includes the conversion of spaces to underscores:sub(r'\s+', '_', col)
. This errors when performed on a dataframe column. I note that there are examples showing this is doable, perhaps it is not supported on some versions of Pandas. Because we don't specify exactly which version of each dependency we use (the greater than or equal option is used in therequires.txt
), we should opt for a more conservative method in this step.Resolving that step reveals a number of attempted merges on Series instead of dataframes. For example, this line in
calendar_agencyid
:merged_df = pd.merge(calendar_df['service_id'], tmp2, how='left', on='service_id', sort=False, copy=False)
.calendar_df['service_id']
returns a Pandas Series. It needs to be nested once more to retain the dataframe.Line 426-5 has the same issue of their not being a
service_id
included in the initial few dataframes, as well.stop_times_agencyid
function attempts to merge a Series (trips_df['trip_id']
) with a DataFrame, as well.Also, issue here (with same code from Item 5) is that you can't left join on
route_id
if it is not included at all in one of the dataframes.The
elif len(agency_df['agency_name']) > 1:
logic segment never udpates the overalldf_list
global, resulting in aKeyError
when looking forunique_agency_id
in the next step.