The suggested workaround is to call unpack_json_and_merge when asked to explode destination_cbgs:
if array_column == 'destination_cbgs' and place_key == 'origin_census_block_group':  # hard-coded workaround
    return unpack_json_and_merge(df, array_column, value_col_name=value_col_name, key_col_name=array_sequence)
We can also try to catch any case where someone calls this function with a column that isn't a proper array to explode.
Normally I would say calling explode_json_array with a data column like 'bucketed_distance_traveled' is a mistake on the user's part. However, it is not obvious that you should call unpack_json_and_merge when you want to explode 'destination_cbgs': its keys look like integers, unlike the keys in 'bucketed_distance_traveled', which are clearly strings.
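To illustrate the two shapes (sample values invented; the formats follow the docs):
import json
# Both columns parse to dicts (key:value pairs), so neither is a true array.
destination_cbgs = json.loads('{"130890212162": 91, "131210101101": 22}')
bucketed_distance_traveled = json.loads('{"0-1000": 57, "16001-50000": 12}')
# destination_cbgs keys merely *look* numeric, which makes explode_json_array
# seem plausible, while a key like "0-1000" is obviously a string.
print(type(destination_cbgs), type(bucketed_distance_traveled))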
To generalize the approach, we can do something like:
df[array_column + '_json'] = load_json_nan(df, array_column)
try:
    day_visits_exp = df[[place_key, file_key, array_column + '_json']].explode(array_column + '_json')
except KeyError:
    return unpack_json_and_merge(df, array_column, value_col_name=value_col_name, key_col_name=array_sequence).drop(array_column + '_json', axis=1)
If we go with this approach, we might want to think about adding a parameter to unpack_json_and_merge and unpack_json that lets you specify that the column you are passing has already been loaded from JSON. That way we avoid calling load_json_nan twice on the same column.
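A rough sketch of that parameter, assuming the repo's load_json_nan helper; the json_loaded name and the None defaults are my guesses, not the current API:
def unpack_json_and_merge(df, array_column, value_col_name=None,
                          key_col_name=None, json_loaded=False):
    # If the caller (e.g. explode_json_array) already ran load_json_nan,
    # reuse the existing '_json' column instead of parsing a second time.
    if not json_loaded:
        df[array_column + '_json'] = load_json_nan(df, array_column)
    # ... unpack df[array_column + '_json'] and merge back as before ...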
@ryanfoxsquire What do you think about this approach to solve the problem? It should correct any case of someone incorrectly calling explode_json_array, solving the issue at hand while also making our functions easier to use.
We could also include a print statement that lets the user know that exploding failed and that we are returning unpack_json_and_merge instead. However, that would reintroduce the issue of printing a million times when multithreading, so maybe it's not worth it.
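If we do want some notice, one option (just a thought, nothing that exists in the repo today) is warnings.warn, since Python's default warning filter shows a given warning only once per call site rather than once per row:
import warnings
# Printed once under the default filter, even if the fallback triggers repeatedly.
warnings.warn("explode_json_array received a key:value column; "
              "falling back to unpack_json_and_merge", stacklevel=2)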
@bpblakely can you provide more details and a minimal example that reproduces the error?
You say:
I think in our case we get this error because if you try to parse 012345 as an integer you get an error, since an integer cannot start with a 0.
Can we confirm that that is exactly what triggers the error you are seeing? And can you post the entire error message you see?
It's still not clear to me why this error happens with the SDM destination_cbgs column but not with the Patterns visitor_home_cbgs column. I'd like to pinpoint what exactly causes the error and why it differs between SDM and Patterns.
@ryanfoxsquire The error happens to me when trying to use the Patterns visitor_home_cbgs column. Not sure why it isn't happening for you.
Here are the exact steps I used to get the error.
f_path1 = r'F:\SG\weekly patterns new\patterns\2020\07\01\23\patterns-part1.csv.gz'
df1 = sgpy.read_pattern_single(f_path1)
sgpy.explode_json_array(df1, array_column='visitor_home_cbgs')
Here's the error message:
Traceback (most recent call last):
File "<ipython-input-29-30dbc702917e>", line 1, in <module>
sgpy.explode_json_array(df1,array_column='visitor_home_cbgs')
File "G:\Python File Saves\safegraph_py.py", line 173, in explode_json_array
day_visits_exp = df[[place_key, file_key, array_column+'_json']].explode(array_column+'_json')
File "D:\Anaconda\lib\site-packages\pandas\core\frame.py", line 6318, in explode
result = df[column].explode()
File "D:\Anaconda\lib\site-packages\pandas\core\series.py", line 3504, in explode
values, counts = reshape.explode(np.asarray(self.array))
File "pandas\_libs\reshape.pyx", line 129, in pandas._libs.reshape.explode
KeyError: 0
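As a minimal reproduction without any SafeGraph files (a toy example of my own; the failure matches the traceback above on the pandas builds I have tried):
import pandas as pd
# A Series whose one element is a dict, i.e. what a visitor_home_cbgs cell parses to.
s = pd.Series([{"130890212162": 91, "131210101101": 22}])
# pandas treats the dict as list-like, then indexes it positionally (v[0]),
# which raises KeyError: 0 because the dict's keys are strings.
s.explode()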
FYI: This testing is being done on the most recent push of safegraph_py, so our 2 versions should be the same.
@bpblakely Wait, sorry, taking a step back. Why are you trying to use explode_json_array on either destination_cbgs or visitor_home_cbgs? I had not looked closely and I assumed this was an error with unpack_json() based on your original description. I apologize for misunderstanding.
explode_json_array is only meant for array columns such as Patterns visits_by_day or popularity_by_hour. These columns contain a string that is an array of numbers with no label or index, intended to be referenced by their order in the array, e.g.,
popularity_by_hour | A mapping of hour of day to the number of visits in each hour over the course of the date range in local time. First element in the array corresponds to the hour of midnight to 1 am. See also, Places Manual | JSON [Integer] | [ 0, 0, 0, 0, 0, 0, 0, 222, 546, 444, 333, 232, 432, 564, 456, 345, 678, 434, 545, 222, 0, 0, 0, 0 ]
In contrast, unpack_json_and_merge() is meant for columns that actually have key:value pairs, e.g.,
popularity_by_day | A mapping of day of week to the number of visits on each day (local time) in the course of the date range. See also, Places Manual | JSON {String: Integer} | {"Monday": 3300,"Tuesday": 1200,"Wednesday": 898,"Thursday": 7002,"Friday": 5001,"Saturday": 5987,"Sunday": 0}
(quotes from patterns docs page)
destination_cbgs falls into the second category:
destination_cbgs | Key is a destination census block group and value is the number of devices with a home in census_block_group that stopped in the given destination census block group for >1 minute during the time period. ... | JSON {String: Integer} | {"130890212162":91,"131210101101":22,"131350502123":20}
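For reference, the intended pairing would look something like this (call style mirrors the snippets earlier in this thread):
sgpy.explode_json_array(df, array_column='visits_by_day')        # JSON [Integer]
sgpy.unpack_json_and_merge(df, array_column='destination_cbgs')  # JSON {String: Integer}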
Is there some specific reason you are hoping to use explode_json_array() with the destination_cbgs column? It is not the intended use, and I would recommend using unpack_json_and_merge(). Maybe there is a use case I am not thinking about, or maybe we need to improve our documentation somewhere to avoid this confusion?
@ryanfoxsquire The error is because the function is being called on a dictionary, not a list. So the problem is still using the wrong function for the task. Someone asked me for help with this issue, and it wasn't obvious why there was an error or that he should be calling the other function; it even tripped me up for a day. So I think catching the problem and correcting it might be better.
To catch this error, we can apply load_json_nan to the first item of the column provided. After applying load_json_nan we can check whether the result is a dictionary or a list; if it's a dictionary, we can call unpack_json_and_merge instead.
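Here's a self-contained sketch of that check (the load_json_nan stand-in and the function shape are simplified assumptions modeled on the snippets above, not the repo's actual code):
import json
import pandas as pd

def load_json_nan(df, col):
    # Simplified stand-in: parse JSON strings, pass non-strings (e.g. NaN) through.
    return df[col].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

def explode_or_redirect(df, array_column):
    parsed = load_json_nan(df, array_column)
    first = parsed.dropna().iloc[0]
    if isinstance(first, dict):
        # Dictionary -> per the proposal, call unpack_json_and_merge here;
        # raising is just a placeholder to keep this sketch standalone.
        raise TypeError(array_column + " holds key:value pairs; use unpack_json_and_merge")
    return df.assign(**{array_column + '_json': parsed}).explode(array_column + '_json')

df = pd.DataFrame({'visits_by_day': ['[1, 2, 3]'],
                   'visitor_home_cbgs': ['{"130890212162": 91}']})
print(explode_or_redirect(df, 'visits_by_day'))  # list -> explodes to three rows
explode_or_redirect(df, 'visitor_home_cbgs')     # dict -> raises with a clear message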
More discussion here: https://safegraphcovid19.slack.com/archives/C014FK2QWNL/p1599184256028200?thread_ts=1599087924.021300&cid=C014FK2QWNL
conclusion:
Let's not add code to the repo just to handle the case of people misunderstanding which function to use and redirect them to the correct function on the backend. We should expect people to use the functions for their intended purposes. If people are not sure which function to use, then it is our failure in documentation/readme. That problem is better solved with improved documentation, not more code. I think we should consider additional documentation, possibly in the readme, to help people use the right function.
Not an error; planning to update docs to avoid confusion about misusing functions.
If you try to call explode_json_array on the destination_cbgs column from the social distancing data set, you will get an error when pandas' explode function is called.
Error thrown: "KeyError: 0"
This is the error you get when trying to explode a column containing JSON that stores data as {"string": integer}.
I think in our case we get this error because if you try to parse 012345 as an integer you get an error, since an integer cannot start with a 0.