Closed: bpblakely closed this 4 years ago
@bpblakely this is awesome.
The core functionality looks awesome, great work!
I have a few comments (also for @Trippl7777) regarding the overall structure/organization of `unpack_json()`, `unpack_json_and_merge()`, `process_unpack_json()`, `unpack_json_fast()`, `unpack_json_and_merge_fast()`, `process_explode_json_array()`, and `explode_json_array_fast()`, with the general goal to reduce duplicated code and streamline.
Great, simple fix to build a `json.loads` that handles missing data. It is used in many places, so maybe we should make a simple one-line function:

```python
# json.loads(), but with handling of missing/NaN/non-string data.
def load_json_nan(df, json_col):
    return df[json_col].apply(lambda x: json.loads(x) if type(x) == str else x)
```

You would use it like this:

```python
df[json_column + '_dict'] = load_json_nan(df, json_column)
```
And then we use that function in `unpack_json()`, `explode_json_array()`, `process_unpack_json()`, and `unpack_json_and_merge_fast()`.
The reasoning is that it makes the code slightly more readable, it avoids possible errors when copying/pasting/duplicating code, AND if we ever want to augment our error handling or missing-data handling (e.g., maybe in the future we should not just check for a string, but check that it is valid JSON?), we can do it in one centralized function. Then we can also delete the lines where we explicitly remove missing JSON, and delete the comments about it, unless you think there is some reason to keep them.
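To make that concrete, here is a minimal sketch (not the library's actual implementation; parameter names are taken from the thread) of how `unpack_json()` could lean on the helper:

```python
import json

# json.loads(), but with handling of missing/NaN/non-string data.
def load_json_nan(df, json_col):
    return df[json_col].apply(lambda x: json.loads(x) if type(x) == str else x)

# Sketch only: the body is abbreviated; the point is where the centralized
# helper replaces an inline apply(json.loads) plus ad-hoc NaN checks.
def unpack_json(df, json_column, key_col_name=None, value_col_name=None):
    if key_col_name is None:
        key_col_name = json_column + '_key'
    if value_col_name is None:
        value_col_name = json_column + '_value'
    df = df.copy()
    df[json_column + '_dict'] = load_json_nan(df, json_column)
    # ...the rest of the existing unpacking logic (exploding the dict into
    # key/value rows) would stay exactly as it is now...
    return df
```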
It seems to me that `unpack_json()` and `process_unpack_json()` do basically the same thing. Is there a reason we cannot simply modify `unpack_json()` so that it retains its original functionality but also meets the `_fast` needs, rather than create a highly redundant function `process_unpack_json()`? Then `unpack_json_fast()` is ultimately a wrapper around `unpack_json()`, which seems to make sense to me.
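For what it's worth, a rough sketch of what that wrapper shape could look like, assuming `unpack_json()` stays standalone and reusing the chunking pattern already in the PR (signatures here are illustrative, not the PR's exact ones):

```python
from functools import partial
from multiprocessing import Pool

import pandas as pd

# Assumes the standalone unpack_json() sketched above is defined at module level.
def unpack_json_fast(df, json_column, key_col_name=None, value_col_name=None,
                     chunk_n=10_000):
    # Split the dataframe into chunks, run the existing unpack_json() on each
    # chunk in a separate process, then stitch the results back together.
    chunks_list = [df[i:i + chunk_n] for i in range(0, df.shape[0], chunk_n)]
    worker = partial(unpack_json, json_column=json_column,
                     key_col_name=key_col_name, value_col_name=value_col_name)
    with Pool() as pool:
        results = pool.map(worker, chunks_list)
    return pd.concat(results)
```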
@Trippl7777 I've just noticed (correct me if I'm wrong) that `unpack_json_and_merge()` does not need to have this logic:

```python
if (key_col_name is None):
    key_col_name = json_column + '_key'
if (value_col_name is None):
    value_col_name = json_column + '_value'
```

because it will "inherit" that functionality when it calls `unpack_json()`. Therefore we can remove this redundant code from `unpack_json_and_merge()`. Similarly, @bpblakely, the function `unpack_json_and_merge_fast()` does not need that logic, because that functionality will be achieved when it calls `unpack_json_fast()`. So we can remove that logic from `unpack_json_and_merge_fast()`.
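In other words (a sketch, assuming the signatures from the thread), the merge variants could just pass the possibly-None names straight through:

```python
# Assumes the unpack_json() sketched above.
def unpack_json_and_merge(df, json_column, key_col_name=None, value_col_name=None):
    # No need to recompute the default names here: unpack_json() fills in
    # '<json_column>_key' / '<json_column>_value' when these are None.
    unpacked = unpack_json(df, json_column, key_col_name, value_col_name)
    # ...the existing merge-back logic would continue unchanged from here...
    return unpacked
```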
Unless I'm mistaken, I think all of the above comments can also apply to the array functions: `explode_json_array()` is basically identical to `process_explode_json_array()`, so let's keep only one function that works standalone and can also be called by `explode_json_array_fast()`; the duplicated version does not need to exist, right?

How important are the `del df` commands in `explode_json_array_fast()` and `unpack_json_fast()`? Do they really improve performance? When we run

```python
chunks_list = [df[i:i+chunk_n] for i in range(0, df.shape[0], chunk_n)]
```

I'm not sure we are making new deep copies of the chunks, versus just making "views" or references to them. If we are just making references to the original object, then the `del` command may not make a difference, because there are still existing references to it, and/or it may have unintended consequences. I'm not 100% sure, but I don't think the `del` commands are actually doing anything here, and maybe we can remove them. If you have evidence that they're making a difference on memory, please let me know.
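One quick, non-authoritative way to probe this outside the library (toy data, not the SafeGraph schema):

```python
import sys

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000)})
chunk_n = 250_000
chunks_list = [df[i:i + chunk_n] for i in range(0, df.shape[0], chunk_n)]

# Do the chunks share the original's underlying buffer? The answer can depend
# on pandas version and dtypes, so treat this as a probe rather than proof.
print(np.shares_memory(df["a"].to_numpy(), chunks_list[0]["a"].to_numpy()))

# `del df` only removes this one name. If the chunks (or anything else) still
# reference the underlying data, the memory is not actually released.
print(sys.getrefcount(df))
del df
```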
1. That's no big deal and should be fine.

2 & 4: `process_unpack_json` is actually the function that does all of the work! `unpack_json` is a function that more or less sets up the multithreaded computation for you. That function distributes partitions of the dataframe to `process_unpack_json` (this function is what is actually run in parallel). If you try to hide the function inside the other function like normal:

```python
def unpack_json_fast(...):
    def process_unpack_json(...):
```

you will get an error, because multithreading is weird? Not too sure why it doesn't work, I just have a workaround for it. Ideally `process_unpack_json` would be a private function that is inaccessible to a user, but with Python you can only define a function as private in the context of a class (unless there is another way I'm not aware of). So by just omitting the documentation I think we more or less achieve the same result of saying that the function isn't supposed to be used by the user. I'll throw in a comment above the processing functions to state that they are for multithreading and shouldn't be called directly.
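For reference, a stripped-down version of the pattern being described (the worker bodies are stubs, not the PR's code). The usual failure when the worker is nested is a pickling error, since `multiprocessing.Pool` pickles the function it sends to the child processes and nested/local functions can't be pickled:

```python
from multiprocessing import Pool

import pandas as pd

# Module-level worker: it has to be importable/picklable so Pool can ship it
# to the child processes. (Stub body; the real per-chunk work goes here.)
def process_unpack_json(chunk):
    return chunk

def unpack_json_fast(df, chunk_n=10_000):
    chunks_list = [df[i:i + chunk_n] for i in range(0, df.shape[0], chunk_n)]
    with Pool() as pool:
        results = pool.map(process_unpack_json, chunks_list)
    return pd.concat(results)

# Defining process_unpack_json *inside* unpack_json_fast instead typically
# fails with "Can't pickle local object ..." for the reason above.
```

(As a side note, a leading underscore, e.g. `_process_unpack_json`, is the usual module-level convention for signaling "not part of the public API".)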
5. It should be the case that it is just a reference, but while testing it felt like way too much memory was being used while running in parallel, and deleting dataframes when they were no longer needed more or less solved it (not all deletes may be impactful, but they don't really hurt). If we weren't making deep copies, then I'd expect modifying the chunks of the dataframe to affect the original dataframe, but that isn't the case. I tested this by doing

```python
with Pool() as pool:
    pool.map(function, chunks_list)
return df
```

instead of

```python
with Pool() as pool:
    results = pool.map(function, chunks_list)
return pd.concat(results)
```

The result is just the normal, unmodified dataframe.
Another note on your 2nd comment: you can modify the normal functions to use the `process_` functions to reduce redundancy.

On the redundant default-name logic: I am actually not sure. I suppose it should inherit it, but I will test it tomorrow morning in a Colab and make sure before I change the actual file. Maybe do a hotfix tomorrow morning.
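A quick check along those lines could look like this (column names and the call signature are assumptions based on the thread, not copied from the repo):

```python
import pandas as pd

# Assumes unpack_json_and_merge is imported from the module under review.
# One-row fixture with a JSON column, in the spirit of the patterns data.
sample = pd.DataFrame({
    "safegraph_place_id": ["sg:abc"],
    "visitor_home_cbgs": ['{"123456789012": 4, "210987654321": 7}'],
})

# If the duplicated default-name logic is removed, the defaults should still
# come through from unpack_json():
merged = unpack_json_and_merge(sample, json_column="visitor_home_cbgs")
assert "visitor_home_cbgs_key" in merged.columns
assert "visitor_home_cbgs_value" in merged.columns
```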
@bpblakely

1. OK thanks, I'll look for this change.

5. I think that's because the function you are mapping across the chunk list RETURNS dataframes; it doesn't alter the inputs in place. So if you don't assign the output to a new variable, the results of those function calls are just being computed, not assigned to anything, and dropped. I'm not sure; I see your point, and I also am a little confused about exactly how it works. I would prefer not to have any `del df` that we aren't SURE is helping. I don't mind if they are helping, but I would prefer not to have them unless we are sure they are helping. An easy way to know whether they are helping would be to run a side-by-side test of both versions, with and without a particular `del df` command, on the same data, using `%%time` or some other benchmarking metric.
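Something like the following could serve as that side-by-side check (a sketch; the two function variants are hypothetical names for local builds with and without the `del df` lines, and `tracemalloc` only sees allocations in the parent process, not the Pool workers):

```python
import time
import tracemalloc

def benchmark(fn, df, **kwargs):
    # Time one call and report the peak Python-level allocation traced in
    # this (parent) process.
    tracemalloc.start()
    start = time.perf_counter()
    fn(df, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Hypothetical: two local builds of the same function, differing only in the
# del statements.
# print(benchmark(unpack_json_fast_with_del, df, json_column="visitor_home_cbgs"))
# print(benchmark(unpack_json_fast_without_del, df, json_column="visitor_home_cbgs"))
```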
3. There was an action item for you in here also. OK?
2 & 4: I think you misunderstand me. You write:

> unpack_json is a function that more or less sets up the multithreaded computation for you.

No, `unpack_json_fast` does that, right? `unpack_json` is the original function. What I'm saying is that the original function `unpack_json` and your new function `process_unpack_json` seem to basically do the same thing. Can we just modify the original function `unpack_json` to meet the needs of `unpack_json_fast()`, while retaining all of its original functionality for users? Put another way, as the PR currently stands, the functions `unpack_json` and `process_unpack_json` seem highly redundant. Instead of having three functions, `unpack_json()`, `process_unpack_json()`, and `unpack_json_fast()`, can we just have two functions, `unpack_json()` and `unpack_json_fast()`? I don't see why you need a separate "private" function that is different from `unpack_json`, but please explain if I am missing something.
Otherwise, I think this looks great; just a few cosmetic comments mentioned in-line above. Super excited for this. Ready to merge after those last couple of changes.
I think it's more or less ready to go. Let me know if you have any other minor things you want to change for clarity.
Pushed out the final changes! Should be good.
Added multithreaded functions for the normal JSON functions.