Affirm / shparkley

Spark implementation for computing Shapley values using Monte Carlo approximation
BSD 3-Clause "New" or "Revised" License
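For context on what the library computes: the core idea behind Monte Carlo Shapley estimation is to sample random feature orderings and average each feature's marginal contribution. The sketch below is a minimal pure-Python illustration of that idea; the function name and signature are illustrative and are not shparkley's actual API.

```python
import random

def shapley_mc(model, x, baseline, n_features, n_iter=2000, seed=0):
    """Monte Carlo estimate of Shapley values (illustrative helper, not shparkley's API).

    model: callable taking a feature vector (list) and returning a float.
    x: the row to explain; baseline: a reference row of the same length.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_iter):
        # Sample a random ordering of features.
        perm = list(range(n_features))
        rng.shuffle(perm)
        # Walk the ordering, swapping in x's value for each feature in turn
        # and crediting the change in model output to that feature.
        current = list(baseline)
        prev_val = model(current)
        for j in perm:
            current[j] = x[j]
            new_val = model(current)
            phi[j] += new_val - prev_val
            prev_val = new_val
    return [p / n_iter for p in phi]

# For an additive model, every permutation yields the exact contributions:
model = lambda v: 2 * v[0] + 3 * v[1]
print(shapley_mc(model, [1.0, 1.0], [0.0, 0.0], 2))  # → [2.0, 3.0]
```

For non-additive models the estimates converge to the exact Shapley values as `n_iter` grows; shparkley's contribution is distributing this sampling over Spark.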

PicklingError: Could not serialize object: TypeError: can't pickle _abc_data objects #5

Open saichaitanyamolabanti opened 2 years ago

saichaitanyamolabanti commented 2 years ago

I wanted to try out this package because it implements a PySpark version of Shapley value generation. I copy-pasted the "simple.ipynb" file into my environment to check that everything basic works, but the code breaks at input cell [32]. Attached are the screenshots; could anyone please look into them?
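For readers hitting the same wall: Spark has to serialize the functions it ships to executors, and the stock `pickle` module rejects many such objects, which is why PySpark bundles `cloudpickle`. The snippet below reproduces the general `PicklingError` class of failure with the standard library alone; it is an illustration of the error category, not the exact `_abc_data` case (which, as discussed later in this thread, appears to stem from an old bundled cloudpickle running on Python >= 3.7).

```python
import pickle

# A lambda has no importable name, so the stock pickler cannot serialize it.
# Spark works around this class of failure by using cloudpickle instead.
f = lambda x: x + 1
try:
    pickle.dumps(f)
except (pickle.PicklingError, AttributeError) as e:
    print(type(e).__name__)
```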

saichaitanyamolabanti commented 2 years ago

@ijoseph @kevinwang @variablenix @prasad-kamat please help

ijoseph commented 2 years ago

Wow, we really should have pinned (and pip-compiled, too) our requirements file below. Let me see if I can get something working and try to update it. https://github.com/Affirm/shparkley/blob/master/examples/requirements.txt

ijoseph commented 2 years ago

Alright, @saichaitanyamolabanti, can you please try pulling this PR, https://github.com/Affirm/shparkley/pull/7, then pip install -r examples/macos-py3.10-requirements.txt if you happen to have macOS and an empty Python 3.10 environment, or pip install -r examples/requirements.in otherwise? That particular set of third-party requirements worked for me.

saichaitanyamolabanti commented 2 years ago

Hey @ijoseph, I followed your comments and began installing the libraries, mainly installing and importing cloudpickle. Here are my observations; I can still see some errors, please help!!

Scenario-1:

import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle

then

row = dataset.filter(dataset.xxxx == '5').rdd.first()

is working fine.

Scenario-2:

import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle

then

row = dataset.filter(dataset.xxxx == '5').rdd.first()

is throwing the error below (see screenshot).

saichaitanyamolabanti commented 2 years ago

I then tried moving the cloudpickle and pyspark.serializers imports below the row under investigation, like:

row = dataset.filter(dataset.xxxx == '5').rdd.first()
import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle

but I still see an error: cloudpickle doesn't have the method 'print_exec'.
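A note on that `print_exec` error: older PySpark releases bundled a cloudpickle that provided a `print_exec` helper, which the `pyspark.serializers` module calls when reporting serialization failures; newer standalone cloudpickle releases no longer ship it, so monkey-patching a new cloudpickle into an old PySpark mixes incompatible versions. One possible workaround (an untested sketch, not verified against every version combination) is to restore a compatible shim before patching:

```python
import sys
import traceback

def print_exec(stream):
    # Mirror the helper that PySpark's bundled cloudpickle used to provide:
    # print the current exception's traceback to the given stream.
    ei = sys.exc_info()
    traceback.print_exception(ei[0], ei[1], ei[2], None, stream)

# Then attach it to whichever cloudpickle module gets patched in, e.g.:
# import cloudpickle, pyspark.serializers
# cloudpickle.print_exec = print_exec
# pyspark.serializers.cloudpickle = cloudpickle
```

Upgrading pyspark itself (see the suggestion later in this thread) is likely the cleaner fix.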

saichaitanyamolabanti commented 2 years ago

@ijoseph Or you can consider this scenario: I've tried the same simple.ipynb example after installing the cloudpickle library, importing pyspark.serializers, and patching it with cloudpickle like below:

import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle

I'm getting the error in the attached screenshot, please help!!

saichaitanyamolabanti commented 2 years ago

@ijoseph @kevinwang @variablenix @prasad-kamat any help?

m-aciek commented 2 years ago

Isn't it this issue? It looks like it was solved in pyspark 3.0.0 (PR). So maybe it would be enough to set a lower bound on the pyspark dependency in setup.py?

REQUIRED_PACKAGES = [
    …,
    'pyspark>=3.0.0',
]
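If it helps anyone debugging locally before such a bound lands, here is a naive check of whether an installed version string meets the proposed lower bound. This is a sketch only: real resolvers follow PEP 440 semantics (e.g. `packaging.version.Version`), and the plain numeric split below assumes simple `X.Y.Z` version strings.

```python
def meets_lower_bound(installed: str, bound: str = "3.0.0") -> bool:
    # Naive tuple comparison of dotted numeric versions; pre-release tags
    # like "3.0.0rc1" would break this, hence "sketch only".
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(bound)

print(meets_lower_bound("2.4.5"))  # → False (predates the fix)
print(meets_lower_bound("3.1.2"))  # → True
```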