Can Shparkley package generate shap values for an entire validation dataset?

saichaitanyamolabanti commented 2 years ago

As seen in the simple.ipynb file, the Shparkley package generates SHAP values for a single data point. If we pass in several rows to be investigated, does Shparkley provide SHAP values for all of them?

Current:

query_row = Row(
    fico=600,
    loan_amount=300,
    number_of_delinquencies=1,
    repaid_all_previous_affirm_loans=0,
)
shapley_values_shparkley = compute_shapley_for_sample(
    df=train_spark_df,
    model=model_with_shparkley_interface,
    row_to_investigate=query_row,
)

Expected:

query_rows = [
    Row(fico=600, loan_amount=300, number_of_delinquencies=1, repaid_all_previous_affirm_loans=0),
    Row(fico=700, loan_amount=350, number_of_delinquencies=0, repaid_all_previous_affirm_loans=0),
    Row(fico=680, loan_amount=370, number_of_delinquencies=1, repaid_all_previous_affirm_loans=1),
]
shapley_values_shparkley = compute_shapley_for_sample(
    df=train_spark_df,
    model=model_with_shparkley_interface,
    row_to_investigate=query_rows,
)

saichaitanyamolabanti commented 2 years ago

@kevinwang @variablenix @ijoseph @prasad-kamat please help

ijoseph commented 2 years ago

The current algorithm is optimized to explain one row at a time (see https://github.com/Affirm/shparkley/blob/9e1c72cad8f1b4a46a0f375fcd6144020dedc4e4/affirm/model_interpretation/shparkley/spark_shapley.py#L46), so the best way to get what you describe at this point would be to call it serially, once per row, with query_rows being a list of Row objects:

shapley_values_shparkley = []
for query_row in query_rows:
    shapley_values_shparkley.append(
        compute_shapley_for_sample(
            df=train_spark_df,
            model=model_with_shparkley_interface,
            row_to_investigate=query_row,
        )
    )
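
If it helps, the serial loop can be wrapped in a small helper so the per-row results stay in the same order as the input rows. This is only a sketch under some assumptions: `explain_rows` is a hypothetical name (not part of shparkley), and `fake_compute_shapley_for_sample` is a stand-in that returns placeholder zeros so the snippet runs without Spark. It also assumes each call returns a mapping from feature name to value.

```python
from functools import partial
from typing import Any, Callable, Dict, Iterable, List


def explain_rows(
    rows: Iterable[Any],
    explain_one: Callable[..., Dict[str, float]],
) -> List[Dict[str, float]]:
    """Hypothetical helper: call explain_one once per row, serially.

    Results are returned in the same order as the input rows.
    """
    return [explain_one(row_to_investigate=row) for row in rows]


def fake_compute_shapley_for_sample(df, model, row_to_investigate):
    """Stand-in for the real function so this sketch runs without Spark.

    It returns 0.0 for every feature; these are placeholders, not real
    Shapley values.
    """
    return {name: 0.0 for name in row_to_investigate}


# Plain dicts stand in for pyspark Row objects in this sketch.
query_rows = [
    {"fico": 600, "loan_amount": 300},
    {"fico": 700, "loan_amount": 350},
]

# Bind the arguments that are the same for every row, then loop.
explain_one = partial(fake_compute_shapley_for_sample, df=None, model=None)
results = explain_rows(query_rows, explain_one)
```

With the real package, the same pattern would presumably bind `compute_shapley_for_sample` with `df=train_spark_df` and `model=model_with_shparkley_interface` in the `partial` call. Each row still triggers its own Spark job, so runtime grows linearly with the number of rows.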

saichaitanyamolabanti commented 2 years ago

okay thanks @ijoseph

Can Shparkley package generate shap values for an entire validation dataset? #6