easeml / datascope

Measuring data importance over ML pipelines using the Shapley value.
https://ease.ml/datascope
MIT License

"compute_all_importances_cy" data type mismatch #3

Open babak-kananpour opened 2 years ago

babak-kananpour commented 2 years ago

Thanks for this great library. I followed the instructions in the readme.md file and ran the setup.

I get the following error when running the notebook "DataScope-Demo-1.ipynb":

...\datascope\importance\shapley.py in compute_shapley_1nn_mapfork(distances, utilities, provenance, units, world, null_scores, simple_provenance)
    219     n_test = distances.shape[1]
    220     null_scores = null_scores if null_scores is not None else np.zeros((1, n_test))
--> 221     all_importances = compute_all_importances_cy(unit_distances, unit_utilities, null_scores)
    222 
    223     # Aggregate results.

datascope/importance/shapley_cy.pyx in datascope.importance.shapley_cy.compute_all_importances_cy()

ValueError: Buffer dtype mismatch, expected 'int_t' but got 'long long'

It seems there is a type issue when calling the compute_all_importances_cy function: judging by the traceback, it expects one integer type ('int_t') but receives another ('long long').

I tried to modify compute_all_importances_cy in shapley_cy.pyx, but I had no luck fixing this bug.
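For context on the error above: 'int_t' and 'long long' are both integer types, and this kind of mismatch usually stems from NumPy's platform-dependent default integer rather than from floats. A minimal sketch of the difference, assuming NumPy 1.x semantics where np.int_ aliases the platform C long:

```python
import numpy as np

# On NumPy 1.x, np.int_ aliases the platform C "long":
# 8 bytes on Linux/macOS, but only 4 bytes under MSVC on Windows.
# A Cython buffer declared with np.int_t therefore rejects int64
# ("long long") arrays on Windows, raising exactly the error above.
platform_int = np.dtype(np.int_)
fixed_int = np.dtype(np.int64)

# Casting to a fixed-width dtype removes the ambiguity; copy=False
# avoids a copy when the dtype already matches.
arr64 = np.arange(4).astype(np.int64, copy=False)
print(platform_int, fixed_int, arr64.dtype)
```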

xzyaoi commented 2 years ago

Hi, I just ran the notebook again but didn't encounter the same issue. Could you please try again with the latest notebook linked in the readme file? If the error persists, I can look deeper into it.

Best regards,

babak-kananpour commented 2 years ago

Hi @xzyaoi, I still have the same problem with the new version. In the demo notebook, the error appears when the line importances = importance.fit(X_train_dirty, y_train_dirty).score(X_test, y_test) runs. To work around it, I used the pure-Python function compute_all_importances in datascope/importance/shapley instead of the Cython version compute_all_importances_cy. The variables unit_distances, unit_utilities, and null_scores are floats, which is correct, but shapley_cy.pyx expects them to be integers.
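Falling back to the pure-Python implementation works; another hedged option is to normalize dtypes on the caller side before invoking the compiled routine. The helpers below are illustrative sketches, not part of the datascope API:

```python
import numpy as np

def as_native_int(a):
    # Cast to the platform default integer (np.int_), which is what
    # a Cython buffer typed as np.int_t expects on every platform.
    return np.ascontiguousarray(a, dtype=np.int_)

def as_float64(a):
    # Cast to float64, the dtype a Cython "double" buffer expects.
    return np.ascontiguousarray(a, dtype=np.float64)

# Hypothetical call-site patch (argument names taken from the
# traceback above):
# all_importances = compute_all_importances_cy(
#     as_float64(unit_distances), as_float64(unit_utilities),
#     as_float64(null_scores))
print(as_native_int([1, 2, 3]).dtype, as_float64([1, 2]).dtype)
```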

xzyaoi commented 2 years ago

@babak-1990 Interesting, I still cannot reproduce this error, even with a freshly created Colab environment (see https://colab.research.google.com/drive/1RdArqm0ZpYR_Tq5rKMDu8U7KxsgNEhNl#scrollTo=8b974636-7c3e-4b82-8401-ff541a47a002).

I now suspect this is due to your local compiler, which may treat np.int differently (are you on Windows or macOS?). I have found a possible solution: https://github.com/eragonruan/text-detection-ctpn/issues/380

However, I don't have a Windows PC at hand. Could you please try changing np.int to np.int64 (or np.int32, if np.int64 does not work out) on this line: https://github.com/easeml/datascope/blob/main/datascope/importance/shapley_cy.pyx#L30? After re-compiling, it should work.
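To decide between np.int64 and np.int32 before re-compiling, one can check the size of the platform's C long (the type behind Cython's np.int_t). A small diagnostic sketch:

```python
import numpy as np

# 'l' is NumPy's character code for C long: 4 bytes under MSVC on
# Windows, 8 bytes on Linux/macOS. The matching fixed-width dtype
# is the one to hard-code in the .pyx buffer declaration.
long_size = np.dtype('l').itemsize
match = np.int32 if long_size == 4 else np.int64
print(f"C long is {long_size} bytes -> use {np.dtype(match)}")
```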

If it works, please let me know so I can release a stable fix. If it doesn't, please also feel free to reach out!

Best regards, Xiaozhe

babak-kananpour commented 2 years ago

Hi @xzyaoi, I had already read that potential solution and tried assigning a different DTYPE, but it didn't work out. My local machine runs Windows, so you are right that this is due to my local compiler. Changing this line https://github.com/easeml/datascope/blob/main/datascope/importance/shapley_cy.pyx#L30 won't fix the problem, because the error happens before entering the function compute_all_importances_cy in /datascope/importance/shapley_cy.pyx; I gave it a try anyway to be sure.

At the time, I thought that changing lines https://github.com/easeml/datascope/blob/8a8e397686df318e2e3e5c32b30ba4b80244c522/datascope/importance/shapley_cy.pyx?rgh-link-date=2022-06-17T09%3A39%3A48Z#L11 and https://github.com/easeml/datascope/blob/8a8e397686df318e2e3e5c32b30ba4b80244c522/datascope/importance/shapley_cy.pyx?rgh-link-date=2022-06-17T09%3A39%3A48Z#L12 would fix the problem, but it didn't.

xzyaoi commented 2 years ago

@babak-1990, thanks for your reply! I can reproduce this error with a Windows setup. I am working on a fix, which should land soon, and will update here when I make progress :)