SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/
Other
2.21k stars 220 forks source link

Allow pickle when loading numpy array file #836

Closed tomglk closed 12 months ago

tomglk commented 1 year ago

Hi, we had had problems loading a detector which we stored using save_detector.

When we want to load it using load_detector, we get the error value error: 'Object arrays cannot be loaded when allow_pickle=False'

Currently we apply a hack of overwriting np.load with the parameter allow_pickle=True in order to be able to use the loading mechanism. (https://stackoverflow.com/a/56243777)

However, we think that it makes sense to set this param in the loading function itself.

If you have any questions, feel free to ask. Any tips on how to solve this issue with other ways are also more than welcome. :)

Thanks, Tobi & Tom

Co-authored-by: tokaessm

codecov[bot] commented 1 year ago

Codecov Report

Merging #836 (ac43508) into master (4a1b4f7) will decrease coverage by 0.07%. The diff coverage is 100.00%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/SeldonIO/alibi-detect/pull/836/graphs/tree.svg?width=650&height=150&src=pr&token=gwntCwhaGT&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=SeldonIO)](https://app.codecov.io/gh/SeldonIO/alibi-detect/pull/836?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=SeldonIO) ```diff @@ Coverage Diff @@ ## master #836 +/- ## ========================================== - Coverage 81.98% 81.91% -0.07% ========================================== Files 159 159 Lines 10375 10375 ========================================== - Hits 8506 8499 -7 - Misses 1869 1876 +7 ``` | [Files](https://app.codecov.io/gh/SeldonIO/alibi-detect/pull/836?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=SeldonIO) | Coverage Δ | | |---|---|---| | [alibi\_detect/saving/loading.py](https://app.codecov.io/gh/SeldonIO/alibi-detect/pull/836?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=SeldonIO#diff-YWxpYmlfZGV0ZWN0L3NhdmluZy9sb2FkaW5nLnB5) | `93.92% <100.00%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/SeldonIO/alibi-detect/pull/836/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=SeldonIO)
ascillitoe commented 1 year ago

Hi @tomglk, thank you for contributing this.

Are you able to share why you need to serialise/deserialise dtype='object' ndarray's here? We'd be interested to know since this isn't a use case we had envisioned (although that doesn't mean it isn't valid!), and want to be sure there is strong motivation for allow_pickle=True due to the associated CVE-2019-644.

tomglk commented 1 year ago

Hi @ascillitoe , I know that allow_pickle = True has bad security implications. I would welcome any other solution.

We store strings, that seems to be the problem.

This is a minimal example that breaks:

data = pd.DataFrame(
    columns=["c1", "c2", "c3"],
    data=[
        ["42", 1, datetime.now().timestamp()],
        ["42", 2, datetime.now().timestamp()],
        ["42", 3, datetime.now().timestamp()],
    ]
)

detector = TabularDrift(
    x_ref=data.to_numpy(),
    p_val=0.05,
    x_ref_preprocessed=True
)

save_detector(
    detector,
    "detector"
)

loaded_detector = load_detector(
    "detector"
)

Changing "42" to 42 fixes it.

It also breaks when you remove the .timestamp(), but storing a timestamp instead of a datetime object is not a problem imo.

Edit: Setting the c1 column as category-type & adding the param categories_per_feature={0: None} also does not change the behaviour.

tomglk commented 1 year ago

New commit does not set the value to true all the time (because that is unnecessary, if you do not have the problem). I see that the first commit was overly aggressive in that regard.

Instead, the value can now be controlled via parameter.

Intended usage:

loaded_detector = load_detector(
    "detector",
    enable_unsafe_loading=True
)

This way users who actually have the problem do not have to hack the np.load()-function themselves. But they are forced to make a conscious and informed decision (CVE you mentioned is linked in docstrings).

Tests: I did not touch the tests, because I cannot get the project running locally. I can try to make it work next week. Or look into test cases without running anything, but that would be mildly horrible.

tomglk commented 1 year ago

Hi @ascillitoe,

I added some tests for load_detector(). Also demonstrating that the problem occurs and can be prevented with the new parameter.

What do you think? Is there anything else that is missing or could the code be merged like this?

jklaise commented 12 months ago

@tomglk apologies for the late reply. We were initially hesitant about this change as it essentially means implicitly supporting data types in numpy arrays that we have been cautious to support. We would like to at some point rethink the data formats that we officially support (with examples, docs and everything), but that's currently not a priority. Since this unblocks your use case we're happy to merge it (will likely have a patch release out within a couple of weeks). Thanks for your contribution!