Closed tomglk closed 12 months ago
Merging #836 (ac43508) into master (4a1b4f7) will decrease coverage by
0.07%
. The diff coverage is100.00%
.
Hi @tomglk, thank you for contributing this.
Are you able to share why you need to serialise/deserialise dtype='object'
ndarray's here? We'd be interested to know since this isn't a use case we had envisioned (although that doesn't mean it isn't valid!), and want to be sure there is strong motivation for allow_pickle=True
due to the associated CVE-2019-644.
Hi @ascillitoe , I know that allow_pickle = True has bad security implications. I would welcome any other solution.
We store strings, that seems to be the problem.
This is a minimal example that breaks:
data = pd.DataFrame(
columns=["c1", "c2", "c3"],
data=[
["42", 1, datetime.now().timestamp()],
["42", 2, datetime.now().timestamp()],
["42", 3, datetime.now().timestamp()],
]
)
detector = TabularDrift(
x_ref=data.to_numpy(),
p_val=0.05,
x_ref_preprocessed=True
)
save_detector(
detector,
"detector"
)
loaded_detector = load_detector(
"detector"
)
Changing "42" to 42 fixes it.
It also breaks when you remove the .timestamp()
, but storing a timestamp instead of a datetime object is not a problem imo.
Edit:
Setting the c1 column as category-type & adding the param categories_per_feature={0: None}
also does not change the behaviour.
New commit does not set the value to true all the time (because that is unnecessary, if you do not have the problem). I see that the first commit was overly aggressive in that regard.
Instead, the value can now be controlled via parameter.
Intended usage:
loaded_detector = load_detector(
"detector",
enable_unsafe_loading=True
)
This way users who actually have the problem do not have to hack the np.load()-function themselves. But they are forced to make a conscious and informed decision (CVE you mentioned is linked in docstrings).
Tests: I did not touch the tests, because I cannot get the project running locally. I can try to make it work next week. Or look into test cases without running anything, but that would be mildly horrible.
Hi @ascillitoe,
I added some tests for load_detector(). Also demonstrating that the problem occurs and can be prevented with the new parameter.
What do you think? Is there anything else that is missing or could the code be merged like this?
@tomglk apologies for the late reply. We were initially hesitant about this change as it essentially means implicitly supporting data types in numpy
arrays that we have been cautious to support. We would like to at some point rethink the data formats that we officially support (with examples, docs and everything), but that's currently not a priority. Since this unblocks your use case we're happy to merge it (will likely have a patch release out within a couple of weeks). Thanks for your contribution!
Hi, we had had problems loading a detector which we stored using save_detector.
When we want to load it using load_detector, we get the error
value error: 'Object arrays cannot be loaded when allow_pickle=False'
Currently we apply a hack of overwriting
np.load
with the parameterallow_pickle=True
in order to be able to use the loading mechanism. (https://stackoverflow.com/a/56243777)However, we think that it makes sense to set this param in the loading function itself.
If you have any questions, feel free to ask. Any tips on how to solve this issue with other ways are also more than welcome. :)
Thanks, Tobi & Tom
Co-authored-by: tokaessm