ColtAllen / btyd

Buy Till You Die and Customer Lifetime Value statistical models in Python.
https://btyd.readthedocs.io/
Apache License 2.0

Persist Full InferenceData object as JSON #29

Closed by ColtAllen 2 years ago

ColtAllen commented 2 years ago

This is a fairly straightforward task that will go a long way towards improving model functionality and maintainability of the code base.

Modules: lifetimes.models.__init__.BaseModel class object

Issue: An ArviZ InferenceData object is created as a model attribute whenever model.fit() is called. Currently, model persistence entails extracting the model parameters from this attribute and dumping them into a memory-optimized JSON file. However, once this JSON file is loaded back into a model, ArviZ plotting and statistical functions are no longer supported, because the full InferenceData object is not restored. The pre/post-processing code that formats this JSON also adds unnecessary complexity to the BaseModel class and could make future maintenance more difficult. Plus, let's be honest: this isn't a 350 GB NLP model, and reducing a <10 MB InferenceData object down to a <4 MB JSON is not worth the hassle.

Work Summary: Replace the JSON formatting code in _unload_params(), fit(), save_params() and load_params() with ArviZ methods like arviz.InferenceData.to_json() and arviz.from_json().

https://arviz-devs.github.io/arviz/api/data.html

remove_hypers can also be removed as a model class attribute, and I'm not opposed to renaming save_params() and load_params() to save_model() and load_model() either.
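As a rough sketch of the replacement described above, the round trip could look like the following. This assumes an ArviZ version that provides InferenceData.to_json() and arviz.from_json(). The save_model() and load_model() functions here are standalone illustrations, not the actual BaseModel methods, and the parameter names and random draws are placeholders for what model.fit() would produce.

```python
import os
import tempfile

import arviz as az
import numpy as np

# Stand-in for the InferenceData that model.fit() attaches to the model.
# Parameter names and values are illustrative only.
rng = np.random.default_rng(0)
idata = az.from_dict(
    posterior={
        "alpha": rng.normal(size=(2, 100)),  # shape is (chain, draw)
        "r": rng.normal(size=(2, 100)),
    }
)


def save_model(idata, path):
    """Hypothetical helper: persist the full InferenceData object as JSON."""
    idata.to_json(path)


def load_model(path):
    """Hypothetical helper: rebuild the InferenceData object from JSON."""
    return az.from_json(path)


path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(idata, path)
restored = load_model(path)

# Because the whole object is restored, ArviZ functions still work on it.
summary = az.summary(restored)
```

With this approach there is no custom pre/post-processing step to maintain, since ArviZ handles the serialization in both directions.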

Other Comments: JSON is the preferred format for model persistence. Pickle files have their place for the fast reads/writes demanded by online learning and for passing objects between CPU threads, but the added complexity of their implementation just isn't worth it for a model that is only saved and loaded one time. They are also a security risk, since malware can be hidden inside a pickle file. I could totally see a hacker with prior system access overwriting a .pkl model file with an executable that exfiltrates customer IDs whenever the model is run.
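To make the security point concrete, here is a minimal, harmless sketch using only the standard library. The Payload class is a hypothetical stand-in for a tampered .pkl file; a real attack would swap the harmless expression for something malicious.

```python
import json
import pickle


class Payload:
    """Hypothetical stand-in for the contents of a tampered .pkl file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it on load,
        # so whoever wrote the file chooses what code runs.
        return (eval, ("'code ran: ' + str(6 * 7)",))


tampered_bytes = pickle.dumps(Payload())

# Loading does not return a Payload; it returns whatever the callable produced.
pickle_result = pickle.loads(tampered_bytes)

# Parsing JSON only ever yields plain dicts, lists, strings, numbers,
# booleans and None. No callable from the file is invoked.
json_result = json.loads('{"posterior": {"r": [0.5, 0.7]}}')
```

Here pickle_result is the string built by the embedded expression, which shows that loading a pickle ran code chosen by the file's author. The JSON parse just returns a nested dict.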