Closed imatheussm closed 2 years ago
Thanks for reporting this.
I will look into making Ivis models serializable using pickle; it sounds like a useful feature if we can do it. Hopefully that will solve the issues you're encountering.
I was coding a solution here, which consisted of renaming the currently defined Ivis.__getstate__() to Ivis._get_json() and creating new Ivis.__getstate__() and Ivis.__setstate__() methods. The solution consisted of the following:
In Ivis.__getstate__(), the internal dictionary state would be obtained. However, for the encoder_, model_, neighbour_matrix_, and supervised_model_ attributes, the following procedure would be adopted: .save() would be called for each of them, generating an annoy.index file for neighbour_matrix_ and .h5 files for the rest. Then, using the built-in base64 module, these files would be encoded into Base64 strings, which would be assigned to the respective attributes.
In Ivis.__setstate__(), the same four attributes described above would be checked. If present, base64 would be used to decode them back into the binary files they originally were. Then, those files would be used to load the original objects back into their respective attributes.
However, I ran into a problem: when calling Ivis.neighbour_matrix_.save(<path>), or even Ivis.neighbour_matrix_.index.save(<path>), no file was produced, regardless of path type (absolute or relative) or file name. I am currently assuming this is an annoy-related issue.
I am just posting this in case this reasoning is helpful to you.
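For readers following along, the __getstate__()/__setstate__() scheme described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the Ivis implementation: FileBackedIndex, Estimator, index_, _to_base64 and _from_base64 are hypothetical names, and FileBackedIndex merely stands in for an attribute (such as a Keras model or an Annoy index) that can only persist itself through save(path) and load(path).

```python
import base64
import os
import pickle
import tempfile


class FileBackedIndex:
    """Hypothetical stand-in for an attribute that only persists via files."""

    def __init__(self, values=None):
        self.values = list(values or [])

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(",".join(str(v) for v in self.values))

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as fh:
            text = fh.read()
        return cls(int(v) for v in text.split(",") if v)

    def __reduce__(self):
        # Simulate an object that pickle cannot handle directly.
        raise TypeError("FileBackedIndex cannot be pickled directly")


def _to_base64(obj):
    """Save obj to a temporary file and return its bytes as a Base64 string."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.bin")
        obj.save(path)
        with open(path, "rb") as fh:
            return base64.b64encode(fh.read()).decode("ascii")


def _from_base64(encoded):
    """Decode a Base64 string to a temporary file and load the object from it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.bin")
        with open(path, "wb") as fh:
            fh.write(base64.b64decode(encoded))
        return FileBackedIndex.load(path)


class Estimator:
    """Hypothetical estimator holding one file-backed attribute."""

    FILE_BACKED = ("index_",)

    def __init__(self, n_neighbours=3):
        self.n_neighbours = n_neighbours

    def fit(self, data):
        self.index_ = FileBackedIndex(data)
        return self

    def __getstate__(self):
        state = self.__dict__.copy()
        for name in self.FILE_BACKED:
            if name in state:  # only present after fit()
                state[name] = _to_base64(state[name])
        return state

    def __setstate__(self, state):
        state = dict(state)
        for name in self.FILE_BACKED:
            if name in state:
                state[name] = _from_base64(state[name])
        self.__dict__.update(state)


fitted = Estimator().fit([1, 2, 3])
restored = pickle.loads(pickle.dumps(fitted))
print(restored.index_.values)
```

One caveat of this sketch: the temporary file is deleted as soon as loading returns, so a real index that keeps its file open or memory-mapped after loading would need the file to outlive that call.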
The issue
A fitted Ivis instance is not adequately preserved when joblib.dump() is used to save it. Consequently, when Ivis is used as part of a sklearn.pipeline.Pipeline object with memory != None, errors occur.
Minimal reproducible examples
Two examples are provided herein: one with sklearn.pipeline.Pipeline, and another with joblib only (sklearn uses joblib in sklearn.pipeline.Pipeline, so I thought this second example could help).
Environment
A virtual environment was created specifically for this project, wherein all modules specified in requirements.txt were installed. My setup runs an up-to-date version of Windows 10 (no WSL).
Runtime
Relevant modules
Example with sklearn.pipeline.Pipeline
Script
Log with errors
Example without sklearn.pipeline.Pipeline
Script
Log with errors
Discussion
As seen in the example with sklearn.pipeline.Pipeline and sklearn.model_selection.GridSearchCV, everything runs smoothly when Ivis is fitted the first time for all folds. When the model is cached and retrieved for the subsequent runs, however, errors happen because at least Ivis.encoder is missing. Upon experimentation, it was found that even after loading Ivis.encoder, errors happened with the reloaded model, indicating that other important attributes were not properly pickled.
Although I never tested such functions, it seems that saving and loading capabilities were already developed for Ivis in Ivis.save_model() and Ivis.load_model(). However, to ensure that Ivis is pickleable, it would be ideal to transfer and adapt this functionality to Ivis.__getstate__() and Ivis.__setstate__() (the latter of which does not exist AFAIK) so that pickle and joblib know how to pickle an Ivis instance. This would enable its use in Pipeline objects with memory != None, thus significantly speeding up the hyper-parameter fine-tuning process performed by GridSearchCV.
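To make the failure mode discussed above concrete, here is a toy illustration that uses only the standard pickle module, on which joblib builds. It assumes, purely for illustration, an estimator whose __getstate__() returns only its hyper-parameters and which defines no __setstate__(); whether this matches the actual Ivis code is not confirmed here. LossyEstimator, k and encoder_ are hypothetical names, and this is not the script from the report.

```python
import pickle


class LossyEstimator:
    """Hypothetical estimator whose __getstate__ drops fitted attributes."""

    def __init__(self, k=15):
        self.k = k

    def fit(self, data):
        # Stands in for a fitted model that is expensive to rebuild.
        self.encoder_ = [x * 2 for x in data]
        return self

    def __getstate__(self):
        # Only the constructor parameter survives pickling.
        return {"k": self.k}


fitted = LossyEstimator().fit([1, 2, 3])
restored = pickle.loads(pickle.dumps(fitted))

# The original still has its fitted attribute; the restored copy does not.
print(hasattr(fitted, "encoder_"), hasattr(restored, "encoder_"))
```

A cache that stores fitted estimators by pickling them would hand back objects like restored, which is consistent with the first fit succeeding and later cached runs failing.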