cpa-analytics / embedding-encoder

Scikit-Learn compatible transformer that turns categorical variables into dense entity embeddings.
MIT License

Not clear how to save and load the trained embedding model #22

Closed KyleeValencia closed 2 years ago

KyleeValencia commented 2 years ago

Hello, I want to ask how to save the model to disk and load it again later. Do I use mapping_to_json for this, or do I need to save it like any other Keras model, where the model is stored in ee._model? And how do I reload it?

Beginners like me have a hard time understanding the documentation. I hope instructions and a code example for saving and reloading the model can be added to the documentation, since this is pretty crucial for users like me.

Thank you and keep up the good work :>

rxavier commented 2 years ago

Hi @KyleeValencia. Basically, on initialization you set mapping_path to a JSON file (it doesn't need to exist yet, just as when you call to_csv() in pandas). For your use case, set pretrained=False: EE will run the encoding process as normal and save the encoding lookup table to mapping_path.

If pretrained=True, EE will not train anything; it will look for the JSON file at the path you specified and load it for transformations. If there's nothing there, it'll throw an error. This is what you do after you've trained at least once and saved the JSON.
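
To make the two modes concrete, here is a minimal sketch of the save-then-reload pattern described above. ToyLookupEncoder is a hypothetical stand-in written only for illustration; it is not the embedding-encoder implementation, and its "training" step is a placeholder. Only the mapping_path and pretrained names come from the thread.

```python
import json
import os
import tempfile


class ToyLookupEncoder:
    """Hypothetical stand-in for illustration; NOT the embedding-encoder code."""

    def __init__(self, mapping_path, pretrained=False):
        self.mapping_path = mapping_path
        self.pretrained = pretrained
        self.mapping_ = None

    def fit(self, values):
        if self.pretrained:
            # Nothing is trained: the saved lookup table must already exist.
            # open() raises FileNotFoundError if the JSON file is missing.
            with open(self.mapping_path) as fh:
                self.mapping_ = json.load(fh)
        else:
            # Placeholder for training: give each category a 1-d "embedding",
            # then write the lookup table to mapping_path.
            categories = sorted(set(values))
            self.mapping_ = {cat: [float(i)] for i, cat in enumerate(categories)}
            with open(self.mapping_path, "w") as fh:
                json.dump(self.mapping_, fh)
        return self

    def transform(self, values):
        if self.mapping_ is None:
            raise RuntimeError("call fit() before transform()")
        return [self.mapping_[v] for v in values]


# First run: train and save the lookup table.
path = os.path.join(tempfile.mkdtemp(), "mapping.json")
trained = ToyLookupEncoder(path, pretrained=False).fit(["a", "b", "a"])

# Later run: no training, the table is read back from the JSON file.
reloaded = ToyLookupEncoder(path, pretrained=True).fit(["a", "b", "a"])
```

In this sketch both encoders return the same vectors for the same categories, because the second one only reads what the first one wrote.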

KyleeValencia commented 2 years ago

@rxavier I tried to transform the data with a pretrained EE and it gives me an error like this: image

This is my code:

from embedding_encoder import EmbeddingEncoder

# Names of the categorical (object dtype) columns that need to be embedded
cat_cols = list(X_train.select_dtypes(include='object').columns)

#Embedder Initialization
Embedding_Categorical = EmbeddingEncoder(task='classification',
                                         keep_model=True, 
                                         mapping_path="./Embed_TF_Mushromm_Categorical_Data.json")

# Fitting EE
Embedding_Categorical.fit(X_train[cat_cols],Y_train)

# Test to transform
testTransform = Embedding_Categorical.transform(X_train[cat_cols])

# Test to load model from json file
testLoad = EmbeddingEncoder(task = 'classification', 
                            pretrained = True, 
                            mapping_path='./Embed_TF_Mushromm_Categorical_Data.json')

# Test to transform data from loaded EE model
testTransform_tf = testLoad.transform(X_train[cat_cols])

rxavier commented 2 years ago

You need to call fit() first, even when pretrained=True. scikit-learn always calls fit(), so the encoder expects that call; otherwise it wouldn't work in Pipelines, for example.
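
Applied to the code in the question, the fix is one extra fit() call on the reloaded encoder before transform(). The sketch below wraps that in a hypothetical helper, load_and_transform, which takes the encoder class as an argument so that the block does not depend on how the package is imported. It uses only the constructor arguments shown in this thread, and it assumes that fit() accepts the same X and y as in the original training call.

```python
def load_and_transform(encoder_cls, X, y, mapping_path):
    """Rebuild an encoder from a saved JSON mapping and transform X.

    encoder_cls is expected to be EmbeddingEncoder. The helper name and
    signature are illustrative and not part of the library.
    """
    encoder = encoder_cls(
        task="classification",
        pretrained=True,
        mapping_path=mapping_path,
    )
    # With pretrained=True, fit() is described above as loading the saved
    # JSON rather than training, but it still has to be called.
    encoder.fit(X, y)
    return encoder.transform(X)
```

With the variables from the question, the call would look like load_and_transform(EmbeddingEncoder, X_train[cat_cols], Y_train, "./Embed_TF_Mushromm_Categorical_Data.json").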

KyleeValencia commented 2 years ago

@rxavier It works! Thank you for the guidance 👍