entron / entity-embedding-rossmann


Cannot reproduce meaningful embedding #10

Closed vanduc103 closed 6 years ago

vanduc103 commented 6 years ago

Dear Entron, I downloaded the kaggle branch and trained with the test_models.py file (default options: 1 network, train_ratio = 0.97), but I cannot reproduce meaningful embeddings like yours. I attached the state embedding: all states are at roughly the same distance from each other. Can you tell me why this is? Thank you very much!

P/S: I used keras 1.2.2 and tensorflow r0.10 with GPU. I got "Result on validation data: 0.10472426564821177" and I saved the trained model with keras' model.save() (I could not save it with pickle.dump because of the error described in issue #9). state_embedding

entron commented 6 years ago

This is strange indeed. It seems the state embeddings were not learned. On the other hand, you got good validation results, so the state embeddings should have been learned. Is the saved data correct? Is the saved data interpreted correctly by the plotting functions?

vanduc103 commented 6 years ago

Thank you for your reply. I tried to do it as in your master branch by saving the embedding weights directly to a pickle file:

Save embedding:

    # Save the embedding weights, indexed by position in model.get_weights()
    import pickle

    saved_embeddings_fname = "embeddings.pickle"
    weights = self.model.get_weights()
    store_embedding = weights[0]
    dow_embedding = weights[1]
    year_embedding = weights[4]
    month_embedding = weights[5]
    day_embedding = weights[6]
    german_states_embedding = weights[20]
    with open(saved_embeddings_fname, 'wb') as f:
        pickle.dump([store_embedding, dow_embedding, year_embedding,
                     month_embedding, day_embedding, german_states_embedding], f, -1)

And in the embedding visualization, I loaded it as:

    with open("embeddings.pickle", 'rb') as f:
        [store_embedding, dow_embedding, year_embedding,
         month_embedding, day_embedding, german_states_embedding] = pickle.load(f)

But I still got the same result. How do you think the plotting function could be interpreting the embedding weights incorrectly?
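One way to rule out a serialization problem is a pickle round-trip on the same list-of-arrays structure (a minimal numpy sketch, independent of the repo's actual weight matrices; the array shapes here are only illustrative):

```python
import io
import pickle

import numpy as np

# Stand-ins for the saved embedding matrices, e.g. 12 states x 6 dims
# and 1115 stores x 10 dims (shapes are illustrative, not from the repo).
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(12, 6)), rng.normal(size=(1115, 10))]

# Round-trip through pickle, using an in-memory buffer instead of a file.
buf = io.BytesIO()
pickle.dump(embeddings, buf, -1)
buf.seek(0)
restored = pickle.load(buf)

# The arrays should survive the round-trip bit-for-bit; if they do,
# the save/load path is not the source of the flat-looking plot.
for original, loaded in zip(embeddings, restored):
    assert original.shape == loaded.shape
    assert np.array_equal(original, loaded)
print("pickle round-trip OK")
```

If the round-trip is exact, the remaining suspects are the weight indices used when extracting from `get_weights()` and the projection step itself.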

    tsne = manifold.TSNE(init='pca', random_state=0, method='exact')
    Y = tsne.fit_transform(german_states_embedding)
    states_names = ['Niedersachsen', 'Hamburg', 'Thueringen', 'RheinlandPfalz',
                    'SachsenAnhalt', 'BadenWuerttemberg', 'Sachsen', 'Berlin',
                    'Hessen', 'SchleswigHolstein', 'Bayern', 'NordrheinWestfalen']
    plt.figure(figsize=(8, 8))
    plt.scatter(-Y[:, 0], -Y[:, 1])
    for i, txt in enumerate(states_names):
        plt.annotate(txt, (-Y[i, 0], -Y[i, 1]))
    plt.savefig('state_embedding.png')
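Before trusting any 2-D picture, it can help to inspect the raw pairwise Euclidean distances between the embedding rows directly (a numpy sketch; the random matrix stands in for the loaded `german_states_embedding`). If those distances are genuinely near-uniform, the embedding itself was not learned and no projection method will rescue it:

```python
import numpy as np

# Stand-in for german_states_embedding (12 states, embedding dim 6).
rng = np.random.default_rng(42)
german_states_embedding = rng.normal(size=(12, 6))

# Pairwise Euclidean distances between all embedding rows.
diff = german_states_embedding[:, None, :] - german_states_embedding[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Look at the off-diagonal distances: a large spread between min and max
# means the embedding carries structure, even if a t-SNE plot looks flat.
off_diag = dist[~np.eye(len(dist), dtype=bool)]
print("min/mean/max distance:",
      off_diag.min(), off_diag.mean(), off_diag.max())
```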

Thank you very much !

entron commented 6 years ago

Have you made any changes to the code?

vanduc103 commented 6 years ago

Hi, I attach 2 files where I made some updates (mostly for the problem that the keras model cannot be dumped to a pickle file). I added comments at the places where I made changes. I would really appreciate it if you could help me figure out the problem. Thank you very much! changed_files.zip

vanduc103 commented 6 years ago

If possible, could you share your trained model or embedding weights file? Thank you very much!

entron commented 6 years ago

I have updated the code to use the newest keras. Could you try it and check whether you still have the problem? The branch is at: https://github.com/entron/entity-embedding-rossmann/tree/keras2

Edit: I have merged the keras2 branch into master.

vanduc103 commented 6 years ago

Hi Entron, thank you very much for helping me figure out this problem. I tried your keras2 branch and got the embedding file. At first the embedding visualization was still not meaningful, but after checking carefully by computing Euclidean distances between some of the original state_embedding points, I found that the problem may be the t-SNE algorithm, which may not work well on such a small dataset. So I switched to using PCA directly:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    Y = pca.fit_transform(german_states_embedding)

and finally I got a meaningful state_embedding result (see attached file). For the store embedding I still use t-SNE; my result is also attached. Could you tell me whether this looks reasonable? I want to thank you again for your kind help, and I find your research on categorical variables very interesting! state_embedding store_embedding
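The sklearn PCA projection used above can also be reproduced with a plain SVD, which makes explicit what the 2-D coordinates are (a numpy sketch; the random matrix stands in for the real `german_states_embedding`):

```python
import numpy as np

# Stand-in for the real state embedding (12 states, embedding dim 6).
rng = np.random.default_rng(0)
german_states_embedding = rng.normal(size=(12, 6))

# PCA by hand: center the rows, then project onto the top-2
# right singular vectors of the centered matrix.
centered = german_states_embedding - german_states_embedding.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
Y = centered @ Vt[:2].T  # 2-D coordinates, one row per state

# Unlike t-SNE, this is deterministic and well behaved on only 12 points.
print(Y.shape)  # → (12, 2)
```

This matches `PCA(n_components=2).fit_transform(...)` up to sign flips of the axes, which is why PCA gives a stable picture on a dozen points where t-SNE's stochastic optimization can mislead.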

entron commented 6 years ago

Looks good, glad it worked! Berlin and Hamburg are two city states which are different from the rest. The 3 states from former eastern Germany are on the very left side, which is also good. Btw, the store embedding 2D/3D visualizations (including those generated by myself) are not as meaningful as the German states embedding. I remember I figured out the reason one day after I wrote the paper, but I can't recall it clearly now. Maybe there is a minimum embedding dimension needed to preserve the metric well enough, and 2 is too small for the store metric space.