eliorc / node2vec

Implementation of the node2vec algorithm.
MIT License

Train vs Inference methods #107

Open priamai opened 6 months ago

priamai commented 6 months ago

Hello there, what is the correct way to separate training from inference?

Is this correct? I run the training first and save the embeddings, then load a new graph and query the most similar nodes?

    args = parser.parse_args()

    if args.method == "train":

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

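        # Save embeddings for later use (EMBEDDING_FILENAME is assumed to be defined elsewhere)
        model.wv.save_word2vec_format(EMBEDDING_FILENAME)

        # Save model for later use (EMBEDDING_MODEL_FILENAME is assumed to be defined elsewhere)
        model.save(EMBEDDING_MODEL_FILENAME)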

    if args.method == "test":
        # now load
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)  # Use temp_folder for big graphs

        model = node2vec.fit(window=10, min_count=1, batch_words=4)

        # do some checks

        # Look for most similar nodes
        sim_nodes = model.wv.most_similar('alert--440375ba-c4af-4964-be1e-c6f9906416ff')  # Output node names are always strings

        for node, _ in sim_nodes:
            print(node)

eliorc commented 6 months ago

No, I wouldn't go this way

Training is okay, but for testing you do not need Node2Vec. The algorithm outputs embeddings in a known format; once you have created them, you don't need the algorithm again.

So just use

    from gensim.models import KeyedVectors

    space = KeyedVectors.load_word2vec_format(EMBEDDING_FILENAME)

Then, to look up vectors, see the gensim docs.

priamai commented 6 months ago

Thanks for the reference. Following your suggestion, is this a valid approach? Does it make sense to save both the word (.emb) file and the model file, or should I keep only the model file? And why do the edges fail to load with an error (see the last line)?

    NODE_WORD_FILENAME = "word2vec.emb"
    NODE_MODEL_FILENAME = "word2vec.model"
    EDGES_WORD_FILENAME = "edges2vec.emb"

    if args.method == "train":

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

        # Save embeddings for later use
        model.wv.save_word2vec_format(NODE_WORD_FILENAME)

        # Save model for later use
        model.save(NODE_MODEL_FILENAME)

        edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

        # Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
        edges_kv = edges_embs.as_keyed_vectors()

        # Save embeddings for later use
        edges_kv.save_word2vec_format(EDGES_WORD_FILENAME)

    if args.method == "test":
        import re
        from gensim.models import KeyedVectors, Word2Vec

        model = Word2Vec.load(NODE_MODEL_FILENAME)
        # this generates an error: could not convert string to float
        edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

priamai commented 6 months ago

Last error:

  File "/home/robomotic/DevOps/gitlab/ava-prod-ai/venv/lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 1980, in <listcomp>
    word, weights = parts[0], [datatype(x) for x in parts[1:]]

eliorc commented 6 months ago

Which line fails? The edges_kv = line or the model = line?

priamai commented 6 months ago

Yes, it is the keyed vectors one:

    # this generates an error: could not convert string to float
    edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

eliorc commented 6 months ago

I can see why this happens: these are edge embeddings.
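One plausible explanation, sketched under an assumption: the edge keys appear to be stringified node pairs such as `('u', 'v')`, which contain a space, while the word2vec text format separates the key from the floats with spaces. The parser below is a simplified, hypothetical imitation of that format (not gensim's actual code), and the sample lines are invented.

```python
def parse_word2vec_text_line(line: str):
    """Simplified reader: the first space-separated field is the key, the rest are floats."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]


node_line = "alert--1 0.25 0.5"                    # node key without spaces
edge_line = "('alert--1', 'alert--2') 0.25 0.5"    # hypothetical edge key with a space

# A node line parses cleanly.
node_key, node_vec = parse_word2vec_text_line(node_line)

# An edge line is split in the middle of its key, so a key fragment reaches float().
try:
    parse_word2vec_text_line(edge_line)
    edge_error = None
except ValueError as exc:
    edge_error = str(exc)
```

If that assumption holds, the file is written without complaint and only fails on load, which matches the traceback pointing at the `datatype(x)` conversion.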

If you want to use edge embeddings, why not do it this way:

    node_embeddings = KeyedVectors.load_word2vec_format(NODE_WORD_FILENAME)
    edges_embs = HadamardEmbedder(keyed_vectors=node_embeddings)

    # Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
    edges_kv = edges_embs.as_keyed_vectors()
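Since the Hadamard operator is an element-wise product of the two endpoint vectors, a single edge can also be embedded on demand instead of materialising every pair. The following is a minimal sketch of that arithmetic only; the dict of toy vectors and the helper name `hadamard_edge_embedding` are hypothetical and stand in for the loaded node embeddings.

```python
from typing import Dict, List, Tuple


def hadamard_edge_embedding(
    node_vectors: Dict[str, List[float]], edge: Tuple[str, str]
) -> List[float]:
    """Return the element-wise product of the two endpoint vectors."""
    u, v = edge
    return [a * b for a, b in zip(node_vectors[u], node_vectors[v])]


# Toy stand-in for loaded node embeddings (made-up values).
node_vectors = {
    "n1": [1.0, 2.0, 3.0],
    "n2": [0.5, 0.5, 2.0],
}

edge_vec = hadamard_edge_embedding(node_vectors, ("n1", "n2"))
reverse_vec = hadamard_edge_embedding(node_vectors, ("n2", "n1"))
```

Because the product is symmetric, the order of the two nodes does not matter. This also means only the node embeddings need to be saved; edge vectors can be recomputed after loading.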