VHRanger / nodevectors

Fastest network node embeddings in the west
MIT License

Load into W2V does not work #23

Closed MrPaulAlbert closed 3 years ago

MrPaulAlbert commented 3 years ago

Awesome work! Unfortunately, when I load my bin file, I get the following error message: ValueError: invalid vector on line 0 (is this really the text format?)

Any suggestions? There are spaces in the node names (e.g., 'Leonardo da Vinci').

VHRanger commented 3 years ago

How are you saving and loading this bin file?

What methods and code did you use to be exact?

Looking at your bin file, there are spaces in the node names, but they're not delimited by quotes, so it breaks the pseudo-CSV format that embedding model files normally use (the separator between the numbers on a line and the spaces in "The New York Times" is the same character).

Ideally you'd put underscores in multi-word entities (which is what the word2vec GoogleNews embeddings do), or use some separator other than the space that already separates the numbers in the vector.

You could just fix the trained model file you linked with a quick Python or shell script, replacing the spaces on each line before the first number with some other character.

VHRanger commented 3 years ago

Hey,

I don't see your notebook attached but I'd be greatly interested in what you're doing. I imagine you're building a knowledge graph embedding from wikipedia articles and the links between them?

Not completely sure how to access the bin file in order to modify my node names.

Your bin file is actually just a text file: the first line is the number of graph nodes and the embedding size, and each line after that is "node_name vec1 vec2 vec3 ..." where node_name is the name of your concept (United States and World War II are the first ones) and vecN is the Nth embedding value.

Doing head wheel_model.bin on your file looks like this:

4806237 32
United States 6.8033214 -4.154616 0.012066513 11.332104 -16.76169 -19.428492 -10.781682 1.6716479 -0.19667558 1.3256662 -3.5415244 2.237211 -13.055762 5.908908 -2.7512574 14.582257 -0.12124324 -10.849494 -16.312693 0.2916756 -0.026202707 -1.9240215 12.621503 5.048701 -4.752299 2.0447419 0.21070565 -3.0716078 -1.6428103 10.187764 10.904518 14.997075
United Kingdom -3.7103581 -9.126546 1.4116981 -0.056045603 -2.0448332 -1.1038564 -0.86103773 -6.0579333 -2.976625 12.728374 -5.2228265 -4.3223863 -5.1425834 5.964745 -8.878074 -13.255045 -3.061953 8.424931 -4.71807 -2.1219532 -19.40228 -11.125764 4.3466306 11.592513 -20.929165 -6.405772 -2.067156 20.383396 -0.61012983 -6.948416 8.447513 -8.711377
World War II -2.8781173 4.932038 -16.562336 -1.0972513 10.062222 -1.1481029 -13.783848 -0.47825798 -2.9046717 3.2946844 -4.153537 0.7279581 1.5258105 -0.54257464 -6.9199524 6.8763733 -20.364853 9.290325 3.9864638 -4.5167356 -0.2601528 3.492558 13.922298 -4.532118 -4.7888575 -21.872889 1.8821391 -2.4022622 9.867455 4.495968 29.433992 -6.4929194
Germany 0.6580372 -6.628668 -0.43459854 -1.6681336 -1.8274205 -6.296602 -9.535266 -7.501447 1.0744739 2.68418 21.059107 -8.0889635 10.379657 -9.315827 -2.6443145 -14.056111 -4.5304785 8.186242 2.2545314 11.87788 1.93783 -13.7836075 4.9945726 -2.8565195 10.113838 -22.338263 13.395137 -13.977517 5.5553727 -0.88845044 10.984225 -5.724566

This means an easy way to modify it is to iterate over the lines in Python. Here's a sketch of a quick script I would write, using the fact that your vectors have 32 dimensions to split each line:

with open('new_model.bin', 'w') as fout:
    with open('wheel_model.bin', 'r') as fin:
        lines = fin.readlines()
        fout.write(lines[0])  # header line: node count and embedding size
        lines = lines[1:]     # skip header line
        for l in lines:
            words = l.split(' ')
            vector = words[-32:]         # last 32 entries are the embedding values
            concept = words[:-32]        # everything before them is the node name
            concept = '_'.join(concept)  # join multi-word names with underscores
            vector = ' '.join(vector)
            new_line = concept + ' ' + vector
            fout.write(new_line)

There might be bugs in this script; it's just a quick sketch of what I would do. But it would "fix" your currently broken file.
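If you're then loading the fixed file back into gensim (my guess at what "load into W2V" means here, so treat the exact call as an assumption), something like this should work once the names have no spaces:

from gensim.models import KeyedVectors

# Load the fixed plain-text word2vec-format file
model = KeyedVectors.load_word2vec_format('new_model.bin', binary=False)

# Multi-word node names now use underscores instead of spaces
print(model['United_States'][:5])
print(model.most_similar('United_Kingdom', topn=5))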

Just to confirm: since my graph is directed, this constrains the nodevectors walks, correct? I'm assuming I want to use a directed graph to try to embed Wikipedia articles.

Correct, the random walks will only take steps in the direction of directed edges for directed graphs. This is true both if you used NetworkX or the CSRGraphs backend to load the graph.
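For reference, here's a minimal sketch of fitting on a directed NetworkX graph (the constructor arguments mirror the README example, and the tiny graph is just a placeholder):

import networkx as nx
from nodevectors import Node2Vec

# Toy directed graph: walks only follow the edge direction
G = nx.DiGraph()
G.add_edges_from([
    ('United States', 'World War II'),
    ('World War II', 'Germany'),
    ('Germany', 'United Kingdom'),
])

g2v = Node2Vec(n_components=32, walklen=10)
g2v.fit(G)

# Query a single node's embedding
print(g2v.predict('United States'))

# Save in the word2vec text format (the .bin file discussed above)
g2v.save_vectors('wheel_model.bin')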

On my first run, I just went with default settings. Any suggestions you might have on that?

Depends which model you're using.

If you're using Node2Vec, you should play with walk length and window size especially. Longer walk lengths and larger windows train slower but create "deeper" embeddings. Touching the return_weight and neighbor_weight will make training drastically slower for large graphs without gaining much performance, so I don't recommend it.
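For example, a rough sketch of that kind of tuning (walklen, epochs and the w2vparams dict follow the README's Node2Vec example, so double-check the names against your installed version):

from nodevectors import Node2Vec

# Longer walks and a larger skip-gram window: slower to train, "deeper" embeddings
g2v = Node2Vec(
    n_components=32,
    walklen=30,                # longer random walks
    epochs=20,                 # number of walks per node
    w2vparams={'window': 10},  # window size passed through to gensim Word2Vec
)
g2v.fit(G)  # G is your directed wikipedia-link graph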

You can also try other algorithms. GGVec (which is my creation) can be tried with order = 1 (faster, cruder) or order = 2 (much slower and much deeper), with the other parameters as recommended in the README. Another good one for large graphs is ProNE, whose hyperparameters don't change results much.
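A rough sketch of trying those (the fit/predict method names mirror the Node2Vec example above and are my assumption; see the README for the recommended GGVec parameters):

from nodevectors import GGVec, ProNE

# GGVec: order=1 is faster but cruder, order=2 is much slower but deeper
ggvec = GGVec(n_components=32, order=1)
ggvec.fit(G)

# ProNE: scales well to large graphs, hyperparameters rarely need tuning
prone = ProNE(n_components=32)
prone.fit(G)

# Query embeddings the same way as with Node2Vec
print(ggvec.predict('United States'))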

MrPaulAlbert commented 3 years ago

Fixed the bin file to avoid spaces in node names. Working great!

Congrats on this package. Both the Stanford SNAP C++ and Python node2vec implementations choked on this dataset after running for days. Nodevectors successfully completed the task in 18 hours.

VHRanger commented 3 years ago

Good to hear!

I'll close the issue.