eabdullin / Word2Vec.Net

implementation Word2Vec for .Net framework

How do I find similar words? #6

Closed mongrel73 closed 8 years ago

mongrel73 commented 8 years ago

Your readme explains how to use input data to create vectors and write them to a txt file.

        var word2Vec = Word2VecBuilder.Create()
            .WithTrainFile("trainingFile.txt")
            .WithOutputFile("outputFile.txt")
            .Build();

        word2Vec.TrainModel();

This works - at least, outputFile.txt is created and it seems to be full of vectors - but I now want to use "outputFile.txt" to find words similar to "Texas". How do I do that?

The full program I'm trying is pasted below. Using distance.Search gives me an empty array, and using analogy.Search gives me results, but the "Word" property on each "BestWord" is a number followed by null characters:

        const string inputFile = @"C:\temp\word2vec\data.txt";
        const string outputFile = @"C:\temp\word2vec\output.txt";

        var word2vec = Word2VecBuilder.Create()
            .WithTrainFile(inputFile)
            .WithOutputFile(outputFile)
            .Build();

        word2vec.TrainModel();

        var distance = new Distance(outputFile);
        BestWord[] bestwords = distance.Search("Texas");

        var analogy = new WordAnalogy(outputFile);
        bestwords = analogy.Search("Texas");

Is there a simple way to input "Texas", and output ["Arizona", "Oklahoma", "Kansas"] etc.?

eabdullin commented 8 years ago

Hi, you have to add one more parameter to word2vec: 'binary'

        var word2vec = Word2VecBuilder.Create()
            .WithTrainFile(inputFile)
            .WithOutputFile(outputFile)
            .WithBinary(1)
            .Build();

then use

        var distance = new Distance(outputFile);
        BestWord[] bestwords = distance.Search("Texas");

'WordAnalogy' is for searching word analogies, e.g. 'usa washington russia' -> moscow
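To make the difference between the two searches concrete, here is a minimal conceptual sketch of how word2vec-style tools usually answer them: a similarity search ranks words by cosine similarity to the query vector, and an analogy search ranks them against the vector `b - a + c`. This is not the Word2Vec.Net implementation. `VectorSearchSketch`, `Cosine`, `Nearest` and `Analogy` are hypothetical names, and the vectors are assumed to be already loaded into a dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Word2Vec.Net.
static class VectorSearchSketch
{
    // Cosine similarity between two non-zero vectors of equal length.
    public static double Cosine(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns up to `count` words whose vectors are most similar to `query`,
    // skipping any word listed in `exclude`.
    public static List<string> Nearest(
        IDictionary<string, double[]> vectors,
        double[] query,
        ISet<string> exclude,
        int count)
    {
        return vectors
            .Where(kv => !exclude.Contains(kv.Key))
            .OrderByDescending(kv => Cosine(kv.Value, query))
            .Take(count)
            .Select(kv => kv.Key)
            .ToList();
    }

    // Analogy "a is to b as c is to ?": search near the vector b - a + c,
    // excluding the three input words from the results.
    public static List<string> Analogy(
        IDictionary<string, double[]> vectors,
        string a, string b, string c, int count)
    {
        var target = new double[vectors[a].Length];
        for (int i = 0; i < target.Length; i++)
        {
            target[i] = vectors[b][i] - vectors[a][i] + vectors[c][i];
        }
        return Nearest(vectors, target, new HashSet<string> { a, b, c }, count);
    }
}
```

With this sketch, 'usa washington russia' corresponds to `Analogy(vectors, "usa", "washington", "russia", 10)`, while a plain similarity search for one word is `Nearest` called with that word's own vector and the word itself excluded.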

P.S. Make sure that you have enough data; 2 million words or more of text is preferred.

P.P.S. I recommend converting all text to lowercase, because "Texas" and "texas" will be different tokens.
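One way to do the lowercase conversion is to preprocess the training file before passing it to WithTrainFile. A minimal sketch using only System.IO; `CorpusPrep` and `LowercaseFile` are hypothetical names, not part of Word2Vec.Net:

```csharp
using System.IO;

// Hypothetical helper for illustration only; not part of Word2Vec.Net.
static class CorpusPrep
{
    // Copies inputPath to outputPath with every line lowercased.
    // Streams line by line so a large corpus does not have to fit in memory.
    public static void LowercaseFile(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line.ToLowerInvariant());
            }
        }
    }
}
```

If the corpus is lowercased this way, the query has to be lowercased too, e.g. distance.Search("texas") rather than distance.Search("Texas").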

eabdullin commented 8 years ago

@mongrel73, on the main page I've described all of the word2vec parameters in more detail. You can configure word2vec for your own task. E.g. the size of the word vectors (the number of features per word) may be useful for you.