curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.
MIT License

How to get embedding matrix from StarSpace #60

Closed jcperinan closed 2 years ago

jcperinan commented 3 years ago

I would appreciate it if you could give an example of the code required to use StarSpace, particularly for mapping a bag of words to a bag of tags, as originally described in:

https://github.com/facebookresearch/StarSpace#tagspace-word--tag-embeddings

Indeed, I'm having trouble when using StarSpace...

Suppose that I have a TXT file where each line contains a set of words that are semantically related to a tag (with the prefix "__label__"), as in:

    decorate dress garnish adorn beautify embellish __label__decorate
    knife cutlery cutter eat silverware butcher carve __label__knife
    etc...

Here, one of my first questions is whether each line should have the following format:

    word1 word2 word3... [tab] __label__label1

Is this right? In my case, each line contains from 2 to 300 words and only one label.
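For reference, the StarSpace README linked above describes one example per line, with labels marked by the `__label__` prefix. Below is a minimal C# sketch of splitting such a line into words and labels, assuming tokens are separated by spaces or tabs. `TagSpaceFile` and `Parse` are hypothetical helper names used for illustration only; they are not part of Catalyst or StarSpace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagSpaceFile
{
    // Assumed label prefix, following the StarSpace README convention.
    public const string LabelPrefix = "__label__";

    // Splits one line on spaces and tabs, then separates label tokens
    // (prefix removed) from ordinary words.
    public static (List<string> Words, List<string> Labels) Parse(string line)
    {
        var tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        var words = tokens
            .Where(t => !t.StartsWith(LabelPrefix, StringComparison.Ordinal))
            .ToList();

        var labels = tokens
            .Where(t => t.StartsWith(LabelPrefix, StringComparison.Ordinal))
            .Select(t => t.Substring(LabelPrefix.Length))
            .ToList();

        return (words, labels);
    }
}
```

Calling `TagSpaceFile.Parse` on each line of the file gives the words and the label that would then have to be attached to each document by whatever method builds the `IDocument` objects.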

With respect to the code, the goal is to get the label-embedding matrix generated from the input file, i.e. we should be able to get the vector corresponding to each label. As we work with unigrams and we expect to have vectors of 100 dimensions, the initial code could be as follows:

    languages.registerLanguage("English");
    Pipeline nlp = await Pipeline.ForAsync(languages.English);

    IEnumerable<IDocument> docs = GetDocsFromSingleFile(file); // this method converts each line of the file into an IDocument object
    IEnumerable<IDocument> parsed = nlp.Process(docs);

    StarSpace ss = new StarSpace(languages.lang, 0, "starspace-model", StarSpace.ModelType.TagSpace);
    ss.Data.TrainWordEmbeddings = true;
    ss.Data.Dimensions = 100;
    ss.Data.WordNGrams = 1;
    ss.Data.InputType = "LabeledDocuments";
    ss.Train(parsed);
    ...

Is this code right? I have just tried it, and an exception is thrown while training the model:

Exception thrown: System.ArgumentOutOfRangeException: 'Specified argument was out of the range of valid values.'
Call Stack: Mosaik.Core.dll!Mosaik.Core.ThreadSafeFastRandom.ThrowMaxValueOutOfRange()
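The call stack shows a random-number helper rejecting its maximum value. One possible cause (an assumption, not something confirmed in this thread) is that a sampling range ended up empty, for example because no labels or no words reached the trainer. A quick check on the raw input file, independent of Catalyst, can help rule that out. The sketch below counts distinct labels, distinct words, and lines missing either one. `InputCheck` and `Summarize` are hypothetical names, and the `__label__` prefix is assumed from the StarSpace README.

```csharp
using System;
using System.Collections.Generic;

public static class InputCheck
{
    // Returns the number of distinct labels, the number of distinct words,
    // and the number of non-blank lines that lack a label or lack a word.
    public static (int Labels, int Words, int BadLines) Summarize(
        IEnumerable<string> lines, string labelPrefix = "__label__")
    {
        var labels = new HashSet<string>(StringComparer.Ordinal);
        var words = new HashSet<string>(StringComparer.Ordinal);
        int badLines = 0;

        foreach (var line in lines)
        {
            var tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length == 0) continue; // blank lines are skipped, not counted as bad

            int labelCount = 0, wordCount = 0;
            foreach (var token in tokens)
            {
                if (token.StartsWith(labelPrefix, StringComparison.Ordinal))
                {
                    labels.Add(token);
                    labelCount++;
                }
                else
                {
                    words.Add(token);
                    wordCount++;
                }
            }

            if (labelCount == 0 || wordCount == 0) badLines++;
        }

        return (labels.Count, words.Count, badLines);
    }
}
```

For a file on disk, `InputCheck.Summarize(System.IO.File.ReadLines(file))` would give the three counts. A zero label count or a nonzero bad-line count would suggest looking at how the documents and their labels are built before training.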

By the way, how can I get the label-embedding matrix after training the model?

Thank you, and congratulations on your work.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.