facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.

Segmentation Fault when training after "initFromTsv" #297

Open SvenAG opened 4 years ago

SvenAG commented 4 years ago

Hi,

First of all, thank you for this great project - my colleagues and I love using StarSpace.

I recently pretrained a FastText model and then converted it to the TSV format (no header line and whitespace separation between words and vectors). I wrote a script to add randomly initialized label vectors at the end of the TSV.
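Roughly, the conversion script does the following (a simplified sketch, not my exact code; the file names, labels, and dimension below are placeholders, and I assume the default `__label__` prefix):

```python
import numpy as np

DIM = 500  # embedding dimension used when pretraining FastText

# Read the FastText .vec export, skip the "<num_words> <dim>" header line,
# and copy each "word v1 ... v500" row into the TSV.
with open('medical_texts.vec') as src, \
        open('medical_texts_labels.tsv', 'w') as dst:
    next(src)  # drop the header line
    for line in src:
        dst.write(line)

    # Append one randomly initialized row per label at the end of the file.
    labels = ['__label__cardiology', '__label__oncology']  # placeholder labels
    for label in labels:
        vec = np.random.uniform(-0.001, 0.001, DIM)
        dst.write(label + ' ' + ' '.join('%g' % v for v in vec) + '\n')
```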

The model is loaded, and the vocabulary and label sizes seem to be correct. I use the following to load the model and train:

```python
sp = sw.starSpace(arg)
sp.init()
sp.initFromTsv('../models/fast_text__medical_texts_labels.tsv')
sp.train()
```
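For completeness, `arg` is built with the python wrapper's `starwrap` args object, roughly like this (the paths and values here are placeholders, not my exact settings):

```python
import starwrap as sw

arg = sw.args()
arg.trainFile = '../data/medical_texts_train.txt'  # placeholder path
arg.testFile = '../data/medical_texts_test.txt'    # placeholder path
arg.trainMode = 0   # label prediction mode
arg.dim = 500       # matches the dimension of the pretrained vectors
```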

However, I always end up with a segmentation fault:

```
Start to load a trained embedding model in tsv format.
Loading dict from model file : ../models/fast_text_medical_texts_labels.tsv
Number of words in dictionary:  347312
Number of labels in dictionary: 2923
Initialized model weights. Model size :
matrix : 2350235 500
Loading model from file ../models/fast_text_medical_texts_labels.tsv
Model loaded.
Training epoch 0: 0.001 3.33333e-06
Segmentation fault
```

What also really confuses me is the matrix size: the first dimension (2350235) is much larger than the number of words plus labels (347312 + 2923 = 350235). Am I missing something here?

Another odd observation: when I specify a test file, StarSpace loads the test instances, but it does not load the training instances from the training file I specified. When I train without initFromTsv, everything works as expected.

Thanks!

Best, Sven