Romanian Word Embeddings

These vectors was trained with 3 different methods (CBOW, Skip-Gram, FastText) from Gensim library. The dataset is a bunch of text that was taken from internet (news, comments, blogs etc.).

Please notice I do not claim that these vectors are the best for romanian language!

About Dataset

The text is pre processed and cleaned.

784.150.193 - Sentences (One of purpose is the sentence need to be bigger than 35 characters, including the spaces)
11.628.712.127 - Words
1.311.442 - Unique words
14,82 - AVG number of words in sentence

Method	Size	Min_count	Window	SET1 - Precision	SET2 - Precision	Download	Size
CBOW	300	25	5	14%	23%	Download	4.2 GB
CBOW	300	25	15	66%	91%	Download	4.2 GB
CBOW	300	25	20	67%	93%	Download	4.2 GB
Skip-Gram	300	25	5	71%	92%	Download	4.2 GB
Skip-Gram	300	25	15	79%	98%	Download	4.2 GB
Skip-Gram	300	25	20	79%	98%	Download	4.2 GB
FastText	300	25	5	66%	95%	Download	6.29 GB
FastText	300	25	15	72%	97%	Download	6.29 GB
FastText	300	25	20	74%	98%	Download	6.29 GB

SET1 and SET2 are sets with questions-answer with country and capitals, that was made by Romanian Academy (They have their own vectors, you can check it right there CoRoLa).

Example:

austria - vienna + amsterdam = netherlands (eng).
austria - viena + amsterdam = olanda (rom).

SET1 - 1892 analogies for European countries and their capitals

SET2 - 462 analogies for European countries and their capitals (subset of SET1)

from gensim.models import Word2Vec

model = Word2Vec.load('SG_300_20_15.model')

resultQuery = model.wv.most_similar('**WORD**')

for result in resultQuery:
    print(result)

In: spania
Out:
('italia', 0.8326004147529602)
('portugalia', 0.8248708248138428)
('castilla-leon', 0.7556794285774231)
('belgia', 0.7364105582237244)
('argentina', 0.7281147241592407)
('spania-', 0.727818489074707)
('brazilia', 0.7213218212127686)
('olanda', 0.6885160207748413)
('germania', 0.6858677864074707)
('anglia', 0.6833646297454834)

In: ilie
Out:
('adrian', 0.7264171242713928)
('andrei', 0.7138616442680359)
('valentin', 0.6969763040542603)
('dumitru', 0.673446536064148)
('llie', 0.6705739498138428)
('nicolae', 0.6643682718276978)
('vasile', 0.6577962636947632)
('marian', 0.6359540224075317)
('constantin', 0.6084895133972168)
('nicu', 0.6063842177391052)

In: ruble
Out: 
('grivne', 0.795340359210968)
('hrivne', 0.7273794412612915)
('copeici', 0.7101361155509949)
('dolari', 0.6791703104972839)
('yuani', 0.6516059041023254)
('rublă', 0.6284367442131042)
('kopeici', 0.6272767186164856)
('zloţi', 0.6005615592002869)
('usd', 0.5963905453681946)
('piaştri', 0.5942535996437073)

In: fizician
Out:
('matematician', 0.6948787569999695)
('savant', 0.6890636086463928)
('fizicianul', 0.6560385823249817)
('inventator', 0.653334379196167)
('astrofizician', 0.644870936870575)
('chimist', 0.6142269372940063)
('astronom', 0.6096892356872559)
('filozof', 0.604558527469635)
('teoretician', 0.6006152629852295)
('cercetător', 0.5948916673660278)

BlackKakapo / Romanian-Word-Embeddings

readme

Romanian Word Embeddings

About Dataset

Example: