fnielsen / ordia

Wikidata lexemes presentations
https://ordia.toolforge.org
Apache License 2.0

FastText integration #110

Open fnielsen opened 3 years ago

fnielsen commented 3 years ago

SELECT ?lexeme ?form ?representation {
  ?lexeme dct:language wd:Q9035 ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation .
}

from os.path import expanduser
from wikidata2df import wikidata2df
from gensim.models import KeyedVectors

sparql = """
SELECT ?lexeme ?form ?representation {
  ?lexeme dct:language wd:Q9035 ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation .
}
"""

df = wikidata2df(sparql)

danish_wikidata_words = set(df.representation.values)

fasttext_input_filename = expanduser('~/data/fasttext/cc.da.300.vec')
fasttext_output_filename = expanduser('~/data/fasttext/cc.da.300.wikidata.vec')
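# Keep only the vectors whose word is a form of a Danish Wikidata lexeme.
with open(fasttext_input_filename, encoding='utf-8') as fin:
    header = next(fin)  # first line: "<vocabulary size> <dimension>"
    dimension = header.split()[1]
    kept = [line for line in fin
            if line.split(' ', 1)[0] in danish_wikidata_words]

# Write a header with the number of vectors actually kept, so the
# count in the file matches its content when it is loaded below.
with open(fasttext_output_filename, 'w', encoding='utf-8') as fout:
    fout.write('{} {}\n'.format(len(kept), dimension))
    fout.writelines(kept)
print((len(kept), int(dimension)))

>>> model = KeyedVectors.load_word2vec_format(fasttext_output_filename)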
>>> model.most_similar('tvivlsom')
[('tvivlsomme', 0.5897641181945801), ('tvivlsomt', 0.5589535236358643), ('dårlig', 0.4764111638069153), ('sandsynlig', 0.45774227380752563), ('alvorlig', 0.4522080421447754), ('usikker', 0.44048911333084106), ('reel', 0.44022220373153687), ('høj', 0.43882137537002563), ('betydelig', 0.43797844648361206), ('kvalitet', 0.4359009563922882)]
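
The filtering step can also be sketched as a small function over file-like objects, which makes it possible to try it out without downloading cc.da.300.vec or querying Wikidata. This is only an illustration under assumptions: `filter_vec` is a hypothetical helper (not part of Ordia, gensim or fastText), and the toy vectors are made-up 3-dimensional values standing in for the real 300-dimensional ones.

```python
import io


def filter_vec(fin, fout, keep_words):
    """Copy only the vectors for keep_words from fin to fout.

    fin and fout are text streams in word2vec text format: a header
    line "<count> <dimension>" followed by one "<word> <v1> ... <vn>"
    line per word. Returns the number of vectors written.
    """
    header = next(fin)
    dimension = header.split()[1]
    # The word is everything before the first space on each line.
    kept = [line for line in fin if line.split(' ', 1)[0] in keep_words]
    # The header is rewritten so the count matches the filtered content.
    fout.write(f"{len(kept)} {dimension}\n")
    fout.writelines(kept)
    return len(kept)


# Toy input standing in for cc.da.300.vec (made-up values).
toy_vec = io.StringIO(
    "4 3\n"
    "hus 0.1 0.2 0.3\n"
    "the 0.4 0.5 0.6\n"
    "bil 0.7 0.8 0.9\n"
    "xyz 1.0 1.1 1.2\n"
)
filtered = io.StringIO()

# "kat" is in the word set but not in the vectors, so it is simply skipped.
n_kept = filter_vec(toy_vec, filtered, {"hus", "bil", "kat"})
print(n_kept)
print(filtered.getvalue())
```

With real data, `keep_words` would be the `danish_wikidata_words` set from the query above and the streams would be the opened input and output files. Note that forms consisting of several words can never match with this approach, since each line of the vector file holds a single token.
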
dpriskorn commented 3 years ago

What is this good for? What is FastText?

dpriskorn commented 3 years ago

I looked it up and found https://fasttext.cc/docs/en/crawl-vectors.html, and now I understand. So you want Ordia to output models based on lexemes? They would be CC0 and therefore unique, I guess.