facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Pre-trained wiki models contain bad words due to bad characters #284

Open matteoredaelli opened 7 years ago

matteoredaelli commented 7 years ago

For example, stray characters such as † # / » end up attached to words in the vocabulary:

model.most_similar("milano")
[('†milano', 0.7980508208274841), ('milano,', 0.7701622247695923), ('milano,de', 0.7669407725334167), ('–milano', 0.7577754259109497), ('milanoli', 0.7502109408378601), (',milano', 0.7345211505889893), ('milano,mursia', 0.7304637432098389), ('milanon', 0.7298922538757324), ('emilano', 0.7196317315101624), ('/milano', 0.7001037001609802)]

model.most_similar("grigio")
[('grigio#grigio', 0.8390243053436279), ('bianco/grigio', 0.8058035969734192), ('blu/grigio', 0.7966246604919434), ('scuro', 0.7949448227882385), ('bigrigio', 0.7486952543258667), ('grigiastro', 0.7412316203117371), ('rossastro', 0.7349599003791809), ('grigio/marrone', 0.7331645488739014), ('marrone', 0.7275819778442383), ('colore', 0.7251174449

[('juventus»', 0.8815842866897583), ('juventusque', 0.8325179219245911), ('#juventus', 0.8293249607086182), ('sampdoria', 0.798779308795929), ('bianconera', 0.7963981628417969), ('bianconeri', 0.791951060295105), ('juventina', 0.7786478996276855), ('juventute', 0.7710845470428467), ('juventa', 0.7657526135444641), ('juventini', 0.7571936249732971)]

model.most_similar("violino")

EdouardGrave commented 7 years ago

Hi @matteoredaelli,

Thank you for opening the issue.

We are aware of this problem, and are working on better pre-trained word vectors, which should be available soon!

Best, Edouard.
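
A possible stopgap while the current vectors are in use is to post-filter the neighbours returned by most_similar and drop any token that contains punctuation or symbol characters. The following is only a minimal sketch, assuming the results are (word, score) pairs as in the output above. The helper names is_clean_token and filter_neighbours are hypothetical and not part of fastText or any wrapper library.

```python
import unicodedata


def is_clean_token(token):
    """Return True if every character is a letter, digit, or combining mark.

    Punctuation and symbols such as '†', '#', '/', '»', ',' and '–'
    fall into Unicode categories P* or S*, so they are rejected.
    """
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("L", "N", "M") for ch in token
    )


def filter_neighbours(neighbours):
    """Keep only (word, score) pairs whose word passes is_clean_token."""
    return [(word, score) for word, score in neighbours if is_clean_token(word)]


# A few of the neighbours reported for "milano" in this issue.
neighbours = [
    ("†milano", 0.7980508208274841),
    ("milano,", 0.7701622247695923),
    ("milanoli", 0.7502109408378601),
    ("/milano", 0.7001037001609802),
]

cleaned = filter_neighbours(neighbours)
print(cleaned)
```

This only hides tokens with stray characters. It does not remove near-duplicates such as 'milanoli' or 'emilano', and it would also drop legitimate hyphenated or apostrophised words, so the character classes may need adjusting per language.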