PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Aligned FastText embeddings #19

Closed · floscha closed this 6 years ago

floscha commented 6 years ago

Adds a boolean aligned option to the FastText object's constructor.

If set to True, the FastText embeddings will be initialized with the aligned MUSE embeddings (see details here).

If not specified or set to False, the regular FastText embeddings are used, so the PR does not break any existing code.

Example usage:


>>> from torchnlp.word_to_vector import FastText
>>> from scipy.spatial.distance import euclidean as dist

>>> # Load aligned FastText embeddings for English and French
>>> en_vectors = FastText(aligned=True)
>>> fr_vectors = FastText(language='fr', aligned=True)

>>> # Compare the Euclidean distances of semantically related vs. unrelated words
>>> dist(en_vectors['car'], fr_vectors['voiture'])
0.61194908618927
>>> dist(en_vectors['car'], fr_vectors['baguette'])
1.2417925596237183
PetrochukM commented 6 years ago

Rerunning these tests; there was a bug that prevented your test from being run on Travis.


I suspect your test will fail because you did not add a mock file for fast_text: "pytorch-nlp/tests/_test_data/fast_text/wiki.simple.vec"

The idea of the "mock file" and "mock_urlretrieve" is to mock a download; therefore, you need to download the aligned FastText embeddings and add them to the "pytorch-nlp/tests/_test_data/fast_text/" folder.
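Roughly, the pattern looks like this (an illustrative sketch only; the patch target, fixture names, and paths here are hypothetical and may differ from the actual mock_urlretrieve fixture in the repo):

>>> # Sketch of the mock-download pattern (hypothetical names; the real
>>> # mock_urlretrieve fixture and paths in pytorch-nlp may differ).
import os
import shutil
from unittest import mock

MOCK_DATA_DIR = 'tests/_test_data/fast_text'  # folder holding the small mock .vec file


def fake_urlretrieve(url, filename=None, **kwargs):
    """Copy a local mock file into place instead of downloading the real embeddings."""
    shutil.copy(os.path.join(MOCK_DATA_DIR, os.path.basename(filename)), filename)
    return filename, None


# The patch target depends on where the library imports urlretrieve from.
@mock.patch('urllib.request.urlretrieve', side_effect=fake_urlretrieve)
def test_fast_text_aligned(mock_urlretrieve, tmpdir):
    from torchnlp.word_to_vector import FastText
    vectors = FastText(language='simple', aligned=True, cache=str(tmpdir))
    assert len(vectors['the']) > 0

This way the test exercises the same code path as a real download, but only ever touches the small fixture file checked into the repo.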


Thanks for your work! I appreciate your time and thoughtful pull request.

PetrochukM commented 6 years ago

Please rebase on master when you get a chance (git rebase master)! I renamed the tests/embeddings folder to tests/word_to_vector.

floscha commented 6 years ago

Thanks for your explanation on how you use mock files for testing. I wasn't aware of this before. I have now fixed the test according to your instructions and it passes.

Unfortunately, after rebasing, tests/nn/test_attention.py::TestAttention::test_forward now fails for Python 3.5 only (see https://travis-ci.org/PetrochukM/PyTorch-NLP/jobs/365802591#L855). Any idea what caused this test to break?

PetrochukM commented 6 years ago

Looks like it is a rounding error, WOW! I'll fix it. Sorry about that!

(screenshot of the failing test assertion)
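In case it helps, one common way to make that kind of test robust is to compare with a tolerance instead of exact equality (an illustrative sketch; the actual assertion and values in tests/nn/test_attention.py may differ):

>>> # Sketch: compare floating-point results with a tolerance so small rounding
>>> # differences across Python/PyTorch versions don't break the test.
import torch
import torch.nn.functional as F


def test_attention_weights_sum_to_one():
    # Hypothetical stand-in for the attention output checked in test_forward.
    attention_weights = F.softmax(torch.randn(1, 5), dim=-1)
    assert torch.allclose(attention_weights.sum(), torch.tensor(1.0), atol=1e-5)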

codecov-io commented 6 years ago

Codecov Report

Merging #19 into master will increase coverage by 0.01%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
+ Coverage   94.45%   94.46%   +0.01%     
==========================================
  Files          54       54              
  Lines        1515     1518       +3     
==========================================
+ Hits         1431     1434       +3     
  Misses         84       84
Impacted Files                          Coverage Δ
torchnlp/word_to_vector/fast_text.py    100% <100%> (ø) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update cf2dc46...f0d4003.

PetrochukM commented 6 years ago

You did it! Thanks for your contribution!