dkirkby opened this issue 7 years ago
The AI is getting a bit slow, especially with the larger corpus. This issue is to run a profile and see if there are any obvious bottlenecks that would be easy to improve.
Use the following command for profiling:
python -m cProfile -o profile.out ./evaluate.py -i word2vec.dat.4 --top-pairs 20
This process runs at 100% CPU (so it is single-threaded) and uses ~390 MB of memory.
Here are the top 10 functions, sorted by internal time:
In [1]: import pstats
In [2]: p = pstats.Stats('profile.out')
In [3]: p.sort_stats('time').print_stats(10)
Mon Jan 16 08:16:38 2017 profile.out
197501073 function calls (197470716 primitive calls) in 1255.716 seconds
Ordered by: internal time
List reduced from 5678 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
22759354 642.944 0.000 642.944 0.000 {numpy.core.multiarray.dot}
80201 374.755 0.005 374.755 0.005 {method 'argsort' of 'numpy.ndarray' objects}
80200 55.315 0.001 1250.953 0.016 ./model.py:50(get_clue)
8301684 28.906 0.000 43.819 0.000 /Users/david/anaconda/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py:1690(apply_rules)
8180000 27.391 0.000 136.882 0.000 ./model.py:31(get_stem)
7704758 23.065 0.000 23.065 0.000 {method 'reduce' of 'numpy.ufunc' objects}
8168837 20.798 0.000 85.208 0.000 /Users/david/anaconda/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py:1679(_morphy)
8334300 18.429 0.000 20.591 0.000 /Users/david/anaconda/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py:1696(filter_forms)
73440688 14.319 0.000 14.319 0.000 {method 'endswith' of 'unicode' objects}
7559718 11.497 0.000 37.655 0.000 /Users/david/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2300(amin)
This indicates that most of the time (642.9 s in np.dot plus 374.8 s in argsort, about 81% of the 1255.7 s total) is spent in these two lines:
# Calculate the cosine distances between the mean vector and all
# the words in our vocabulary.
cosines = np.dot(self.model.syn0norm[:, np.newaxis],
                 mean_vector).reshape(-1)
# Sort the vocabulary by decreasing cosine similarity with the mean.
closest = np.argsort(cosines)[::-1]
This also shows that the stemming is not a bottleneck.
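Side note, not something tried yet: if get_clue only ever walks the first few entries of closest, the full argsort over the whole vocabulary could probably be replaced by np.argpartition, which selects the top k in linear time. That would only shave off the 374.8 s argsort term, not the larger 642.9 s np.dot term, so the subset idea below is still the main target. A minimal sketch, assuming cosines is the similarity array computed above and that downstream code never looks past the first k candidates (top_k_closest is a made-up helper, not in model.py):
import numpy as np
def top_k_closest(cosines, k):
    # Indices of the k largest cosines in decreasing order,
    # without sorting the whole array.
    top = np.argpartition(cosines, -k)[-k:]
    return top[np.argsort(cosines[top])[::-1]]
# e.g. closest = top_k_closest(cosines, k=1000) instead of
# closest = np.argsort(cosines)[::-1]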
A tree decomposition of the normalized embedding vectors for all words in the vocabulary should help speed this up a lot. The basic idea is to quickly find a subset of words that are "close" to mean_vector, and then restrict the np.dot and np.argsort operations to this subset.
TODO: try this idea out using a scipy KD-tree.
I just noticed this comment in the KD-tree docs, but I think it is still worth trying this out:
For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science.
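Despite that caveat, here is a rough sketch of what the restriction could look like, just to make the TODO concrete. Everything below is an assumption-laden illustration rather than working code from the repo: it assumes scipy is installed, that self.model.syn0norm holds unit-normalized vectors (as the cosine calculation above already requires), and it uses the fact that for unit vectors ||a - b||^2 = 2 - 2*cos(a, b), so a cosine cutoff translates into a Euclidean ball query on a cKDTree.
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once, after the model is loaded.
tree = cKDTree(self.model.syn0norm)

# A cosine-similarity cutoff of 0.3 (illustrative value) corresponds to a
# Euclidean radius of sqrt(2 - 2*0.3) for unit vectors.
min_cosine = 0.3
radius = np.sqrt(2.0 - 2.0 * min_cosine)

# Candidate words within that radius of the (normalized) mean vector.
unit_mean = mean_vector / np.linalg.norm(mean_vector)
candidates = np.asarray(tree.query_ball_point(unit_mean, radius), dtype=int)

# Restrict the dot product and the sort to the candidate subset.
cosines = np.dot(self.model.syn0norm[candidates], unit_mean)
closest = candidates[np.argsort(cosines)[::-1]]
Whether this actually beats brute force for vectors of this dimensionality is exactly the open question flagged in the docs, so it would need to be timed against the current code.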