dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[PERFORMANCE] Improve vocab lookup performance by working with a dict() directly #1382

Closed. shishirb126 closed this 3 years ago

shishirb126 commented 3 years ago

Description

The v0.x vocab implementation uses a custom dict implementation to handle the case where an unknown token is specified. When looking up multiple tokens, it is faster to work with a plain Python dict directly. A similar change has already been made in the master branch; this change backports it to the v0.x branch.
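To illustrate the idea, here is a minimal sketch and not the actual gluon-nlp code. The names `FallbackDict`, `lookup_via_subclass`, and `lookup_via_plain_dict` are hypothetical and chosen only for this example. It assumes the custom dict overrides `__getitem__` in Python to supply the unknown-token index, and contrasts that with calling the built-in `dict.get` once per token on a plain dict.

```python
from typing import Dict, Hashable, Iterable, List


class FallbackDict(dict):
    """Hypothetical dict subclass that returns a default index for missing keys."""

    def __init__(self, default: int, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._default = default

    def __getitem__(self, key: Hashable) -> int:
        # Python-level override: every lookup pays for an extra method call.
        return self.get(key, self._default)


def lookup_via_subclass(token_to_idx: FallbackDict, tokens: Iterable[Hashable]) -> List[int]:
    # Each element goes through the overridden __getitem__.
    return [token_to_idx[token] for token in tokens]


def lookup_via_plain_dict(
    token_to_idx: Dict[Hashable, int], tokens: Iterable[Hashable], unk_idx: int
) -> List[int]:
    # Bind the built-in dict.get once, then call it directly per token.
    get = token_to_idx.get
    return [get(token, unk_idx) for token in tokens]


if __name__ == "__main__":
    mapping = {"<unk>": 0, "a": 1, "b": 2, "c": 3}
    tokens = ["a", "c", "zzz", "b"]
    print(lookup_via_subclass(FallbackDict(0, mapping), tokens))
    print(lookup_via_plain_dict(mapping, tokens, unk_idx=0))
```

Both functions return the same indices. The second avoids the per-token Python-level `__getitem__` call, which is the kind of overhead that matters most when looking up long token lists.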

Results from a microbenchmark are below. Current implementation:

In [4]: v = Vocab({k:1 for k in ['a', 'b', 'c']})
In [5]: %timeit v['a']
854 ns ± 0.819 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit v['abc']
861 ns ± 8.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: keys=['a', 'c', 'c', 'b', 'c', 'c', 'c', 'c', 'a', 'b'] * 1000
In [8]: %timeit v[keys]
3.98 ms ± 9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: v = Vocab({k:1 for k in ['a', 'b', 'c']}, unknown_token=None)
In [10]: %timeit v['a']
516 ns ± 0.762 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [11]: %timeit v[keys]
687 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With the patch:

In [10]: v = Vocab({k:1 for k in ['a', 'b', 'c']})
In [11]: %timeit v['a']
734 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [12]: %timeit v['abc']
734 ns ± 2.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: keys=['a', 'c', 'c', 'b', 'c', 'c', 'c', 'c', 'a', 'b'] * 1000
In [14]: %timeit v[keys]
1.43 ms ± 2.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: v = Vocab({k:1 for k in ['a', 'b', 'c']}, unknown_token=None)
In [16]: %timeit v['a']
598 ns ± 3.29 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [17]: %timeit v[keys]
626 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
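
For convenience, the before/after ratios implied by the mean timings above can be computed with a short script. This is only arithmetic on the reported numbers; the labels and the `speedup` helper are introduced here for illustration. A ratio above 1 means the patched version is faster, and a ratio below 1 means it is slower.

```python
# Mean timings copied from the two benchmark runs above, in nanoseconds:
# (current implementation, with the patch).
timings_ns = {
    "single token, default unknown_token": (854.0, 734.0),
    "10,000 tokens, default unknown_token": (3.98e6, 1.43e6),
    "single token, unknown_token=None": (516.0, 598.0),
    "10,000 tokens, unknown_token=None": (687e3, 626e3),
}


def speedup(before_ns: float, after_ns: float) -> float:
    """Return how many times faster the patched run is (below 1 means slower)."""
    return before_ns / after_ns


if __name__ == "__main__":
    for label, (before_ns, after_ns) in timings_ns.items():
        print(f"{label}: {speedup(before_ns, after_ns):.2f}x")
```

By this measure the list lookup with the default unknown token improves the most (about 2.78x), while the single-token lookup with `unknown_token=None` is somewhat slower with the patch (about 0.86x).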

cc @dmlc/gluon-nlp-team

leezu commented 3 years ago

This PR is currently blocked by an unrelated CI issue. I opened https://github.com/dmlc/gluon-nlp/pull/1383/ to attempt fixing it.

mli commented 3 years ago

Job PR-1382/2 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1382/2/index.html

mli commented 3 years ago

Job PR-1382/3 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1382/3/index.html