NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Separate OOV from PAD in the vocabulary #693

Closed faneshion closed 5 years ago

faneshion commented 5 years ago

It is important to give OOV and PAD different indices. Keras assumes that PAD sits at index 0, which is the basis for supporting mask_zero in the Embedding layer. I have therefore placed PAD at index 0 and OOV at index 1 in the Vocabulary. With this change, I have seen MVLSTM improve on WikiQA (e.g. from 0.63 to 0.66). More experiments still need to be run on the WikiQA tutorials.
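To illustrate the index layout described above, here is a minimal sketch in plain Python. It is not the code from this PR: the token strings `<PAD>` and `<OOV>` and the helpers `build_term_index` and `encode` are hypothetical names chosen for the example. It only shows why the two indices must differ: with PAD at 0 and OOV at 1, a mask built on index 0 (which is what mask_zero does in the Keras Embedding layer) hides padding but keeps unknown words.

```python
# Hypothetical token names and indices, used only for this illustration.
PAD, OOV = "<PAD>", "<OOV>"
PAD_INDEX, OOV_INDEX = 0, 1


def build_term_index(tokens):
    """Map each distinct token to an index, reserving 0 for PAD and 1 for OOV."""
    term_index = {PAD: PAD_INDEX, OOV: OOV_INDEX}
    for term in sorted(set(tokens)):
        # Real terms start at index 2; reserved tokens keep their slots.
        term_index.setdefault(term, len(term_index))
    return term_index


def encode(tokens, term_index, length):
    """Convert tokens to indices, then truncate or right-pad to a fixed length."""
    ids = [term_index.get(token, OOV_INDEX) for token in tokens][:length]
    return ids + [PAD_INDEX] * (length - len(ids))


term_index = build_term_index(["how", "are", "glaciers", "formed"])

# "icebergs" was not seen when the vocabulary was built, so it maps to OOV.
ids = encode(["how", "are", "icebergs", "formed"], term_index, length=6)

# A zero-based mask now drops only the padding positions.
mask = [index != PAD_INDEX for index in ids]

print(ids)   # [5, 2, 1, 3, 0, 0]
print(mask)  # [True, True, True, True, False, False]
```

If OOV shared index 0 with PAD, the third position in this example would be masked as well, and the model would never see that an unknown word was present.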

codecov-io commented 5 years ago

Codecov Report

Merging #693 into 2.2-dev will decrease coverage by 0.65%. The diff coverage is 100%.


@@             Coverage Diff             @@
##           2.2-dev     #693      +/-   ##
===========================================
- Coverage     96.1%   95.45%   -0.66%     
===========================================
  Files           83       83              
  Lines         2541     2595      +54     
===========================================
+ Hits          2442     2477      +35     
- Misses          99      118      +19

Impacted Files                                          Coverage Δ
matchzoo/preprocessors/cdssm_preprocessor.py            97.67% <100%> (ø) :arrow_up:
matchzoo/preprocessors/units/word_hashing.py            100% <100%> (ø) :arrow_up:
matchzoo/preprocessors/basic_preprocessor.py            100% <100%> (ø) :arrow_up:
matchzoo/embedding/embedding.py                         100% <100%> (ø) :arrow_up:
matchzoo/models/mvlstm.py                               100% <100%> (ø) :arrow_up:
matchzoo/preprocessors/units/vocabulary.py              100% <100%> (ø) :arrow_up:
matchzoo/preprocessors/dssm_preprocessor.py             97.29% <100%> (ø) :arrow_up:
matchzoo/engine/base_model.py                           86.01% <0%> (-8.66%) :arrow_down:
...tchzoo/data_generator/callbacks/lambda_callback.py   100% <0%> (ø) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 48f1a3c...ec175cb.