Closed WrRan closed 5 years ago
Yes, we've hit this issue multiple times, and I think adding a `min_width` parameter where you suggest is a good idea. PR welcome. I'd call it `min_width` or `min_num_characters`, not `mini_width`, and give it a default value of 0.
I am working on it. I think `min_padding_length` is a better name. And it may be better to provide no default value for this parameter, because the absence of a default forces users to check that the maximum value of `ngram_filter_sizes` (in `cnn_encoder`) equals the value of `min_padding_length`. Otherwise, this subtle issue may be hit again after a "usual" training process. What do you think? @matt-gardner
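For concreteness, a minimal sketch of what this could look like (hypothetical code, not AllenNLP's actual `TokenCharactersIndexer`; the class name and method shape here are simplified assumptions):

```python
# Hypothetical sketch of an indexer that enforces a padding floor.
class CharacterIndexerSketch:
    def __init__(self, min_padding_length: int):
        # Deliberately no default value: the caller must look up
        # max(ngram_filter_sizes) of the cnn_encoder and pass it here.
        self._min_padding_length = min_padding_length

    def get_padding_lengths(self, token: str) -> dict:
        # Pad each word to at least min_padding_length characters,
        # even when the word itself is shorter (e.g. "I").
        return {"num_token_characters": max(len(token), self._min_padding_length)}

indexer = CharacterIndexerSketch(min_padding_length=4)
print(indexer.get_padding_lengths("I"))        # {'num_token_characters': 4}
print(indexer.get_padding_lengths("awesome"))  # {'num_token_characters': 7}
```

With no default, forgetting to think about this value becomes a constructor error instead of a silent runtime failure at prediction time.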
**Describe the bug** I used a simple CNN on a text classification task. It worked well during training, but it broke when predicting.
**To Reproduce** For convenience, I provide a test case at allennlp-demo-producing-bug. Steps to reproduce the behavior:

```shell
git clone https://github.com/WrRan/allennlp-demo-producing-bug.git demo
cd demo
```
**Expected behavior** Because the test data is the same as the training data, it is unexpected to get a `RuntimeError`. (Moreover, the `predictor` used in `allennlp predict` is trivial.)

**System (please complete the following information):**
**Additional context** I think the error comes from the `TokenCharactersIndexer` under some edge cases and default settings. The error is produced by the settings and a special data point: the `cnn_encoder` is configured with a setting that requires each word to consist of no fewer than 4 characters.

configs: https://github.com/WrRan/allennlp-demo-producing-bug/blob/685dcffd3755da60fe3fb360aa6f8338571ce86d/config/cnn_classifier.json#L49-L51
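The linked lines contain the actual settings; purely for illustration (the values below are assumed, not copied from the repo), a `cnn_encoder` block of this shape requires every padded word to span at least `max(ngram_filter_sizes)` characters, since a convolution filter of width 4 cannot slide over an input shorter than 4:

```json
{
  "encoder": {
    "type": "cnn",
    "embedding_dim": 50,
    "num_filters": 100,
    "ngram_filter_sizes": [4]
  }
}
```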
However, there is an unexpected data point: `I I I I I I I.`. Thus, prediction is broken.

But the training process works well. Why? It is because `batch_size` is set to `64`, and the sentence `I I I I I I I.` is fed to the model together with another good sentence, `The allennlp is awesome.`. Under such a case, the class `TokenCharactersIndexer` produces `padding_lengths` of `7` (which is the length of the word `awesome` in the batch).

**Suggestions** I think the key is the design of the class
`TokenCharactersIndexer`, which produces `padding_lengths` by taking only a single data point into consideration (in method `get_padding_lengths`). So it may be a good idea to add a parameter `mini_width` or something else to `TokenCharactersIndexer`'s initializer. In that case, we can set this parameter in the config, and the class `TokenCharactersIndexer` would enforce it as a lower bound when computing padding lengths.

The above solution copes with my issue. And I think giving `mini_width` no default value is a better idea. Does this make sense? Looking forward to hearing from you. Thanks.