**Closed** · opened by david-waterworth · closed 3 years ago
@david-waterworth There is a field for `max_sequence_length` in the `text_classification_json` reader, but not a `min_sequence_length`. I think the simplest fix will be to add a field like that in the dataset reader (or create a new dataset reader), and deal with the padding accordingly.
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
@AkshitaB Yes, that should work fine, provided the tokeniser API allows you to obtain the padding token?
I've created a text classifier based on the `cnn-highway` encoder. It more or less works, but it sometimes crashes during training or evaluation because the input text isn't long enough for the CNN window size. It's mostly a problem during evaluation, where I've been evaluating single strings; during training, batching tends to add enough padding to mask the issue. The error is:

The model is below. How do you pad each input sequence to a minimum length so the model doesn't occasionally fail?