delip / PyTorchNLPBook

Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L
Apache License 2.0
1.96k stars 799 forks source link

Agnews Classifier Predict error #12

Open sudarshan85 opened 5 years ago

sudarshan85 commented 5 years ago

Hello, Thanks for a great book. I've been working through the examples and its really helpful. One issue I noticed in the agnews classifier, is when I was running through the predict_category function I got an error when trying to predict one of the sports category:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1b3d5180f20c> in <module>
      3   print("="*30)
      4   for sample in sample_group:
----> 5     pred = predict_category(sample, classifier, dc.vectorizer, dc._train_ds.max_seq_length+1)
      6     print(f"Prediction: {pred['category']} (p={pred['prob']:0.2f})")
      7     print(f"\t + Sample: {sample}")

<ipython-input-16-75dc8b0470ad> in predict_category(title, classifer, vectorizer, max_length)
     12   """
     13   title = preprocess_text(title)
---> 14   vectorized_title = torch.tensor(vectorizer.vectorize(title, max_length))
     15 
     16   # add batch dim so you have a batch of size 1

~/nlpbook/ag/ag/vectorizer.py in vectorize(self, title, vector_len)
     33 
     34     out_vector = np.zeros(vector_len, dtype=np.int64)
---> 35     out_vector[:len(vector)] = vector
     36     out_vector[len(vector):] = self.title_vocab.mask_idx
     37 

ValueError: cannot copy sequence with size 22 to array axis with dimension 21

The max_seq_length in the training set was 20, and in this line in the text, we pass in max_seq_length+1 effectively making the sequence 21 tokens long:

prediction = predict_category(sample, classifier, vectorizer, dataset._max_seq_length + 1)

However, the required length is 22, so when I changed the function call to pass max_seq_length+2, it worked. This begs the more general question:

When the test data's max_seq_length could potentially be larger than that of the training data, what do we do? How do we handle that? Do we just pass in a larger value for the max_seq_length? Even if we do that, how do we foresee how big of a value we might need?

Thanks.