Closed: orenmel closed this issue 5 years ago
This is an issue that we will address in the future (when the course turns toward NLP probably).
I had the same issue. If you want your data in the right order, you just need the sampler. Fortunately, it is saved during the creation of your TextClasDataBunch. If data_clas is the name of your TextClasDataBunch,

```python
data_clas.valid_dl.sampler
```

gives you the sampler for the validation set (replace valid_dl with test_dl if you want the sampler for the test set). If you want the sampler as a list, do

```python
sampler = list(data_clas.valid_dl.sampler)
```

This gives you the order in which your predictions come back: the i-th element of sampler is the position of the i-th row of predictions in the original order.
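As a toy illustration of that mapping (the values here are made up, not from a real DataBunch): to undo the shuffling, write each prediction row back to the original index that the sampler says it came from.

```python
# Toy sampler: prediction row 0 came from original row 2, row 1 from row 0, etc.
sampler = [2, 0, 1]
preds = ['p2', 'p0', 'p1']              # predictions in sampler order

# Scatter each prediction back to its original position
ordered = [None] * len(preds)
for batch_pos, orig_idx in enumerate(sampler):
    ordered[orig_idx] = preds[batch_pos]
# ordered == ['p0', 'p1', 'p2']
```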
@StatisticDean Thx for solving this! Just had the same issue. This should be working code:

```python
# Map each original row index to its position in the sampler (batch) order
sort_sample_mapping = {}
for count, i in enumerate(rnnlearner.data.valid_dl.sampler):
    sort_sample_mapping[i] = count

preds, y_true = rnnlearner.get_preds()

# Predicted labels (argmax over class probabilities) and ground truth,
# still in sampler order
y_predicted_unsorted = preds.argmax(dim=1).tolist()
y_gt_unsorted = y_true.tolist()

# Re-order both lists back to the original row order
y_predicted_sorted = [y_predicted_unsorted[sort_sample_mapping[i]]
                      for i in range(len(y_predicted_unsorted))]
y_gt_sorted = [y_gt_unsorted[sort_sample_mapping[i]]
               for i in range(len(y_gt_unsorted))]
```
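The re-ordering loop above can also be done in one step: `np.argsort(sampler)` is the inverse permutation of the sampler, so indexing with it restores the original order. A sketch with made-up values (NumPy is not used in the snippet above; this is just an equivalent alternative):

```python
import numpy as np

# Hypothetical sampler: prediction row i corresponds to original row sampler[i]
sampler = np.array([2, 0, 1])
preds = np.array([0.2, 0.9, 0.5])   # predictions in sampler order

# argsort(sampler) is the inverse permutation, so this restores original order
ordered = preds[np.argsort(sampler)]
# ordered == [0.9, 0.5, 0.2]
```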
Thanks, @StatisticDean! That solved my problem.
Great idea! Anyone interested in submitting a test and fix as a PR?
Sure. I couldn't find guidelines on how we're supposed to write tests; is there a link that explains what they should look like? Concerning the fix, should I add a get_preds_ordered method to RNNLearner.classifier, or override the get_preds method for RNNLearner? And if I override it, should I make "ordered" a parameter of get_preds, or just make it return ordered predictions in all cases?
I think overriding get_preds with a new argument ordered=True sounds like the best idea. Also, use the current get_preds method as a base.
Fix has been merged so I'm closing this.
Just wanted to check in about this ordering issue: when I use ordered=True with learn.get_preds() for a text classification task (on a test set), almost all of the predicted classes are the same as those from learn.predict(). In my specific use case, I tested it with 20k test examples, and the labels predicted by the two methods disagree for 18 examples. The class probabilities predicted by the two methods are comparable, but not exactly equal. Here are some more details about my use case:
FastAI version:

```python
fastai.__version__  # '1.0.55'
```

Training procedure:

```python
data_clas = TextClasDataBunch.from_df(path="./", train_df=labelled_data_train,
                                      valid_df=labelled_data_valid,
                                      text_cols="post_title",
                                      label_cols="Theme",
                                      vocab=data_lm.train_ds.vocab,
                                      bs=128)
```
a) Use `learn.get_preds` for prediction:

```python
data_clas = TextClasDataBunch.from_df(path="./", train_df=labelled_data_train,
                                      valid_df=labelled_data_valid,
                                      test_df=unlabelled_data_to_classify,
                                      text_cols="post_title",
                                      label_cols="Theme",
                                      vocab=data_lm.train_ds.vocab,
                                      bs=64)
learn.data = data_clas
learn.get_preds(ds_type=DatasetType.Test, ordered=True)
```
b) Use `learn.predict` with raw text (in the form that was fed into the making of the TextClasDataBunch) one-by-one:
```python
learn.predict(unlabelled_data_to_classify["post_title"][0])
```
The predicted probabilities for the classes are not exactly equal with the two methods, though the predicted classes agree for all but 18 examples. An example of class probabilities predicted by the two methods:
**From learn.get_preds:** `tensor([3.1438e-03, 1.3936e-03, 4.4555e-03, 6.6271e-06, 5.3898e-05, 1.7909e-02, 9.7046e-01, 3.1595e-06, 1.4689e-04, 1.6489e-04, 2.2673e-03])`
**From learn.predict:** `tensor([2.3362e-03, 1.4350e-03, 3.3952e-03, 6.5755e-06, 4.2005e-05, 1.4958e-02, 9.7555e-01, 2.4424e-06, 1.3988e-04, 1.9556e-04, 1.9414e-03])`
Is there some source of randomness that could produce the difference in the predicted probabilities? Or is it because I am not tokenizing the text before feeding it to `learn.predict`? I did that because the examples on the Inference page made it seem that the tokenization can happen within the learner, given that it has a TextClasDataBunch (and the rules needed to make it) associated. Thanks for any clues!
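For what it's worth, a quick way to quantify this kind of disagreement between the two methods is to compare argmax labels and the largest probability gap. A sketch with made-up probability matrices (the variable names are hypothetical, not fastai API):

```python
import numpy as np

# Hypothetical class-probability matrices from the two methods (rows = examples)
probs_get_preds = np.array([[0.10, 0.90], [0.70, 0.30], [0.45, 0.55]])
probs_predict   = np.array([[0.12, 0.88], [0.68, 0.32], [0.52, 0.48]])

labels_a = probs_get_preds.argmax(axis=1)
labels_b = probs_predict.argmax(axis=1)

# Count examples whose predicted class flips between the two methods
disagreements = int((labels_a != labels_b).sum())   # here: 1 (the third row flips)

# Largest elementwise probability difference between the two methods
max_prob_gap = float(np.abs(probs_get_preds - probs_predict).max())
```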
@sgugger @StatisticDean
Yes, it's normal that you see some differences. The predictions with get_preds are batched, so padding is applied to make all texts the same length. This can induce some small changes compared to predict (which is the one you should trust more).
Thanks @sgugger for the explanation. The trouble is that my unlabelled "test set" is pretty big, with more than a million examples. Using predict on an example-by-example basis is pretty slow, which is why I was using get_preds. However, if predict is the more trustworthy method, is there a way to speed it up while getting predictions for a large set of examples?
Nope, though normally the difference should be quasi-nonexistent since the linear decoder ignores the tokens coming from padding. Are you sure you properly passed the padding token index (if it's not 1)? Maybe the difference comes from the fact that the padding is done first; can you also try with padding at the end (by passing pad_first=False to TextClasDataBunch)?
I tried pad_first=False while creating the TextClasDataBunch (passed it as an argument to .from_df()), but that still gave different results from predict and get_preds for a small number of examples (and the class probabilities still aren't the same with the two methods).
I am not sure what you mean by properly passing the padding token index. I couldn't find a reference to that in the examples/documentation; could you link me to an example?
Describe the bug
The predictions returned from Learner.get_preds() are not aligned with the original ordered list of valid/test instances that were loaded to the learner because the valid/test instances are sorted internally by text length.
Expected behavior
List of returned predictions should be ordered according to the order of input instances.
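To illustrate why the order changes, here is a sketch (with made-up texts) of the length-based re-ordering that the internal sampler applies:

```python
# A sampler that sorts by text length (longest first) returns predictions
# in a different order than the input rows
texts = ["a much longer example text", "short", "medium length"]
order = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
# order == [0, 2, 1]: predictions come back for row 0, then row 2, then row 1
```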