fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0

order of predictions in Learner.get_preds() #975

Closed - orenmel closed this 5 years ago

orenmel commented 6 years ago

Describe the bug

The predictions returned from Learner.get_preds() are not aligned with the original order of the valid/test instances that were loaded into the learner, because the valid/test instances are sorted internally by text length.

To Reproduce

    from fastai.text import *  # standard fastai v1 import

    path = untar_data(URLs.IMDB_SAMPLE)
    data_lm = TextLMDataBunch.from_csv(path)
    data_clas = TextClasDataBunch.from_csv(path, vocab=data_lm.train_ds.vocab)
    URLs.download_wt103_model()

    learn = RNNLearner.language_model(data_lm, pretrained_fnames=['lstm_wt103', 'itos_wt103'])
    learn.unfreeze()
    learn.fit(2, slice(1e-4, 1e-2))
    learn.save_encoder('enc')

    learn = RNNLearner.classifier(data_clas)
    learn.load_encoder('enc')
    learn.fit(3, 1e-3)
    preds = learn.get_preds()

Expected behavior

The list of returned predictions should be ordered according to the order of the input instances.

sgugger commented 5 years ago

This is an issue that we will address in the future (probably when the course turns toward NLP).

ludovicschwartz commented 5 years ago

I had the same issue. If you want your data in the right order, you just need the sampler, which is fortunately saved during the creation of your TextClasDataBunch. If data_clas is the name of your TextClasDataBunch,

data_clas.valid_dl.sampler

gives you the sampler for the validation set (replace valid_dl with test_dl if you want the sampler for the test set). If you want the sampler in the form of a list, do

sampler = [i for i in data_clas.valid_dl.sampler]

This gives you the order in which your predictions are returned: the i-th element of sampler is the position, in the original order, of the i-th row of predictions.
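
To apply that mapping in one shot, np.argsort can invert the sampler. A minimal sketch, assuming learn and data_clas from the snippets above:

    import numpy as np
    import torch

    probs, targets = learn.get_preds()               # rows are in sampled (length-sorted) order
    sampler = list(data_clas.valid_dl.sampler)       # sampler[i] = original index of row i
    reverse = torch.from_numpy(np.argsort(sampler))  # reverse[j] = row holding original index j
    probs_ordered = probs[reverse]                   # predictions back in the original order
    targets_ordered = targets[reverse]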

xelda1988 commented 5 years ago

@StatisticDean Thanks for solving this! I just had the same issue. This should be working code:

# Map each original index to its position in the (length-sorted) sampled order.
sort_sample_mapping = {}
for count, i in enumerate(rnnlearner.data.valid_dl.sampler):
    sort_sample_mapping[i] = count

# get_preds returns [probabilities, targets], both in sampled order.
tmp_pred = rnnlearner.get_preds()

y_predicted_unsorted = []
y_gt_unsorted = []
for idx in range(len(tmp_pred[0])):
    probs = tmp_pred[0][idx].tolist()                     # class probabilities for row idx
    y_predicted_unsorted.append(probs.index(max(probs)))  # argmax = predicted class
    y_gt_unsorted.append(tmp_pred[1][idx].item())         # ground-truth label

# Reorder both lists back to the original instance order.
y_predicted_sorted = []
y_gt_sorted = []
for i in range(len(y_predicted_unsorted)):
    y_predicted_sorted.append(y_predicted_unsorted[sort_sample_mapping[i]])
    y_gt_sorted.append(y_gt_unsorted[sort_sample_mapping[i]])

orenmel commented 5 years ago

Thanks, @StatisticDean! That solved my problem.

jph00 commented 5 years ago

Great idea! Anyone interested in submitting a test and fix as a PR?

ludovicschwartz commented 5 years ago

Sure. I couldn't find guidelines on how we're supposed to write tests; is there a link that explains what they should look like? Concerning the fix, should I add a get_preds_ordered method to RNNLearner.classifier, or override the get_preds method on RNNLearner? And if I override it, should I make "ordered" a parameter of get_preds, or just have it return ordered predictions in all cases?

sgugger commented 5 years ago

I think overriding get_preds with a new argument ordered=True sounds like the best idea. Also, use the current get_preds method as a base.
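
A minimal sketch of what such an override could look like (hypothetical, written against the fastai v1 API; the fix that was eventually merged may differ in its details):

    import numpy as np
    import torch
    from fastai.basic_data import DatasetType
    from fastai.basic_train import Learner

    class RNNLearner(Learner):
        ...
        def get_preds(self, ds_type=DatasetType.Valid, ordered=True, **kwargs):
            "Return predictions (and targets), optionally restored to the original order."
            preds = super().get_preds(ds_type=ds_type, **kwargs)
            if ordered and hasattr(self.dl(ds_type), 'sampler'):
                sampler = list(self.dl(ds_type).sampler)         # sampler[i] = original index of row i
                reverse = torch.from_numpy(np.argsort(sampler))  # inverse permutation
                preds = [p[reverse] for p in preds]
            return preds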

sgugger commented 5 years ago

Fix has been merged so I'm closing this.

narendramukherjee commented 5 years ago

Just wanted to check in about this ordering issue: when I use ordered=True with learn.get_preds() for a text classification task (on a test set), almost all the predicted classes match those from learn.predict(). In my specific use case, I tested it with 20k test examples, and the labels predicted by the two methods disagree on 18 of them. The class probabilities predicted by the two methods are comparable, but not exactly equal. Here are some more details about my use case. FastAI version:

fastai.__version__
'1.0.55'

Training procedure:

  1. Trained a language model on a large text corpus.
  2. Used the vocabulary of the trained language model to build the text classification data bunch:
    data_clas = TextClasDataBunch.from_df(path = "./", train_df = labelled_data_train, 
                                      valid_df = labelled_data_valid, 
                                      text_cols =  "post_title", 
                                      label_cols = "Theme", 
                                      vocab=data_lm.train_ds.vocab, 
                                      bs=128)
  3. Used two different methods for predicting on the test set: a) Add the test set to the TextClasDataBunch and use learn.get_preds for prediction:
    
    data_clas = TextClasDataBunch.from_df(path = "./", train_df = labelled_data_train, 
                                      valid_df = labelled_data_valid, 
                                      test_df = unlabelled_data_to_classify,
                                      text_cols =  "post_title", 
                                      label_cols = "Theme", 
                                      vocab=data_lm.train_ds.vocab, 
                                      bs=64)

    learn.data = data_clas
    learn.get_preds(ds_type=DatasetType.Test, ordered=True)

b) Use `learn.predict` with raw text (in the same form that was fed into the TextClasDataBunch), one example at a time:

learn.predict(unlabelled_data_to_classify["post_title"][0])


The predicted probabilities for the classes are not exactly equal between the two methods, though the predicted classes agree for all but 18 examples. An example of the class probabilities predicted by the two methods:
**From learn.get_preds:** `tensor([3.1438e-03, 1.3936e-03, 4.4555e-03, 6.6271e-06, 5.3898e-05, 1.7909e-02, 9.7046e-01, 3.1595e-06, 1.4689e-04, 1.6489e-04, 2.2673e-03])`
**From learn.predict:** `tensor([2.3362e-03, 1.4350e-03, 3.3952e-03, 6.5755e-06, 4.2005e-05, 1.4958e-02, 9.7555e-01, 2.4424e-06, 1.3988e-04, 1.9556e-04, 1.9414e-03])`

Is there some source of randomness that could produce the difference in the predicted probabilities? Or is it because I am not tokenizing the text before feeding it to `learn.predict`? I did that because the examples on the Inference page made it seem that tokenization happens within the learner, given that it has a TextClasDataBunch (and the rules needed to build it) associated with it. Thanks for any clues!

@sgugger @StatisticDean

sgugger commented 5 years ago

Yes, it's normal that you see some differences. The predictions with get_preds are batched, so padding is applied to make all the texts in a batch the same length. This can induce some small changes compared to predict (which is the one you should trust more).
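
To illustrate what that batching does, here is a toy sketch (the token IDs are made up; 1 is fastai v1's default padding index):

    import torch

    seqs = [torch.tensor([5, 8, 9]), torch.tensor([5, 8])]  # two tokenized texts of different lengths
    max_len = max(len(s) for s in seqs)
    batch = torch.full((len(seqs), max_len), 1, dtype=torch.long)  # fill with pad index 1
    for i, s in enumerate(seqs):
        batch[i, max_len - len(s):] = s                      # pad_first=True: padding goes at the front
    print(batch)  # tensor([[5, 8, 9], [1, 5, 8]])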

narendramukherjee commented 5 years ago

Thanks @sgugger for the explanation. The trouble is that my unlabelled "test set" is pretty big, with more than a million examples, and using predict on an example-by-example basis is pretty slow - that is why I was using get_preds. However, if predict is the more trustworthy method, is there a way to speed it up while getting predictions for a large set of examples?

sgugger commented 5 years ago

Nope, though normally the difference should be nearly nonexistent, since the linear decoder ignores the tokens coming from padding. Are you sure you properly passed the padding token index (if it's not 1)? Maybe the difference comes from the fact that the padding is done first; can you also try with padding at the end (by passing pad_first=False to TextClasDataBunch)?
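
A sketch of what that call could look like, reusing the data bunch from the earlier comment (pad_idx=1 is the v1 default, shown explicitly here):

    data_clas = TextClasDataBunch.from_df(path="./", train_df=labelled_data_train,
                                          valid_df=labelled_data_valid,
                                          test_df=unlabelled_data_to_classify,
                                          text_cols="post_title", label_cols="Theme",
                                          vocab=data_lm.train_ds.vocab,
                                          pad_first=False,  # pad at the end instead of the front
                                          pad_idx=1,        # padding token index (v1 default)
                                          bs=64)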

narendramukherjee commented 5 years ago

I tried pad_first=False while creating the TextClasDataBunch (passed it as an argument to .from_df()), but that still gave different results from predict and get_preds on a small number of examples (and the class probabilities aren't the same between the two methods).

I am not sure what you mean by properly passing the padding token index - I couldn't find a reference to that in the examples/documentation. Could you link me to an example?