EducationalTestingService / skll

SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
http://skll.readthedocs.org
Other
551 stars 67 forks source link

Buggy interaction in `Learner.predict()` when used via API #621

Closed desilinguist closed 4 years ago

desilinguist commented 4 years ago

There's a bug in Learner.predict() that only surfaces via the API for probabilistic learners.

from skll import Learner
from skll.data import NDJReader

l = Learner('SGDClassifier', probability=True)
train_fs = NDJReader.for_path('examples/iris/train/example_iris_features.jsonlines').read() 
test_fs = NDJReader.for_path('examples/iris/test/example_iris_features.jsonlines').read()
l.train(train_fs, grid_search=False)

Now, if I want to get the most likely labels in memory but write out the class probabilities to disk, I can actually do it the way the API is written:

labels = learner.predict(test_fs, class_labels=True, prediction_prefix='blah')

Doing so actually raises the following error because we try to write out the probabilities even though we end up only computing labels in the code (due to class_labels being True):

/work/skll/skll/learner/__init__.py in predict(self, examples, prediction_prefix, append, class_labels)
   1448                     for example_id, class_probs in zip(example_ids, yhat):
   1449                         print('\t'.join([str(example_id)] +
-> 1450                                         [str(x) for x in class_probs]),
   1451                               file=predictionfh)
   1452                 else:

TypeError: 'numpy.int64' object is not iterable

We should either explicitly disallow this case (it's a little weird to want labels in memory but probabilities on disk) or fix the code so that this work correctly.

The latter would mean:

  1. If learner.probability is True, always write out probabilities else write out the class labels.
  2. If class_labels is True, return class labels else return the class indices.

Thoughts?

desilinguist commented 4 years ago

@mulhod @bndgyawali @aoifecahill ?

mulhod commented 4 years ago

I like the latter option. What happens, though, if both learner.probability is True and class_labels is True? What should be returned/printed out in that scenario?

desilinguist commented 4 years ago

@mulhod I don't understand your question, that's the exact weird interaction I was talking about.

mulhod commented 4 years ago

Maybe I am misunderstanding your proposal. I thought it was proposing a solution for making sense of the various possibilities.

The latter would mean:

1. If learner.probability is True, always write out probabilities else write out the class labels.
2. If class_labels is True, return class labels else return the class indices.

Above I see what should happen in each case where one thing is True. Is this supposed to resolve the conflict?

desilinguist commented 4 years ago

Yes, if you treat them independently as I am proposing, then the above code (where both are True) will work even though it's bit weird. That weird case is the thing that's broken right now. So, it's not 1 or 2, it's 1 and 2.

mulhod commented 4 years ago

Oh, I didn't see the return/write out distinction until now.

mulhod commented 4 years ago

Yeah, I don't like the in-memory vs. written-out-to-disk representation difference. I think it should be disallowed.

ghost commented 4 years ago

i would also say fix the code so that this work correctly.. We do not want two things happening in memory and disk.

desilinguist commented 4 years ago

Fair enough. So:

  1. If class_labels is True, it will write out AND return class labels irrespective of learner.probability.
  2. If class_labels is False and learner.probability is True, it will write out AND return probabilities.
  3. If class_labels is False and learner.probability is False, it will write out AND return class indices.
desilinguist commented 4 years ago

This is not exactly right. We always want to write out class labels and not indices because indices are internal to SKLL. So, 3 is actually:

  1. If class_labels is False and learner.probability is False, it will write out class labels AND return class indices.
desilinguist commented 4 years ago

As part of this, I will set the default value of the class_labels keyword argument for Learner.predict() to be True instead of False since it doesn't make sense to return class indices by default.

💥 This will be breaking change since now to get probabilities as outputs you will explicitly need to set class_labels to be False. 💥