apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Why is the accuracy overestimated? #8601

Closed: mstokes42 closed this issue 5 years ago

mstokes42 commented 6 years ago

I have trained a model and want to see how it performs on both the training and testing data. I'm using a softmax output layer with 2 nodes for a binary classification problem.
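
For context, a minimal sketch of such an output head in the MXNet Symbol API (the issue doesn't include the network definition, so the layers and names here are illustrative):

import mxnet as mx

data = mx.sym.Variable('data')
# ... convolutional and fully connected layers would go here ...
fc = mx.sym.FullyConnected(data, num_hidden=2)   # 2 output nodes, one per class
out = mx.sym.SoftmaxOutput(fc, name='softmax')   # softmax over the 2 classes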

# predict accuracy
acc = mx.metric.Accuracy()

# score on the training data
lenet_model.score(train_iter, acc)
print('Training:', acc)

# score on the validation data, reusing the same metric object
lenet_model.score(val_iter, acc)
print('Testing:', acc)

output = lenet_model.predict(val_iter)

This yields perfect results on the training data and much worse results on the test data:

Training: EvalMetric: {'accuracy': 1.0}
Testing: EvalMetric: {'accuracy': 0.51111111111111107}

However, when I go and look at the individual predictions, I find that this accuracy on the test set is overestimated.

import numpy as np
from sklearn import metrics

# confusion matrix on the training set
y_pred = np.round(lenet_model.predict(train_iter).asnumpy()[:, 1])
y_true = phen[:ntrain]
print(metrics.confusion_matrix(y_true, y_pred))

# confusion matrix on the test set
y_pred = np.round(lenet_model.predict(val_iter).asnumpy()[:, 1])
y_true = phen[ntrain:]
print(metrics.confusion_matrix(y_true, y_pred))

This yields the following confusion matrices:

[[127   0]
 [  0  76]]

[[21 14]
 [13  3]]

I can see that the training set is indeed perfectly classified. The test set, however, only gets 24 of 51 correct, for an accuracy of 47%. Why does the model.score() function tell me that I have achieved 51% accuracy?

eric-haibin-lin commented 6 years ago

Metrics are stateful: the same acc object is reused across both score calls. If you pass a new Accuracy metric, is the estimate correct?

lenet_model.score(val_iter, mx.metric.Accuracy())
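
As an illustration of that statefulness, a minimal standalone sketch of how an Accuracy metric accumulates across update() calls (the data values here are made up):

import mxnet as mx

acc = mx.metric.Accuracy()
# first update: both predictions correct
acc.update([mx.nd.array([0, 1])], [mx.nd.array([[0.9, 0.1], [0.1, 0.9]])])
print(acc.get())   # ('accuracy', 1.0)
# second update: both wrong, but the counts from the first update remain
acc.update([mx.nd.array([1, 0])], [mx.nd.array([[0.9, 0.1], [0.1, 0.9]])])
print(acc.get())   # ('accuracy', 0.5), averaged over all 4 samples
acc.reset()        # clears the accumulated counts
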
mstokes42 commented 6 years ago

I figured out that the problem was the batch_size of my val_iter. The number of validation samples was not a multiple of the batch size, so some samples were being discarded by the model.score call that uses val_iter. By changing the batch size to equal the number of samples, the accuracy comes out correct.
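
For anyone hitting the same problem, a sketch of that fix (X_val and y_val stand in for the actual validation arrays, which aren't shown here):

import mxnet as mx

# one batch containing every validation sample, so nothing is dropped
nval = len(y_val)
val_iter = mx.io.NDArrayIter(X_val, y_val, batch_size=nval)

# alternatively, keep a smaller batch size but pad the last batch instead of
# discarding it ('pad' is the NDArrayIter default; padded duplicates can still
# skew the metric slightly, so a single full-size batch is simplest here)
# val_iter = mx.io.NDArrayIter(X_val, y_val, batch_size=32, last_batch_handle='pad')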

Now I'm having an issue where my model performs well on the training data, but performs worse than random on the test data. I'd be better off taking the output of the classifier and predicting the opposite. When I apply the model to a different held-out validation set, though, the model again performs almost perfectly... what could explain this behavior?

eric-haibin-lin commented 6 years ago

Are you using k-fold cross-validation?

szha commented 6 years ago

@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage.

For general "how-to" questions, our user forum (and Chinese version) is a good place to get help.

leleamol commented 6 years ago

Recommended Labels: "Question","Need Triage","Pending Requester Info"

szha commented 6 years ago

@leleamol @marcoabreu please refrain from manually adding "Need Triage" label while triaging, and be sure to remove the "Need Triage" label after triaging. Thanks.

vishaalkapoor commented 6 years ago

@mstokes42 If your confusion matrices are representative of your train/test split, it looks like you're using very few (~100) training examples with what appears to be LeNet-5. This is about two orders of magnitude too little data; in LeCun98, LeNet-5 was trained on MNIST (60k 32x32 training images).

The reason you're getting 100% accuracy on your train set and terrible accuracy on your test set is that LeNet-5 is memorizing your training data and not learning enough to generalize to your test set.

Take a look at the convolutional neural networks chapter of the Gluon tutorials for examples with appropriately sized training sets: https://gluon.mxnet.io/chapter04_convolutional-neural-networks/cnn-scratch.html For example:

import numpy as np
from mxnet import nd, gluon

batch_size = 64
num_inputs = 784
num_outputs = 10

# scale pixel values to [0, 1] and move the channel axis to the front (HWC -> CHW)
def transform(data, label):
    return nd.transpose(data.astype(np.float32), (2, 0, 1)) / 255, label.astype(np.float32)

train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True, transform=transform),
                                   batch_size, shuffle=True)
test_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=False, transform=transform),
                                  batch_size, shuffle=False)
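
As a quick sanity check, each batch from these loaders should be a (batch_size, 1, 28, 28) image tensor with a matching label vector:

for data, label in train_data:
    print(data.shape, label.shape)   # (64, 1, 28, 28) (64,)
    break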

@mstokes42 I believe this is the issue you're experiencing, please reopen if it's not. @yzhliu Would you be able to close this one off? Thanks!

vandanavk commented 5 years ago

@sandeep-krishnamurthy Please close this issue based on @vishaalkapoor's comments above.

@mstokes42 Please feel free to reopen this issue if it has been closed in error.