Hi @mfriedel, happy to help get to the bottom of this. I think the issue here may be overfitting in the CreateML-produced model. It could be that no train/test split was performed -- the default flow through the UI (in the current Beta) appears to be to use the entire dataset for training/validation unless you specifically create a train/test split. Suppose the example validation image you're seeing 1.0 predictions on was in the validation set, which the CreateML model overfit to -- would that explain what you're seeing?
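One quick way to check for that on the Turi Create side is to hold out a test set explicitly before training and compare train vs. test accuracy. A minimal sketch, assuming your labeled images are already in an SFrame called `labeled_images` with a `label` column:

```python
import turicreate as tc

# Assumes `labeled_images` is an SFrame with an 'image' column and a 'label' column.
train_data, test_data = labeled_images.random_split(0.8, seed=42)

model = tc.image_classifier.create(train_data, target='label')

# A large gap between these two numbers usually points to overfitting.
print('train accuracy:', model.evaluate(train_data)['accuracy'])
print('test accuracy: ', model.evaluate(test_data)['accuracy'])
```

For the comparison to be apples-to-apples, the Create ML side would need an equivalent held-out set that never goes into training.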
A couple more things that might shed more light on what's going on:
If the model is reporting 100% confidence for some class on all input images, that is almost certainly a bug somewhere, and I'll be happy to help track it down.
Thank you, @znation! I agree that overfitting is potentially a problem here.
A few more notes:
I did manually create a train/test split in CreateML, and the only choice that seems to be available for Validation is Automatic. I am 99% sure the validation image I used for the numbers posted above was not in the train/test split. My training data came from Google's Open Images Dataset v4, and the validation images were downloaded from unsplash.com. I got 100% confidence predictions for three different validation images from Unsplash (one of a rabbit, one of a snake, and one of a fish).
I'm going to test all of this again, making sure that my train/test split for both models is identical, manually pulling out some validation images, and then report back.
@znation An update:
Definitely got 100% prediction probabilities on validation data that was not used in the training set. This was not true for all images, but it definitely happened for some. Interestingly, I also saw this behavior with a CoreML2 model I trained using TuriCreate v5.4, something I hadn't seen previously.
Some hypotheses:
Also potentially of note: I did a software update yesterday for another reason, so I'm now on 10.15 Beta (19A536g). Not sure if that changed anything significant.
Thanks again for the response. Look forward to hearing your thoughts!
@mfriedel Interesting, that definitely sheds some light on the situation. By any chance, are the two models (Turi Create and Create ML) using the same feature preprocessor on the images? The default in Create ML is to use Vision Feature Print, a built-in featurizer on Apple platforms, while the default when training in Turi Create is the cross-platform resnet-50 featurizer. Training in Turi Create with model="VisionFeaturePrint_Scene" may give results more similar to Create ML.
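For example (a rough sketch, assuming a labeled SFrame `train_data` with a `label` column; the export file name is a placeholder):

```python
import turicreate as tc

# Default: cross-platform resnet-50 featurizer.
model_resnet = tc.image_classifier.create(train_data, target='label', model='resnet-50')

# Uses the same Vision Feature Print featurizer as Create ML (Apple platforms only).
model_vfp = tc.image_classifier.create(train_data, target='label', model='VisionFeaturePrint_Scene')

# Export for a head-to-head comparison with the Create ML model.
model_vfp.export_coreml('Animals_VFP.mlmodel')
```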
Hi @znation. We definitely used resnet-50 in the Turi Create model, so that is almost certainly contributing to what we are seeing. If I see anything else of note, I'll drop it in here. Looking forward to continuing to unpack both tools as they evolve.
I see, thanks @mfriedel! I'm going to close this issue for now, since it's not clear there is a bug here. Please reopen if you're able to get different predictions between Turi Create and CoreML for the same model, or if you're getting seemingly wrong/bad predictions from either Turi Create or Create ML models.
Hey everyone!
I’ve been playing around with CreateML on the beta version of Catalina, and I’ve been benchmarking my model results against a model we built with TuriCreate v5.4 on Google Colab a few months ago. I’ve built a multi-class image classification model (6 classes to predict different animal types).
Basically what I am seeing is this: On the same example validation image, the CoreML3 model that I built with CreateML will give the following types of prediction results for six classes:
Conversely, the CoreML2 model, built with TuriCreate, gives more reasonable (in my opinion) results for the same image.
Obviously the fish is predicted with a lower probability by the CoreML2 model (bad), but the remaining classes have non-zero probabilities (which feels more realistic to me).
Relevant details:
Note that I am using the following in Xcode to check the numbers above:
The example validation images were taken from a different data source (unsplash) than the training images (Open Images Dataset v4), and while I did not visually inspect them to confirm this, the validation images are likely not part of the training dataset.
I validated this on a few images, and the fact that the CoreML3 artifact built with CreateML gives 100% prediction probabilities feels a little suspicious to me. Is this expected because the neural network architecture is now different with Catalina/iOS 13, or is something else going on here? I know that adding more data may help if the model is overfitting, but given that this was not an issue we saw previously, I’m not sure that is the only solution. Have you guys seen anything like this before?
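In case it helps with reproducing: the same comparison can also be run outside Xcode by querying both exported .mlmodel files directly with coremltools. A rough sketch; the file names, input size, and the 'image' / 'classLabelProbs' feature names are assumptions and may differ for the actual models (check model.get_spec() for the real ones):

```python
import coremltools
from PIL import Image

# Assumption: 224x224 input; the actual size depends on the model's spec.
img = Image.open('validation/rabbit.jpg').resize((224, 224))

# Note: running predictions through coremltools requires macOS.
for path in ['CreateMLAnimals.mlmodel', 'TuriCreateAnimals.mlmodel']:
    model = coremltools.models.MLModel(path)
    prediction = model.predict({'image': img})
    print(path, prediction.get('classLabelProbs', prediction))
```

If the CreateML model still returns 1.0 for a single class here while the Turi Create model does not, that would at least rule out the Swift/Vision call path as the source of the difference.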