Hi @mfriedel, happy to help get to the bottom of this. I think the issue here may be overfitting in the CreateML-produced model. It could be that no train/test split was performed -- the default flow through the UI (in the current Beta) appears to be to use the entire dataset for training/validation unless you specifically create a train/test split. Suppose the example validation image you're seeing 1.0 predictions on was in the validation set, which the CreateML model overfit to -- would that explain what you're seeing?
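One quick way to check for that on the Turi Create side is to hold out a test set explicitly before training and compare train vs. test accuracy. A minimal sketch, assuming your labeled images are already in an SFrame called `labeled_images` with a `label` column:

```python
import turicreate as tc

# Assumes `labeled_images` is an SFrame with an 'image' column and a 'label' column.
train_data, test_data = labeled_images.random_split(0.8, seed=42)

model = tc.image_classifier.create(train_data, target='label')

# A large gap between these two numbers usually points to overfitting.
print('train accuracy:', model.evaluate(train_data)['accuracy'])
print('test accuracy: ', model.evaluate(test_data)['accuracy'])
```

For the comparison to be apples-to-apples, the Create ML side would need an equivalent held-out set that never goes into training.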
A couple more things that might shed more light on what's going on:
If the model is reporting 100% confidence for some class on all input images, that is almost certainly a bug somewhere, and I'll be happy to help track it down.
Thank you, @znation! I agree that overfitting is potentially a problem here.
A few more notes:
I did manually create a train/test split in CreateML, and the only choice that seems to be available for Validation is Automatic. I am 99% sure the validation image I used for the numbers posted above was not in the train/test split. My training data came from Google's Open Images Dataset v4, and the validation images were downloaded from unsplash.com. I got 100% confidence predictions for three different validation images from Unsplash (one of a rabbit, one of a snake, and one of a fish).
I'm going to test all of this again, making sure that my train/test split for both models is identical, manually pulling out some validation images, and then report back.
@znation An update:
Definitely got 100% prediction probabilities on validation data that was not used in the training set. This was not true for all images, but it definitely happened for some. Interestingly, I also saw this behavior with a CoreML2 model I trained using TuriCreate v5.4, something I hadn't seen previously.
Some hypotheses:
Also potentially of note: I did a software update yesterday for another reason, so I'm now on 10.15 Beta (19A536g). Not sure if that changed anything significant.
Thanks again for the response. Look forward to hearing your thoughts!
@mfriedel Interesting, that definitely sheds some light on the situation. By any chance, are the two models (Turi Create and Create ML) using the same feature preprocessor on the images? The default in Create ML is to use Vision Feature Print, a built-in featurizer on Apple platforms, while the default when training in Turi Create is the cross-platform resnet-50 featurizer. Training in Turi Create with model="VisionFeaturePrint_Scene" may give results more similar to Create ML.
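For example (a rough sketch, assuming a labeled SFrame `train_data` with a `label` column; the export file name is a placeholder):

```python
import turicreate as tc

# Default: cross-platform resnet-50 featurizer.
model_resnet = tc.image_classifier.create(train_data, target='label', model='resnet-50')

# Uses the same Vision Feature Print featurizer as Create ML (Apple platforms only).
model_vfp = tc.image_classifier.create(train_data, target='label', model='VisionFeaturePrint_Scene')

# Export for a head-to-head comparison with the Create ML model.
model_vfp.export_coreml('Animals_VFP.mlmodel')
```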
Hi @znation. We definitely used resnet-50 in the Turi Create model, so that is almost certainly contributing to what we are seeing. If I see anything else of note, I'll drop it in here. Looking forward to continuing to unpack both tools as they evolve.
I see, thanks @mfriedel! I'm going to close this issue for now, since it's not clear there is a bug here. Please reopen if you're able to get different predictions between Turi Create and CoreML for the same model, or if you're getting seemingly wrong/bad predictions from either Turi Create or Create ML models.
Hey everyone!
I’ve been playing around with CreateML on the beta version of Catalina, and I’ve been benchmarking my model results against a model we built with TuriCreate v5.4 on Google Colab a few months ago. I’ve built a multi-class image classification model (6 classes to predict different animal types).
Basically what I am seeing is this: On the same example validation image, the CoreML3 model that I built with CreateML will give the following types of prediction results for six classes:
Conversely, the CoreML2 model, built with TuriCreate, gives more reasonable (in my opinion) results for the same image.
Obviously the fish is predicted with a lower probability by the CoreML2 model (bad), but the remaining classes have non-zero probabilities (which feels more realistic to me).
Relevant details:
Note that I am using the following in Xcode to check the numbers above:
The example validation images were taken from a different data source (unsplash) than the training images (Open Images Dataset v4), and while I did not visually inspect them to confirm this, the validation images are likely not part of the training dataset.
I validated this on a few images, and the fact that the CoreML3 artifact built with CreateML gives 100% prediction probabilities feels a little suspicious to me. Is this expected because the neural network architecture is now different with Catalina/iOS 13, or is something else going on here? I know that adding more data may help if the model is overfitting, but given that this was not an issue we saw previously, I’m not sure that is the only solution. Have you guys seen anything like this before?
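In case it helps with reproducing: the same comparison can also be run outside Xcode by querying both exported .mlmodel files directly with coremltools. A rough sketch; the file names, input size, and the 'image' / 'classLabelProbs' feature names are assumptions and may differ for the actual models (check model.get_spec() for the real ones):

```python
import coremltools
from PIL import Image

# Assumption: 224x224 input; the actual size depends on the model's spec.
img = Image.open('validation/rabbit.jpg').resize((224, 224))

# Note: running predictions through coremltools requires macOS.
for path in ['CreateMLAnimals.mlmodel', 'TuriCreateAnimals.mlmodel']:
    model = coremltools.models.MLModel(path)
    prediction = model.predict({'image': img})
    print(path, prediction.get('classLabelProbs', prediction))
```

If the CreateML model still returns 1.0 for a single class here while the Turi Create model does not, that would at least rule out the Swift/Vision call path as the source of the difference.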