apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.

std::runtime_error: BNNS error when trying to do training #490

Closed johanlantz closed 3 years ago

johanlantz commented 5 years ago

❓Question

I am trying to move a TensorFlow Keras model from running on a server to the device.

I started off converting the model with coremltools and it worked like a charm. I could make predictions in no time; thanks for the great framework.

My issue started yesterday when I also wanted to start doing personalization/training on the device.

I updated the conversion script so the model would also be trainable. This worked fine, and I can see all the new info in Xcode under Update and Parameters for my model. All looks fine.

However, when trying to use MLUpdateTask, it always crashes, and the only output I get is: std::runtime_error: BNNS error

I have created a minimal example, in theory removing anything specific to my use case, but it keeps producing the same error.

I have tried the emoji drawing example and it runs fine on the same device, so it must be either that I am doing something wrong or that something in the converted model is not compatible. Everything compiles fine, however.

I am not using MLFeatureValue directly but the generated iemocapTrainingInput: MLFeatureProvider, assuming this is OK.

My minimal example looks like this (it fails in the same way as when running with my real training data):

static func minExample() {
    // Input as 1x251x168 empty MLMultiArray
    guard let features = try? MLMultiArray(shape: [1, 251, 168], dataType: MLMultiArrayDataType.float32) else {
        fatalError("Unexpected runtime error when creating MLMultiArray")
    }

    // Output as single value (2) for testing
    guard let trainLabel = try? MLMultiArray([2]) else {
        fatalError("Unexpected error when creating trainLabel")
    }

    let trainingSample = iemocapTrainingInput(input1: features, output1_true: trainLabel)

    var trainingSamples = [MLFeatureProvider]()
    trainingSamples.append(trainingSample)
    let updatableModelURL = Bundle.main.url(forResource: "iemocap",
                                            withExtension: "mlmodelc")!

    do {
        let updateTask = try MLUpdateTask(forModelAt: updatableModelURL,
                                          trainingData: MLArrayBatchProvider(array: trainingSamples),
                                          configuration: nil) { _ in
            print("Completed training")
        }
        updateTask.resume()
    } catch {
        print("Failed to start update task")
    }
}

The model description looks like this: [screenshot]

and the crash always occurs here: [screenshot]

When I connected a ProgressHandler as shown in your example here: https://github.com/apple/coremltools/blob/master/examples/updatable_models/OnDeviceTraining_API_Usage.md

I got one callback for .trainingBegin, but then it crashed, so it seems to find the model and start doing something before the BNNS error.
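For reference, attaching the handlers looks roughly like this (a sketch based on the linked example; updatableModelURL and trainingSamples are the same values as in the minimal example above):

let handlers = MLUpdateProgressHandlers(
    forEvents: [.trainingBegin, .epochEnd],
    progressHandler: { context in
        // Fires for each requested event; in my case only .trainingBegin arrives before the crash.
        print("Progress event: \(context.event)")
    },
    completionHandler: { _ in
        print("Completed training")
    })

do {
    let updateTask = try MLUpdateTask(forModelAt: updatableModelURL,
                                      trainingData: MLArrayBatchProvider(array: trainingSamples),
                                      configuration: nil,
                                      progressHandlers: handlers)
    updateTask.resume()
} catch {
    print("Failed to start update task")
}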

I have been at this for two days now and am running out of ideas, so all suggestions are welcome.

Thanks in advance


johanlantz commented 5 years ago

This is the model structure in case it helps.

[screenshot]

anilkatti commented 5 years ago

@johanlantz could you please share the script you used to mark this model updatable? The first thing that seems odd is that output1 and output1_true have different dimensions.

johanlantz commented 5 years ago

@anilkatti Thanks for the quick reply; going a bit nuts here.

The conversion script simply looks like this:

import coremltools
print(coremltools.converters)
coreml_model = coremltools.converters.keras.convert('best_model_iemocap_augmentation', respect_trainable=True)
coreml_model.save('iemocap.mlmodel')

For the prediction part (which does work), the input is the 1x251x168 MLMultiArray and the output is a vector with 4 classes. In this case it tries to infer the mood from an audio file and categorize it into 1 of 4 options. This works exactly like the Keras/TF version in Python; the prediction values are identical.

For the update part, I assume that each 1x251x168 input would need to match an index [0..3] of the array above from the prediction; therefore the output_true taking just one value seemed to make sense, but I could be wrong (I can check again with the model author).

Super happy to help in any way I can. This is all new to me, so apologies in advance for any inconsistency or silly question.

anilkatti commented 5 years ago

@johanlantz thanks for sharing that script - seems reasonable to me. This is definitely not a silly question :) It could be a bug in the convert method or in the framework - we will get to the bottom of this. I have one follow-up question:

Based on "infer the mood from an audio file and categorize it into 1 of 4 options", the model seems to be a classifier; however, classifiers always output either a string or int class label. output1 is a 4-element multi-array, which seems odd. Is best_model_iemocap_augmentation a classifier?

anilkatti commented 5 years ago

@johanlantz assuming your model is a classifier, may I suggest following the steps in this Jupyter notebook to specify class_labels and predicted_feature_name when invoking convert? Look for convert_keras_to_mlmodel in:

https://github.com/apple/coremltools/blob/master/examples/updatable_models/updatable_mnist.ipynb
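For this model, the convert call might look something like this (a sketch; the four label strings are placeholders for whatever classes the model was actually trained on):

import coremltools

# Sketch: convert as a classifier. The label names below are placeholders;
# use the actual class labels from the training setup.
coreml_model = coremltools.converters.keras.convert(
    'best_model_iemocap_augmentation',
    class_labels=['class_0', 'class_1', 'class_2', 'class_3'],
    predicted_feature_name='classLabel',
    respect_trainable=True)
coreml_model.save('iemocap.mlmodel')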

johanlantz commented 5 years ago

@anilkatti I will talk to the researcher tomorrow (I am approaching this from an engineering perspective, so I lack some context).

From what I have done so far, I can see that from my input (a segment of processed audio) I get an array of probabilities back from the model, for instance: [0.119403765, 0.029491829, 0.08774426, 0.7633601]

For the prediction, this was exactly what I needed and it also matched the output from the model in the Python script.

I will collect more information tomorrow and also try to add your advice from the notebook.

Many thanks for the assistance, I will get back to you as soon as possible.

johanlantz commented 5 years ago

Hi @anilkatti

I have talked to the researcher, and as you suspected, his expectation is that the output should also be a vector of size 4, which is what he sees when running training with Keras. His comment was that it seems as if argmax is applied at the end. Is there any way that could come from the conversion?

The only relevant line I can spot in the conversion output is this one:

Now adding input output1_true as target for categorical cross-entropy loss layer.

I received an untrained model from my colleague today and created a bootstrapped minimal example that reproduces the problem. I have uploaded it here: https://drive.google.com/file/d/1Il0_xuXF2_JwLzdJzSjX15vPiv1Abxy3/view?usp=sharing

It is for iOS since I have not yet updated macOS.

In the KerasModel folder you can run the Python conversion script, and then there is a standalone app with just two buttons, one for training and one for inference. Pressing "Train" produces the error.

I will continue to investigate, but if you have time to download the minimal sample app, you will probably spot the issue much faster.

anilkatti commented 5 years ago

@johanlantz I could repro the issue with your sample code. I will share my findings soon.

anilkatti commented 5 years ago

@johanlantz, sorry about the delay. I discovered a bug in the framework that is causing the crash. In short, it is failing to propagate gradients past the first conv layer from the end (conv2d_15). We are working on a fix at the framework level.

I can think of one simple workaround, but I am not sure if it works with your use case. The idea is to explicitly mark the last two "fully connected" layers as updatable so the gradient does not have to flow past the conv layer. I have some sample code that does that below. I've verified that on-device update is successful with this new model.

I'd be happy to discuss more about your specific use case and present other alternatives.

import coremltools

spec = coremltools.utils.load_spec("iemocap.mlmodel")

# First clear the updatable flag on every layer...
for layer_spec in spec.neuralNetwork.layers:
    layer_spec.isUpdatable = False

# ...then mark only the last two fully connected layers as updatable.
builder = coremltools.models.neural_network.NeuralNetworkBuilder(spec=spec)
builder.make_updatable(["dense_10", "dense_9"])

builder.inspect_layers()

coremltools.utils.save_spec(builder.spec, "new_iemocap.mlmodel")

anilkatti commented 5 years ago

Here's the new model for your reference.

new_iemocap.mlmodel.zip

johanlantz commented 5 years ago

@anilkatti Thank you so much for the effort. I will dig into this first thing tomorrow. (I am in full-day training Tuesday through Thursday, so apologies in advance if there is a delay in my response.)

johanlantz commented 5 years ago

Hi @anilkatti

I have tried to use the supplied model with my test app, but exactly the same thing happens (BNNS error).

Then I tried using this when converting the model (which I hope is close to what you suggested):

import coremltools

coreml_model = coremltools.converters.keras.convert('best_model_iemocap_augmentation')
coreml_model.save('iemocap.mlmodel')

spec = coremltools.utils.load_spec("iemocap.mlmodel")

for layer_spec in spec.neuralNetwork.layers:
    layer_spec.isUpdatable = False

builder = coremltools.models.neural_network.NeuralNetworkBuilder(spec=spec)
builder.make_updatable(["dense_2", "dense_1"])

builder.inspect_layers()

coremltools.utils.save_spec(builder.spec, "iemocap.mlmodel")

but if I do that I get the following error in Xcode: [screenshot]

The output from the python script is:

[Id: 13], Name: dense_2__activation__ (Type: softmax)
          Updatable: False
          Input blobs: [u'dense_2_output']
          Output blobs: [u'output1']
[Id: 12], Name: dense_2 (Type: innerProduct)
          Updatable: True
          Input blobs: [u'dense_1__activation___output']
          Output blobs: [u'dense_2_output']
[Id: 11], Name: dense_1__activation__ (Type: activation)
          Updatable: False
          Input blobs: [u'dense_1_output']
          Output blobs: [u'dense_1__activation___output']
[Id: 10], Name: dense_1 (Type: innerProduct)
          Updatable: True
          Input blobs: [u'flatten_1_output']
          Output blobs: [u'dense_1_output']
[Id: 9], Name: flatten_1 (Type: flatten)
          Updatable: False
          Input blobs: [u'max_pooling2d_3_output']
          Output blobs: [u'flatten_1_output']
[Id: 8], Name: max_pooling2d_3 (Type: pooling)
          Updatable: False
          Input blobs: [u'conv2d_3__activation___output']
          Output blobs: [u'max_pooling2d_3_output']
[Id: 7], Name: conv2d_3__activation__ (Type: activation)
          Updatable: False
          Input blobs: [u'conv2d_3_output']
          Output blobs: [u'conv2d_3__activation___output']
[Id: 6], Name: conv2d_3 (Type: convolution)
          Updatable: False
          Input blobs: [u'max_pooling2d_2_output']
          Output blobs: [u'conv2d_3_output']
[Id: 5], Name: max_pooling2d_2 (Type: pooling)
          Updatable: False
          Input blobs: [u'conv2d_2__activation___output']
          Output blobs: [u'max_pooling2d_2_output']
[Id: 4], Name: conv2d_2__activation__ (Type: activation)
          Updatable: False
          Input blobs: [u'conv2d_2_output']
          Output blobs: [u'conv2d_2__activation___output']
[Id: 3], Name: conv2d_2 (Type: convolution)
          Updatable: False
          Input blobs: [u'max_pooling2d_1_output']
          Output blobs: [u'conv2d_2_output']
[Id: 2], Name: max_pooling2d_1 (Type: pooling)
          Updatable: False
          Input blobs: [u'conv2d_1__activation___output']
          Output blobs: [u'max_pooling2d_1_output']
[Id: 1], Name: conv2d_1__activation__ (Type: activation)
          Updatable: False
          Input blobs: [u'conv2d_1_output']
          Output blobs: [u'conv2d_1__activation___output']
[Id: 0], Name: conv2d_1 (Type: convolution)
          Updatable: False
          Input blobs: [u'input1']
          Output blobs: [u'conv2d_1_output']

which I think is in line with your suggestion of marking only the two last layers as updatable.

Am I missing something obvious?

anilkatti commented 5 years ago

@johanlantz It is puzzling! As a quick test, could you try saving the model to a different file? Also, could you share the versions of your coremltools (pip list) and Xcode (About section)?

johanlantz commented 5 years ago

@anilkatti I tried saving the resulting model under another name, but the result is the same.

I attach the complete reference project, which includes the conversion scripts in the KerasModel folder. The one called converter_limited.py tries to limit updatable to only the last two layers, while converter.py is the normal one using respect_trainable.

Results on my side:

I have just updated to Catalina, so the warning about version 3 vs version 4 when running the conversion is now gone, but it did not affect the result when running on device.

My iPhone 6S is running 13.2 and the Xcode version is Version 11.2 beta (11B41).

coremltools is version 3.0. The project with the new conversion script can be found here, but nothing has really changed: https://drive.google.com/open?id=1odibLJVXeGJ-NkSBDlSEJ_7LNiK-gdtl

Anything I can provide, just shout.

anilkatti commented 5 years ago

@johanlantz sorry, I misread your script. Let me clarify: I still want you to use respect_trainable=True, but then explicitly set the last two layers as updatable.

The Keras-to-Core ML converter takes care of translating the optimizer and loss from the Keras model to the Core ML model. Let me know if that works.

import coremltools

coreml_model = coremltools.converters.keras.convert('best_model_iemocap_augmentation', respect_trainable=True)
coreml_model.save('iemocap.mlmodel')

spec = coremltools.utils.load_spec("iemocap.mlmodel")

for layer_spec in spec.neuralNetwork.layers:
    layer_spec.isUpdatable = False

builder = coremltools.models.neural_network.NeuralNetworkBuilder(spec=spec)
builder.make_updatable(["dense_2", "dense_1"])

builder.inspect_layers()

coremltools.utils.save_spec(builder.spec, "iemocap.mlmodel")

johanlantz commented 5 years ago

@anilkatti OK, understood; I misread the instructions. Re-adding respect_trainable removed the error when loading the model in Xcode.

I still get the BNNS exception just as before. I have tried using both the sample app and the real app with the real model, but both behave the same with this new approach of making only the two last layers trainable.

The test project with the corrected script is here: https://drive.google.com/open?id=1L_7kG0tb3Hy62PJ0rd7Gwg-29V080oIv

So unfortunately I am still stuck with the same issue. It is super weird if the same project works on your end.

anilkatti commented 5 years ago

It is super weird. I downloaded your project and ran it just now and got "Completed training". No crash.

There are a couple of differences. My Xcode is a couple of builds newer (Version 11.2 beta 2 (11B44)); I downloaded it from developer.apple.com. And I have been testing on a simulator. The simulator runtime should match that of the device, but I will try on a real device with iOS 13.2 next.

Could you quickly try on an iOS simulator and post your observation?

johanlantz commented 5 years ago

You are right: with the iPhone simulator, training completes 😮

I have only tested on my 6S; tomorrow at work I will try on a newer device.

johanlantz commented 5 years ago

@anilkatti I have retested on an iPhone 11 running iOS 13 and it works.

The same test on my iPhone 6S running 13.2 beta 2 consistently crashes with the BNNS error.

johanlantz commented 5 years ago

While we await findings on the two issues (the conversion and the iPhone 6S BNNS issue), I continued my investigation and stumbled into the next blocker.

With training working, I would be able to do personalization. However, one of my objectives is to explore federated learning; to do this, I would need access to at least the weights after training completes.

So far, however, I have not managed to find a way to accomplish this. I can see a reference in the docs to .weights here: https://developer.apple.com/documentation/coreml/mlparameterkey/3362530-weights

But I have not been successful in getting any value out for this MLParameterKey (I might be doing it wrong).

Then I was thinking that perhaps I could save the updated MLModel and send it to the server for processing with coremltools, but here: https://developer.apple.com/documentation/coreml/core_ml_api/personalizing_a_model_with_on-device_updates it states that it is the compiled model that is provided in the update, so I have no way of intermediately storing or accessing the .mlmodel in its uncompiled format.

I think there were mentions of federated learning in one of the WWDC talks, but if I am not able to access the updated model (or at least the updated weights), I am not sure how to accomplish that.

Do you have any idea if this is possible? (sorry for changing the topic but I was not sure where to post this question).

anilkatti commented 5 years ago

I had a similar experience last night while testing on an iPhone XR (no BNNS crash). Could you please file a bug report (bugreport.apple.com) to track the iPhone 6S issue so I can get the right folks to look into it?

Re: accessing weights: you should be able to access the updated weights in the context of a model update using the MLParameterKey.weights key.

You can request the weights from the MLModel instance you receive in the MLUpdateContext during the model update callbacks. I am going to share some Objective-C sample code here. You could do something like this in the trainingBegin progress handler (for before-update weights) and in the update task completion handler (for after-update weights).

NSError *error = nil;
MLMultiArray *weights = [context.model parameterValueForKey:[MLParameterKey.weights scopedTo:@"dense_1"] error:&error];

Let me know if that makes sense. I can share better swifty sample code if required.
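For reference, a rough Swift equivalent might look like this (a sketch; "dense_1" is just the example layer name from above):

func logWeights(from context: MLUpdateContext) {
    // Scope the generic weights key to a specific updatable layer by name.
    let key = MLParameterKey.weights.scoped(to: "dense_1")
    do {
        let weights = try context.model.parameterValue(for: key)
        print("dense_1 weights: \(weights)")
    } catch {
        print("Failed to read weights: \(error)")
    }
}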

johanlantz commented 5 years ago

@anilkatti Thanks for the feedback. I created a ticket with id 7403222 in Feedback Assistant for the BNNS issue on my iPhone 6S, referring to this thread.

No worries about the sample code just yet. I tried what you suggested today, but I did it in the completion handler and .weights was not available as far as I could see. I will try the progress handler approach first thing tomorrow morning.

anilkatti commented 5 years ago

"I tried what you suggested today, but I did it in the completion handler and .weights was not available as far as I could see."

Did you pass in the layer name as scope for MLParameterKey?

johanlantz commented 5 years ago

I did not :-) I just made a quick test, and I did manage to get the weights out of a layer whose name I knew.

Is there a way to list the layer names, or would I have to hardcode them?

The BNNS issue happened on the simulator as well now, and it worked on the iPhone 6S a few times, so that issue seems a bit tricky. I picked up the iPhone 11 and the same code still runs fine there. But let's ignore that one for now.

I am getting quite close to accomplishing what I wanted for this experiment; thank you so much for all the help. Tomorrow I will hook things up to see if I get sensible output (the scalar vs. 4-element array question for output1_true).

anilkatti commented 5 years ago

Currently, there is no way to get all the layer names at runtime. You might have to get that information out of band along with the model. I also want to highlight that the model provides "weights" as a parameter only for those layers that are marked as "updatable" in the proto, since we did not see any practical use for non-updatable layers.
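For example, you could dump the updatable layer names from the spec at conversion time and ship that list alongside the model (a sketch against the iemocap spec used earlier in this thread):

import coremltools

spec = coremltools.utils.load_spec("iemocap.mlmodel")

# Collect the names of the layers marked updatable so the app can reference them at runtime.
updatable_layers = [layer.name for layer in spec.neuralNetwork.layers if layer.isUpdatable]
print(updatable_layers)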

If you think getting layer names at runtime via an API would be really helpful, please file a bug :) we love feature requests! Also, let us know if you can think of anything else. For example, would getting a diff w.r.t. the original weights in the completionHandler help?

I will keep you updated about 7403222. Good luck!

JacopoMangiavacchi commented 4 years ago

I have a similar model where I'm able to retrain the final InnerProduct layers, but if I also set the isUpdatable flag to true on the Convolution layers (and their weights/bias), I always get the std::runtime_error: BNNS error as soon as I start the MLUpdateTask.

Is there any update on this issue, or perhaps a sample showing how to train Convolution layers?

More than happy to share the model and source code if they would help further investigation.

mrfarhadi commented 4 years ago

@JacopoMangiavacchi The std::runtime_error: BNNS error does not stop the model from training. It is actually not an error but a spurious warning, and it will go away in a new version of iOS. It does not block training; you should be able to train the conv layer even if you see this error message. Is that not the case? Is the model still unable to update the conv layer?

JacopoMangiavacchi commented 4 years ago

Thank you @mrfarhadi. Unfortunately, the training starts but then stops immediately with this specific exception:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: BNNS error

The model is available at: https://github.com/JacopoMangiavacchi/CNN-CoreML-Retrainable/raw/master/MNIST_Model.mlmodel

If eventually you could further investigate the source code for generating the model is here: https://github.com/JacopoMangiavacchi/MNIST-CoreML-Training/blob/master/MNIST-CoreML-Training/MNIST.swift

Caveat: this project is not using the Python coremltools package but a Swift library I made directly using the protobuf messages in this repo.

mrfarhadi commented 4 years ago

@JacopoMangiavacchi I am sorry, I confused your issue with something else. It seems the training does stop in your case.

Looking at your model, it should be OK to mark the conv layer as updatable as well.

Can you file a bug report (bugreport.apple.com) and include some sample data and the code you use to update the model?

JacopoMangiavacchi commented 4 years ago

@mrfarhadi submitted feedback/bug FB7655774

mrfarhadi commented 4 years ago

@JacopoMangiavacchi Thanks for reporting. I took a look at your model, and with a small tweak I could train it successfully on a real dataset, with the loss converging. Here is how I changed your model and trained it:

Your model is a NeuralNetwork, not a NeuralNetworkClassifier. This is OK, but since you are not using coremltools to make it updatable, you are missing some details. Look at this line in set_categorical_cross_entropy_loss: https://github.com/apple/coremltools/blob/master/coremltools/models/neural_network/builder.py#L692

The shape of the trainingInput for the true label is 1, so you need to pass the trainingInput as such. Your code declares the true label as TrainingInput(name: "output_true", shape: [10]); you need to change the trainingInput dimension to 1 and pass an actual value (e.g., 3). If you use the coremltools methods, you do not need to worry about this in your modelDescription, as it is populated automatically based on the model type. If you want to keep the trainingInput with the same dimension, you can instead change your model to a NeuralNetworkRegressor and use a mean squared error loss.

So, in summary:

1. Change the trainingInput shape for 'output_true' to [1].
2. Pass training data with shape [1] for the labels.
3. Consider using the make_updatable and set_categorical_cross_entropy_loss methods in coremltools to make sure these details are set correctly (see the sketch below).
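A minimal sketch of step 3, assuming your model file is MNIST_Model.mlmodel, with the inner product layers named hidden1 and hidden2 and the softmax output named output:

import coremltools
from coremltools.models.neural_network import NeuralNetworkBuilder, SgdParams

spec = coremltools.utils.load_spec("MNIST_Model.mlmodel")
builder = NeuralNetworkBuilder(spec=spec)

# Mark the trainable layers and attach a cross-entropy loss to the softmax output.
# set_categorical_cross_entropy_loss also adds the training input for the true
# label with shape [1], which is exactly the detail discussed above.
builder.make_updatable(["hidden1", "hidden2"])
builder.set_categorical_cross_entropy_loss(name="lossLayer", input="output")

# An updatable model also needs an optimizer and a default number of epochs.
builder.set_sgd_optimizer(SgdParams(lr=0.01, batch=32))
builder.set_epochs(10)

coremltools.utils.save_spec(builder.spec, "MNIST_Model_updatable.mlmodel")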

Hope this helps.

JacopoMangiavacchi commented 4 years ago

Fantastic, totally make sense. Thank you @mrfarhadi I'll update the sample according to your suggestions and I'll keep you updated.

JacopoMangiavacchi commented 4 years ago

@mrfarhadi I was sure I understood your clean instructions as well as the make_updatable and set_categorical_cross_entropy_loss methods but I'm still having the same runtime exception as soon as the train start.

If you have a chance to take a look again I've updated the source repo adding also the mnist train dataset so you can now easily test the app.

I've put also the new model with the right shape for the true labels back at https://github.com/JacopoMangiavacchi/CNN-CoreML-Retrainable/raw/master/MNIST_Model.mlmodel

mrfarhadi commented 4 years ago

@JacopoMangiavacchi Thanks for the modification. Looked at your commit and it looks good. We passed the first issue. I tried your app on my end with the 'current' OS and got the same issue you are facing. However, I can provide you a workaround. First of all keep in mind that this issue will be gone by next OS release. For now, you can cut the first conv layer and your model becomes:

Convolution(name: "conv2",
            input: ["image"],
            output: ["outConv2"],
            outputChannels: 32,
            kernelChannels: 1,
            nGroups: 1,
            kernelSize: [3, 3],
            stride: [1, 1],
            dilationFactor: [1, 1],
            paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
            outputShape: [],
            deconvolution: false,
            updatable: true)
ReLu(name: "relu2",
     input: ["outConv2"],
     output: ["outRelu2"])
Pooling(name: "pooling2",
        input: ["outRelu2"],
        output: ["outPooling2"],
        poolingType: .max,
        kernelSize: [2, 2],
        stride: [2, 2],
        paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                            EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
        avgPoolExcludePadding: true,
        globalPooling: false)
Convolution(name: "conv3",
            input: ["outPooling2"],
            output: ["outConv3"],
            outputChannels: 32,
            kernelChannels: 32,
            nGroups: 1,
            kernelSize: [2, 2],
            stride: [1, 1],
            dilationFactor: [1, 1],
            paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
            outputShape: [],
            deconvolution: false,
            updatable: true)
ReLu(name: "relu3",
     input: ["outConv3"],
     output: ["outRelu3"])
Pooling(name: "pooling3",
        input: ["outRelu3"],
        output: ["outPooling3"],
        poolingType: .max,
        kernelSize: [2, 2],
        stride: [2, 2],
        paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                            EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
        avgPoolExcludePadding: true,
        globalPooling: false)
Flatten(name: "flatten1",
        input: ["outPooling3"],
        output: ["outFlatten1"],
        mode: .last)
InnerProduct(name: "hidden1",
             input: ["outFlatten1"],
             output: ["outHidden1"],
             inputChannels: 1152,
             outputChannels: 500,
             updatable: true)
ReLu(name: "relu4",
     input: ["outHidden1"],
     output: ["outRelu4"])
InnerProduct(name: "hidden2",
             input: ["outRelu4"],
             output: ["outHidden2"],
             inputChannels: 500,
             outputChannels: 10,
             updatable: true)
Softmax(name: "softmax",
        input: ["outHidden2"],
        output: ["output"])

You should be able to train the above model with the current OS.

JacopoMangiavacchi commented 4 years ago

Thank you so much @mrfarhadi. I'm now able to train and validate performance for my real scenario. Btw, I can't wait for the 'next OS release'!!

JacopoMangiavacchi commented 4 years ago

I wanted to confirm that training this MNIST/CNN model from scratch converged and that I'm obtaining the expected accuracy. Execution time is also comparable with other server/cloud frameworks using CPU, so I am excited about the opportunity here.

Just a couple more questions, if I may impose on your help @mrfarhadi:

mrfarhadi commented 4 years ago

@JacopoMangiavacchi Glad to hear that!

1. It seems to me that you probably need to change the kernel size of the second conv layer to (2, 2). Here is a snippet to create a similar model in Keras:

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D

input_shape = (28, 28, 1)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, kernel_size=(2, 2), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.summary()

2. No, LSTM layers are not trainable. Sorry for the confusion in the code.

JacopoMangiavacchi commented 4 years ago

Just a little spam here to thank @mrfarhadi again for your precious help.

https://medium.com/@JMangia/mnist-cnn-core-ml-training-c0f081014fa6

mrfarhadi commented 4 years ago

You are most welcome @JacopoMangiavacchi. Glad that it worked out for you. I read your Medium post and it's amazing. Thanks for sharing.

TobyRoseman commented 3 years ago

There is a lot of correspondence in this issue, but it sounds like things have been resolved.

If things have not been resolved, please open a new issue with complete steps to reproduce the problem.