apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

PyTorch model converted to neuralnetwork crashes in Swift #1644

Open robertsulej opened 1 year ago

robertsulej commented 1 year ago

I am struggling to deploy a large model in a Swift application. I need the neuralnetwork format, since a crucial part has to be a custom layer performing calculations on the GPU. The problem occurs in standard code, though, and I managed to narrow it down to a fairly simple repro. It does not calculate anything useful, but it shows the problem.

The PyTorch model is:

import torch

def some_fn(tenInput, tenFlow):

    # append a channel of ones to the (N, C, H, W) input
    shapeInt = torch.tensor(tenInput.shape, device=tenFlow.device)
    tenOnes = torch.ones(
        [ shapeInt[0], 1, shapeInt[2], shapeInt[3] ],
        dtype=tenFlow.dtype,
        device=tenFlow.device
    )
    tenOutput = torch.cat([ tenInput, tenOnes ], 1)

    # here I actually apply my custom layer, but it is not needed to trigger the crash...

    # split off the last channel and zero out tenResult wherever the mask holds
    tenResult = tenOutput[:, :-1, :, :]
    tenMask = tenOutput[:, -1:, :, :]

    tenMask = torch.lt(tenMask, 0.999).expand(tenResult.shape)
    tenResult[tenMask] = 0.0

    return tenResult.to(torch.float32)

class TestModel(torch.nn.Module):

    def forward(self, x, flow):
        y = some_fn(x, flow)
        return y

Conversion to mlmodel is also standard:

import coremltools as ct

# a test pattern I use to validate outputs, not really
# relevant here, any x and y inputs are ok
w = 1024
h = 768
ch = 32

x = torch.zeros(1*ch*h*w, dtype=torch.float32)
y = torch.zeros(1*2*h*w, dtype=torch.float32)
for i in range(x.shape[0]):
    x[i] = 0.1 * (i % 13)
for i in range(y.shape[0]):
    y[i] = (i % 19) - 10

x = x.reshape((1,ch,h,w)).to('cuda')
y = y.reshape((1,2,h,w)).to('cuda')

# conversion:
m = TestModel().to('cuda').eval()
traced_model = torch.jit.trace(m, (x, y), check_trace=False)
mlmodel = ct.convert(
    traced_model,
    convert_to="neuralnetwork",
    inputs=[
        ct.TensorType(name="x", shape=x.shape),
        ct.TensorType(name="y", shape=y.shape),
    ],
    debug=False
)
mlmodel.save("test_model.mlmodel")

Then I add the model to a Swift project in Xcode and load/run it with this code:

guard let model = try? test_model() else {
    fatalError("loading failed")
}

let w = 1024 as NSNumber
let h = 768 as NSNumber
let c = 32 as NSNumber

// input pattern, the same as in PyTorch
guard let x_inp = try? MLMultiArray(shape:[1,c,h,w], dataType:MLMultiArrayDataType.float32) else {
    fatalError("failed on input x")
}
guard let y_inp = try? MLMultiArray(shape:[1,2,h,w], dataType:MLMultiArrayDataType.float32) else {
    fatalError("failed on input y")
}
for i in 0..<x_inp.count {
    x_inp[i] = NSNumber(floatLiteral: 0.1 * Double(i % 13))
}
for i in 0..<y_inp.count {
    y_inp[i] = NSNumber(floatLiteral: Double(i % 19) - 10)
}

// run prediction - throws here:
let output = try! model.prediction(x: x_inp, y: y_inp)

The prediction() function throws EXC_BAD_ACCESS with not much useful information. The exception comes from the auto-generated let outFeatures = try model.prediction(...) line.

When I reduce the model further, I get some hopefully useful info.

The model:

def some_fn(tenInput, tenFlow):
    tenOutput = tenInput

    tenResult = tenOutput[:, :-1, :, :]
    tenMask = tenOutput[:, -1:, :, :]

    tenMask = torch.lt(tenMask, 0.999).expand(tenResult.shape)
    tenResult[tenMask] = 0.0

    return tenResult.to(torch.float32)

class TestModel(torch.nn.Module):

    def forward(self, x, flow):
        y = some_fn(x, flow)
        return y

The conversion code and Swift code are the same as above. The exception on model.prediction() reads:

2022-10-24 21:57:47.344438+0200 FrameIterpTry[1547:42137] [espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "Invalid argument": scatter_nd_kernel: In TF_SCATTER_ND mode, Invalid index 1 into axis of size 1
 status=-6
2022-10-24 21:57:47.344546+0200 FrameIterpTry[1547:42137] [coreml] Error computing NN outputs -6
2022-10-24 21:57:47.344602+0200 FrameIterpTry[1547:42137] [coreml] Failure in -executePlan:error:.
FrameIterpTry/ContentView.swift:693: Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=com.apple.CoreML Code=0 "Error computing NN outputs." UserInfo={NSLocalizedDescription=Error computing NN outputs.}
2022-10-24 21:57:47.348461+0200 FrameIterpTry[1547:42137] FrameIterpTry/ContentView.swift:693: Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=com.apple.CoreML Code=0 "Error computing NN outputs." UserInfo={NSLocalizedDescription=Error computing NN outputs.}
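
If it helps, the scatter_nd error makes me suspect the boolean-masked in-place assignment (tenResult[tenMask] = 0.0), which presumably lowers to a scatter op during conversion. A functional rewrite along these lines might sidestep it; this is only a sketch, and I have not verified that it converts cleanly:

def some_fn(tenInput, tenFlow):
    tenOutput = tenInput

    tenResult = tenOutput[:, :-1, :, :]
    tenMask = torch.lt(tenOutput[:, -1:, :, :], 0.999)

    # torch.where broadcasts the (N, 1, H, W) mask over tenResult, so
    # both the explicit expand() and the in-place scatter go away
    tenResult = torch.where(tenMask, torch.zeros_like(tenResult), tenResult)

    return tenResult.to(torch.float32)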

I am using:

I could probably avoid these problems if custom layers were supported with the mlprogram format... are there any plans for this?

Thanks for the help! Robert

TobyRoseman commented 1 year ago

The traced PyTorch model appears to be valid: traced_model(x, y) returns predictions without error.

mlmodel.predict({'x': x.numpy(), 'y': y.numpy()}) produces a Segmentation fault: 11 error, so this is not an Xcode issue. Either this is an issue in the Core ML framework, or coremltools is producing an invalid model.

In order to run the code on macOS, I had to remove the three to('cuda') calls.
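
For reference, the Python-side check amounts to this (reusing x, y, traced_model, and mlmodel from the snippets above, with the to('cuda') calls removed):

# the traced PyTorch model runs fine
with torch.no_grad():
    torch_out = traced_model(x, y)

# the converted Core ML model crashes the Python process
coreml_out = mlmodel.predict({'x': x.numpy(), 'y': y.numpy()})
# -> Segmentation fault: 11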

robertsulej commented 1 year ago

Thanks for checking! I am using to('cuda') to play with 16-bit precision, but even with all precision-related options left at their defaults I see the crash in my Swift project.

Please let me know if I can provide any additional tests, or if there is a way to work around this problem.

TobyRoseman commented 1 year ago

Looks like this does work correctly if you convert to 'mlprogram' rather than 'neuralnetwork'. I know you can't use MIL because it doesn't support custom layers.
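
For the record, the only change to your conversion snippet is the target format (plus the .mlpackage extension when saving):

mlmodel = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(name="x", shape=x.shape),
        ct.TensorType(name="y", shape=y.shape),
    ],
)
mlmodel.save("test_model.mlpackage")  # mlprogram models are saved as .mlpackage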

One possible workaround would be to use a pipeline model that contains a MIL model and a neuralNetwork model.
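
A rough sketch of what that could look like with the Pipeline API (the file names, shapes, and split point below are placeholders, and I have not verified that a mixed mlprogram/neuralnetwork pipeline round-trips cleanly):

import coremltools as ct
from coremltools.models import datatypes
from coremltools.models.pipeline import Pipeline

# hypothetical pieces: an mlprogram front-end plus a neuralnetwork
# back-end holding the custom layer, each converted separately
front = ct.models.MLModel("front.mlpackage")  # convert_to="mlprogram"
back = ct.models.MLModel("back.mlmodel")      # convert_to="neuralnetwork"

# the pipeline interface must match the first model's inputs and the
# last model's outputs; intermediate feature names have to line up too
input_features = [
    ("x", datatypes.Array(1, 32, 768, 1024)),
    ("y", datatypes.Array(1, 2, 768, 1024)),
]
output_features = [("out", datatypes.Array(1, 31, 768, 1024))]

pipeline = Pipeline(input_features, output_features)
pipeline.add_model(front)
pipeline.add_model(back)

pipeline_model = ct.models.MLModel(pipeline.spec)
pipeline_model.save("pipeline.mlpackage")  # .mlpackage since it contains an mlprogram

Recent coremltools releases also have ct.utils.make_pipeline, which wires the model interfaces together for you, if your version includes it.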

robertsulej commented 1 year ago

Yes, I have already started working on such an approach... it is going to be hard in my case, since the custom features repeat deep within the complex model structure.

Please consider supporting custom layers in the mlprogram format.