Open francescojalvo opened 4 years ago
Setting bidirectional to False i.e. uni directional makes conversion work. I believe this is your clue.
Mismatch: 50%
Seems to me this is not a converter issue but a PyTorch setup issue. There is a mismatch in shape of initial_h0 (2,1,128) with the resulting ex_hn (1, 1, 256). Just print their shapes. I do not remember from the top of my head how to present h0/c0 for a bidirectional network but the pytorch documentation should have an example. h0 = torch.randn(num_layers num_directions, BATCH_SIZE, hidden_size) or h0 = torch.randn(num_layers , BATCH_SIZE, hidden_size num_directions) What do you think?
We had the same issue in our codebase and at the end we found this is a reproducible case from coremltools unit tests. This is why I'm really puzzled about why this passes via coremltools pipelines. Is it possible that pipelines is not running on MacOS and the output validation is not being executed? Unidirectional LSTM works.
About the sizes, there is specific code to reshape the results from bidirectional LSTM in Torch to make it compatible with CoreML. It should be around there btw because the assert always fails by 50%... The first part of the output (128) is ok an the rest is different
Mhh. How about you use torch.manual_seed(0) # deterministic output and reduce the hidden_size to 2 and just print out the values of the h0/c0 and ex_hn/ex_cn for clues?
Out of curiosity, I was digging a bit deeper. Issue still unsolved but more intriguing and with a little extra insight I hope that helps. Hope you do not mind.
Attached the more verbose version of your code snippet that I used to produce "result.txt", also attached. modified.py > result.txt modified.txt result.txt
and summary.txt with the analysis. summary.txt
In a nut shell, summary from summary.txt
When you do NOT pass h0/c0 (i.e. zero input for h0/c0) then the output 'y' of the BLSTM looks ok. Both forward and backward parts i.e. 100%
Look at the log /output below especially the annotated parts with #####
Even better, h_n[0] is in the position of h_n[1] i.e. there is an indexing error somewhere in this code or the converter
When you DO pass in h0/c0 explicit (your original case) then the backward parts of the result breaks i.e. 50% correct
In modified.py use the H0C0InputsFlag to switch
I do not think the "if bidirectional:" formatting is contributing anything
Thoughts?
Thanks for the insights ... I think all the issues come from an indexing error. Not sure if it's in this code or in the converter. I'm preparing a simpler test case to verify this.
This is the simplest test case I could think of:
import torch
import torch.nn as nn
import numpy as np
from coremltools import TensorType
from coremltools import convert
def test_lstm(
input_size,
hidden_size,
bidirectional
):
model = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=1,
bias=True,
batch_first=False,
dropout=0.0,
bidirectional=bidirectional
)
SEQUENCE_LENGTH = 1
BATCH_SIZE = 1
input = torch.rand(SEQUENCE_LENGTH, BATCH_SIZE, input_size)
model.eval()
torch_model = torch.jit.trace(model, input)
mlmodel = convert(torch_model, inputs=[TensorType(name="x", shape=input.shape)])
torch_results = torch_model(input)
coreml_results = mlmodel.predict({"x": input.numpy()}, useCPUOnly=True)
torch_output = torch_results[0].detach().numpy().squeeze()
coreml_output = coreml_results[sorted(coreml_results.keys())[0]].squeeze()
print("Input:")
print(input)
print("Torch output:")
print(torch_output)
print("CoreML output:")
print(coreml_output)
# Testing last hidden state and last cell state
torch_hn = torch_results[1][0].detach().numpy().squeeze().flatten()
torch_cn = torch_results[1][1].detach().numpy().squeeze().flatten()
coreml_hn = coreml_results[sorted(coreml_results.keys())[1]].squeeze()
coreml_cn = coreml_results[sorted(coreml_results.keys())[2]].squeeze()
print("Torch hn:")
print(torch_hn)
print("CoreML hn:")
print(coreml_hn)
print("Torch cn:")
print(torch_cn)
print("CoreML cn:")
print(coreml_cn)
# Testing output (differentes if atol <= 1e-5)
np.testing.assert_allclose(torch_output, coreml_output, atol=1e-4)
# Testing last hidden state and last cell state (differents)
np.testing.assert_allclose(torch_hn, coreml_hn, atol=1e-4)
np.testing.assert_allclose(torch_cn, coreml_cn, atol=1e-4)
test_lstm(2, 2, True)
As you can see, the output of the LSTM is almost equal (atol=1e-4) but the last hidden state and last cell state is completely different, seems that coreml only returns the states of the backward pass. Maybe there is an indexing issue, but as you can see, the code is fairly simple, couldn't find the error.
Unidirectional LSTM, works perfectly:
test_lstm(2, 2, False)
TL;DR I suspect the converter ObjC/C++ code makes a mistake with reformats of the h_n/c_n before you see the predict result in Python, but I cannot confirm it atm.
That is a nice, concise testcase, appreciated!
Here more analysis in detail such that someone can help nailing this down.
1) Attached a more verbose version that I used to get new insights.
The full log file is in
NOTE the WARNING
WARNING:root:Tuple detected at graph output. This will be flattened in the converted model.
and I do not know whether that explains the bug, or not?
2) Using the testcase, we see the following output of the bidirectional LSTM. I limited the output precision to .4 to emphasize the result :)
Input, shape:torch.Size([1, 1, 2]):tensor([[[0.7497, 0.6047]]]) Torch output, shape= (4,) :[-0.1497 -0.0001 0.053 -0.0807] CoreML output, shape= (4,) :[-0.1497 -0.0001 0.053 -0.0807] Torch hn, shape= (4,) :[-0.1497 -0.0001 0.053 -0.0807] # <-------------- CoreML hn, shape= (4,) :[ 0.053 -0.0807 0. 0. ] # <-------------- Torch cn , shape= (4,) :[-0.3136 -0.0002 0.1331 -0.302 ] # <-------------- CoreML cn , shape= (4,) :[ 0.1331 -0.302 0. 0. ] # <--------------
Now notice that
CoreML hn, shape= (4,) :[ 0.053 -0.0807 .... ]
is IDENTICAL to the last 2 values of "Torch hn", Torch hn, shape= (4,) :[ ..... 0.053 -0.0807]
CoreML cn , shape= (4,) :[ 0.1331 -0.302 .... ]
is IDENTICAL to the last 2 values of "Torch cn". Torch cn , shape= (4,) :[..... 0.1331 -0.302 ]
That is no coincidence. Someone forget (or added) a size factor 2 (for bidir) when producing the final tensor/array. In other words, it seems like internally in CoreML(converter), the predict result might be ok, but we as end users only see a subslice as CoreML hn/cn which are produced in the "wrong" way?
The question is of course - can that be fixed (yes) and where?
While I produced a debug version of libcoremlpython.so, and checked with lldb breakpoints on LSTM handling, I have not pinned it down. Anyone has better luck?
In even more detail, backtrace for
coreml_results = mlmodel.predict({"x": input.numpy()}, useCPUOnly=True)
is roughly
We call
a) coremltools//models/model.py L302 def predict(self, data, useCPUOnly=False, **kwargs):
b) ./coremlpython/CoreMLPython.mm
PYBIND11PLUGIN(libcoremlpython) {
py::module m("libcoremlpython", "CoreML.Framework Python bindings");
py::class
c) ./coremlpython/CoreMLPython.h public: class Model { py::dict predict(const py::dict& input, bool useCPUOnly);
And then the trail goes stale as I do not see the ObjC function [ predictionFromFeatures ]. I know that is produced when you compile a COreML model in Xcode and we probably do something similar here but I cannot 'step' into it.
Also, I could not confirm the 'shape' of the final predict output via lldb. Some symbols missing
3) Based on the torch code and graph produced , we see in the log file testBLSTM.jul30.output.txt
code: hx = torch.zeros([2, int(max_batch_size), 2], dtype=6, layout=0, device=torch.device("cpu"), pin_memory=False) _8, _9, _10 = torch.lstm(input, [hx, hx], [_7, _6, _5, _4, _3, _2, _1, _0], True, 1, 0., False, True, False) return (_8, (_9, _10))
seems correct, and Torch uses twice hx to feed in the zero h0/c0, whatever
graph:
%67 : Float(1, 1, 4), %68 : Float(2, 1, 2), %69 : Float(2, 1, 2) = aten::lstm(%input, %59, %60, %61, %62, %63, %64, %65, %66) # /Users/hans-web/Library/Python/3.7/lib/python/site-packages/torch/nn/modules/rnn.py:570:0 %70 : (Float(2, 1, 2), Float(2, 1, 2)) = prim::TupleConstruct(%68, %69) %71 : (Float(1, 1, 4), (Float(2, 1, 2), Float(2, 1, 2))) = prim::TupleConstruct(%67, %70)
While that looks reasonable, I am not sure whether the order or 67/68/69 in the output is significant.
4) Also, look at the Torch Python code (not coremltools converter) torch/nn/modules/rnn.py
has a comment in 1.51. and 1.6.0. Does that ring any bells?
#
#
class LSTM(RNNBase):
The story continues based on the previous message:
In even more detail, backtrace for
coreml_results = mlmodel.predict({"x": input.numpy()}, useCPUOnly=True) is roughlyWe call
a) coremltools//models/model.py L302 def predict(self, data, useCPUOnly=False, **kwargs):
- which takes the proxy route 'if' branch
b) ./coremlpython/CoreMLPython.mm PYBIND11PLUGIN(libcoremlpython) { py::module m("libcoremlpython", "CoreML.Framework Python bindings"); py::class
(m, "_MLModelProxy") .def(py::init<const std::string&, bool>()) .def("predict", &Model::predict) # <------ c) ./coremlpython/CoreMLPython.h public: class Model { py::dict predict(const py::dict& input, bool useCPUOnly);
We now look at the ObjC / C++ code in ./coremlpython/CoreMLPython.mm
Specifically, roughly line 72
py::dict Model::predict(const py::dict& input, bool useCPUOnly) { @autoreleasepool { NSError error = nil; MLDictionaryFeatureProvider inFeatures = Utils::dictToFeatures(input, &error); Utils::handleError(error); MLPredictionOptions *options = [[MLPredictionOptions alloc] init]; options.usesCPUOnly = useCPUOnly;
id<MLFeatureProvider> outFeatures = [m_model predictionFromFeatures:static_cast<MLDictionaryFeatureProvider * _Nonnull>(inFeatures)
options:options
error:&error];
I was wondering whether the error is in the converter code (here), or already wrong when we get it back from [m_model predictionFromFeatures] in the 'outFeatures'.
The answer is the latter. With a bit of C++ surgery, we can get a handle on
MLMultiArray h_n = outFeatures -> "28"); MLMultiArray c_n = outFeatures -> "29");
and ask for shape of the buffers (all three [1,1,4] ) and value of those buffers. I see
h_n[0] = 0.052964 h_n[1] = -0.080748 h_n[2] = 0.000000 h_n[3] = 0.000000
c_n[0] = 0.133108 c_n[1] = -0.302025 c_n[2] = 0.000000 c_n[3] = 0.000000
At least on my machine, the values from h_n and c_n are wrong (but identical to the Python output in earlier comments) when we get them from CoreML Bidirectional model.
Let's summarize what this bug seems to mean for Bidirectional LSTMs in CoreML and your Apps. Is this about correct?
1) y = blstm(x) without input h_0/c_0 nor use of output c_n/h_n works fine, and delivers the the correct result 'y' like in PyTorch This was ok in the previous version (last year) or CoreML(tools) as well. Any one-shot recognizer with BLSTMs will use this mode, ok.
2) y, (c_n, h_n) = blstm(x) Same as (1) but we look at the returned c_n/h_n, which are both wrong. (this testcase, testBLSTM.jul30.py https://github.com/apple/coremltools/issues/824#issuecomment-667196355, or derived from the really nice testcase https://github.com/apple/coremltools/issues/824#issuecomment-666202297 ) No idea whether that worked last year or not.
3) y, (c_n, h_n) = blstm(x, (zero, zero)) WITH input h_0/c_0 , and all of returned y, cn, h_n are wrong (testcase in first message, this thread, https://github.com/apple/coremltools/issues/824#issue-667198642)
4) y, (c_n, h_n) = blstm(x, (c_0, h_0)) like (3) and all of return values y, cn, h_n are wrong again (testcase in first message, this thread, https://github.com/apple/coremltools/issues/824#issue-667198642)
In other words, ANY use of the BLSTM input value c_0/h_0 or output c_n/h_n seems to lead to bugs and wrong values. When you provide input h_0/c_0 then output 'y' is wrong as well as the backward pass of LSTM is affected.
Or, if there are streaming application with BLSTMs, which is typically not the case, then all of them affected.
Finally, as we presume that TF converted models are directed to the same CoreML code they might be affected in the same way. Haven't tried but that is what it looks like.
It would be nice to have a fix for such an issue.
Anyway, moving on.
Thanks @leovinus2001. I'll let this opened until someone from CoreML can have a look.
PS: This issue sounds an awful lot like issue #263 which is two years old, ugh. In #263, the suggested workaround is ok but this mangling of the hn/cn outputs from BLSTMs also affects your TestLSTM testcase and the suggested testcase in the PR #843 as far as I can see.
ping
@leovinus2001 Marked as a P1. We will take a look at this before we promote 4.0 out of beta. Thanks for your patience.
🐞Describe the bug
Trace
No
To Reproduce
Results:
System environment (please complete the following information):
Additional context