Hi @ilous12, this is really good work, thank you. I think this is the first time someone has tried to run deeplabv3+ via TfLite on ArmNN, and actually the missing functionality is not that bad:
MUL, ADD, RESIZE_BILINEAR, BATCH_TO_SPACE_ND, SPACE_TO_BATCH_ND and SUB are all supported by ArmNN, and just need adding to the TfLite parser. @brunomorishita actually added some of these recently, and his code is available in the development branch at https://review.mlplatform.org/#/admin/projects/ml/armnn
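To give an idea of the work involved: adding one of these ops to the TfLite parser is mostly a matter of registering a parse function for the builtin operator and translating it into the equivalent ArmNN layer. A rough sketch (simplified; the real TfLiteParser.cpp has more validation, and this ParseSub body is illustrative only):

```cpp
// In the TfLiteParser constructor, route the builtin operator code to a
// member parse function, following the pattern of the existing operators:
m_ParserFunctions[tflite::BuiltinOperator_SUB] = &TfLiteParser::ParseSub;

// The parse function creates the matching ArmNN layer and wires up the
// operator's input and output tensors:
void TfLiteParser::ParseSub(size_t subgraphIndex, size_t operatorIndex)
{
    armnn::IConnectableLayer* layer = m_Network->AddSubtractionLayer();
    // ... register the operator's input/output slots against 'layer' here,
    // as the existing Parse* functions do ...
}
```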
ARG_MAX recently got added to Compute Library but hasn't been integrated into ArmNN yet.
CAST is a bit harder, it depends what is actually happening in the model. We don't have this functionality in ArmNN yet, and it might not be present in Compute Library either.
Then we have the ParseReshape issue - I think this is a limitation just of the TfLite parser, as ArmNN can now handle all kinds of reshapes.
So some of this is "easy" to fix in ArmNN, some is harder.
deeplab v3 is a network that we are trying to support, but I can't confirm a timeline for it. Are you in a position to help add support?
Many thanks, Matthew
@MatthewARM I understand the current status. Unfortunately, I am new to deep learning, but if you need a little help I'll try.
Hi @ilous12 if you are willing to help, a really good first step would be to get the latest master Arm NN and Compute Library and try your test again, that will give us a good picture of the remaining work.
The links to the master branches can be found on our new developer website here: https://mlplatform.org/contributing/
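For example (the repository paths as hosted on the mlplatform.org Gerrit):

```sh
git clone https://review.mlplatform.org/ml/armnn
git clone https://review.mlplatform.org/ml/ComputeLibrary
```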
ok. I will try tomorrow.
Hi @MatthewARM, I got a result; see below.

```
./TFLite 1
Optimisation mode: CpuAcc
terminating with uncaught exception of type armnn::ParseException: Buffer #89 has 0 bytes. For tensor: [1,33,33,256] expecting: 1115136 bytes and 278784 elements. at function CreateConstTensor [/Users/ilous12/armnn-devenv/armnn/src/armnnTfLiteParser/TfLiteParser.cpp:1808]
Aborted

134|alphaplus:/data/local/tmp $ ./TFLite 2
Optimisation mode: GpuAcc
terminating with uncaught exception of type armnn::ParseException: Buffer #89 has 0 bytes. For tensor: [1,33,33,256] expecting: 1115136 bytes and 278784 elements. at function CreateConstTensor [/Users/ilous12/armnn-devenv/armnn/src/armnnTfLiteParser/TfLiteParser.cpp:1808]
Aborted
```
My tflite file is attached: test.tflite.zip
Input/output: the input node is `sub_7` and the output node is `ResizeBilinear_3`.
My code is attached: TFLite.cpp.zip
What is buffer #89? Is there a verbose mode?
Hi @MatthewARM,
do you have any update on the TensorFlow Lite issue?
Thanks @ilous12, we'll try those steps and see. Buffer #89
will be one of the intermediate values in the network, and for some reason our Tensorflow Lite parser can't handle it. We'll take a look.
Hi @ilous12,
Recently I pushed some commits adding support for some of the operations in the deeplab v3 tflite model. It should now work with this model: deeplabv3_257_mv_gpu
The operations I pushed have not been merged into the development branch yet, so you'll have to get my patches. They are available at: https://review.mlplatform.org/#/q/status:open
Please let me know if this works for you.
Thanks @brunomorishita.
How can I get your patches? Can you guide me?
```sh
git fetch https://review.mlplatform.org/ml/armnn refs/changes/04/704/1 && git checkout FETCH_HEAD
```
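(That ref follows Gerrit's naming scheme, refs/changes/&lt;last two digits of change&gt;/&lt;change number&gt;/&lt;patch set&gt;, so this fetches patch set 1 of change 704.)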
I downloaded your patches. Can you share your example for deeplab? I want to compare your code with my test code.
Thanks, guys, it finally works. I will implement a sample on Android and check the semantic labels.
Hi guys, I tried to run deeplab; see below.
Thanks @brunomorishita, I tried the next steps and I saw invalid semantic labels. It works on TensorFlow Lite with deeplabv3_257_mv_gpu.tflite, but I think ArmNN has a problem.
Hi @ilous12, where you have `engineConfig->device = armnn::Compute::GpuAcc;` it's probably worth a quick try with CpuRef to see if the output from our reference (non-accelerated) implementation is different. That will help track down the problem.
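For example (engineConfig is your wrapper's structure, not an ArmNN name):

```cpp
engineConfig->device = armnn::Compute::CpuRef; // reference backend, bypasses Compute Library
```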
Unfortunately it didn't work. Did you check that my code has no problem? CpuRef: invalid, CpuAcc: invalid, GpuAcc: invalid.
The code looks mostly fine, but most likely something is going wrong with the handling of the input or output buffers.
At armnnwrapper.cc:143 you have `memcpy(engineConfig->input, input, 1*257*257*21);` which seems strange? Shouldn't that be `1*257*257*3*sizeof(float)` to match the definition of ArmnnEngineConfig::input at line 49?
You were right. I fixed it and will try again.
We ran a test but it is not enough. We will test ArmNN and TensorFlow Lite and then compare the results. I expect the same results; is that right?
Our code is attached: src.zip
`<rgb input file: read, 1x257x257x3>` → `<armnn>` → `<1x257x257x21 label output array>`
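The label output array is turned into a per-pixel label map by taking the argmax over the 21 class channels, roughly like this (an illustrative helper, not our exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert 1x257x257x21 logits (NHWC, float) into a 257x257 label map by
// taking the argmax over the class channel for every pixel.
std::vector<uint8_t> LogitsToLabels(const float* logits, int height, int width, int classes)
{
    std::vector<uint8_t> labels(static_cast<std::size_t>(height) * width);
    for (int i = 0; i < height * width; ++i)
    {
        const float* px = logits + static_cast<std::size_t>(i) * classes;
        int best = 0;
        for (int c = 1; c < classes; ++c)
        {
            if (px[c] > px[best]) { best = c; }
        }
        labels[static_cast<std::size_t>(i)] = static_cast<uint8_t>(best);
    }
    return labels;
}
```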
@ilous12 it might not be exactly the same, if ArmNN does rounding differently to TfLite, but it should be very close.
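So when comparing the two outputs, it's better to allow a small tolerance than to expect bit-exact equality; something like this hypothetical helper:

```cpp
#include <cmath>
#include <cstddef>

// True if every element of two output buffers agrees within 'tol'.
bool OutputsMatch(const float* a, const float* b, std::size_t count, float tol = 1e-4f)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        if (std::fabs(a[i] - b[i]) > tol)
        {
            return false;
        }
    }
    return true;
}
```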
Hi @ilous12, I can't see anything else obviously wrong with the code, so I don't know why you would get the wrong result.
The only thing I can suggest is that in the latest master branch (at https://review.mlplatform.org/#/admin/projects/ml/armnn) we have added a 'debug' flag to OptimizerOptions. Setting this flag will make ArmNN print the contents of all tensors to standard output, so that you can compare with similar debug output from Tensorflow Lite and figure out where it is going wrong inside the network.
Actually, if you haven't tried already, you should try with the latest master code anyway, just in case this is a bug that we've fixed without realising it.
Good luck, Matthew
@ilous12 this is a floating-point model, isn't it? Not quantised?
I'm just trying to figure out what could be going wrong.
Hi @oms1226 we should be getting something very close to Tensorflow Lite's answer - just small differences sometimes due to different arithmetic implementations.
As you are seeing something very different, it looks like there is perhaps a bug. If you can figure out which layer in the network is producing the wrong output, that would be very helpful.
Many thanks, Matthew
Thanks @MatthewARM. I want to compare with similar debug output from Tensorflow Lite and figure out where it is going wrong inside the network. So I changed m_Debug's default value from false to true in .../include/armnn/INetwork.hpp, like below:

```cpp
OptimizerOptions()
    : m_ReduceFp32ToFp16(false)
    , m_Debug(true)
{}
```

How can I find the debug log?
Hi @oms1226 when you run with m_Debug set to true, it will print all tensor values on standard output, so you can capture them in a file for debugging.
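For example, on the device you could capture it with something like this (binary name and path taken from your earlier log):

```sh
adb shell "/data/local/tmp/TFLite 2 > /data/local/tmp/tensor_dump.txt 2>&1"
adb pull /data/local/tmp/tensor_dump.txt
```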
By the way, changing the default value is overkill - it would be more usual to set it in your application with something like:

```cpp
armnn::OptimizerOptions options;
options.m_Debug = true;
engineConfig_->optNet_ = Optimize(*(engineConfig_->network_), {engineConfig_->device_},
                                  engineConfig_->runtime_->GetDeviceSpec(), options);
```
I'm sorry @MatthewARM, but I can't find the tensor values on standard output. I think standard output ends up in the adb log on an Android device. Following your advice, I set m_Debug to true as below. I wonder what's wrong? Please show me an example of the printed tensor values. I am turning to you for help.
```cpp
engineConfig->inputBindingInfo_  = engineConfig->parser_->GetNetworkInputBindingInfo(0, "sub_7");
engineConfig->outputBindingInfo_ = engineConfig->parser_->GetNetworkOutputBindingInfo(0, "ResizeBilinear_3");
armnn::IRuntime::CreationOptions options; // default options
engineConfig->options_ = options;
engineConfig->runtime_ = armnn::IRuntime::Create(engineConfig->options_);
engineConfig->device_  = armnn::Compute::GpuAcc;
armnn::OptimizerOptions op_options;
op_options.m_Debug = true;
engineConfig->optNet_ = Optimize(*(engineConfig->network_), {engineConfig->device_},
                                 engineConfig->runtime_->GetDeviceSpec(), op_options);
armnn::NetworkId networkIdentifier = 0;
engineConfig->runtime_->LoadNetwork(networkIdentifier, std::move(engineConfig->optNet_));
engineConfig->networkIdentifier_ = networkIdentifier;

engineConfig->input_size_  = input_size;
engineConfig->output_size_ = output_size;
engineConfig->input_  = new float[input_size];
engineConfig->output_ = new float[output_size];
engineConfig->inputTensor_  = MakeInputTensors(engineConfig->inputBindingInfo_, engineConfig->input_);
engineConfig->outputTensor_ = MakeOutputTensors(engineConfig->outputBindingInfo_, engineConfig->output_);
```
Hi @MatthewARM, I uploaded the network output file. Can you figure out which layer in the network is producing the wrong output?
https://drive.google.com/open?id=1wJ4RGXJWllWXPj2vyawVDojApB9E3jKF
I'm checking the layer "sub_7". I see that ArmNN rounds differently to TfLite, but the values are very close.
On Tensorflow Lite [ 0.9921875, 0.9921875, 0.9921875]
On Armnn [0.992188, 0.992188, 0.992188]
Is it right?
Hi @MatthewARM @brunomorishita, please let us know the progress of the work.
Hi @MatthewARM, if you need help with this issue, contact us.
Hello @ilous12 , I am a colleague of @MatthewARM and have picked up a ticket which looks related to this issue. Have you now had success with perfect segmentation running the model on Armnn?
@ilous12 if the debug flag isn't working on Android, can you at least print the numbers coming out of your network and compare them to TfLite? At the moment I can't figure out if what we're seeing is a bug in ArmNN or some sort of numerical precision issue.
By the way, the debug flag eventually causes code to be called in src/backends/reference/workloads/Debug.cpp which uses std::cout. If std::cout isn't working, maybe you could hack in something that works. I'm sorry, I don't really know much about what debug/printing features are available from an Android application.
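One common trick on Android is to redirect stdout into logcat; a minimal untested sketch (the tag "armnn-debug" and the buffer size are arbitrary choices):

```cpp
#include <android/log.h>
#include <cstdio>
#include <thread>
#include <unistd.h>

// Pipe everything written to stdout (including std::cout) into logcat,
// so the Debug.cpp tensor dumps become visible via 'adb logcat'.
static void RedirectStdoutToLogcat()
{
    static int fds[2];
    setvbuf(stdout, nullptr, _IOLBF, 0); // line-buffer stdout so writes flush promptly
    pipe(fds);
    dup2(fds[1], STDOUT_FILENO);         // stdout now writes into the pipe
    std::thread([] {
        char buf[512];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof(buf) - 1)) > 0)
        {
            buf[n] = '\0';
            __android_log_write(ANDROID_LOG_DEBUG, "armnn-debug", buf);
        }
    }).detach();
}
```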
Hi @kevmay01 @MatthewARM. Unfortunately, TensorFlow Lite currently only supports printing the outputs of the input and output nodes. We are looking for a way to dump all the intermediate outputs; please let us know if you know how to do that in TensorFlow Lite.
@MatthewARM, if we write an ArmNN app and a TensorFlow app with console output for deeplab, can you check it?
Hi @MatthewARM @kevmay01, we checked ArmNN and TfLite and compared the results. You can download them from the link below. https://drive.google.com/open?id=1VXsSHYulFutt6uKorgWaQfX_lS43SgrV
Hello @ilous12 The problem appears to be that the official Tensorflow Lite model deeplabv3_257_mv_gpu.tflite includes convolution layers with dilation values of 2 and 4, which Arm NN does not support. The current Arm NN master falls back to using a dilation value of 1, and this would appear to be the cause of the drop in accuracy that you are seeing.
However if you train and convert a model that only uses dilation values of 1 you should see the same results between Arm NN and Tensorflow Lite. I converted your frozen_inference_graph.pb which you attached here to a tflite file and was able to get the same results running on Arm NN and Tf Lite.
Thanks for the reply, I understand your comment. I will try to train a model, check it, and then share the results with you.
Hi @kevmay01, how did you change the dilation values to 1? Did you use tflite_convert? Can you share the command with me?
I converted the file you attached to this ticket (frozen_inference_graph.pb), removing the preprocessing layers and some of the last layers. This creates a file which I think is not what you are ultimately looking for, but I was able to use it to prove that the output from TfLite and Armnn were the same.
```sh
tflite_convert \
  --graph_def_file=/home/kevmay01/Downloads/frozen_inference_graph.pb \
  --output_file=/home/kevmay01/test_new.tflite \
  --input_shapes=1,129,129,3 \
  --input_arrays=sub_7 \
  --output_arrays=ResizeBilinear_2 \
  --inference_type=FLOAT \
  --inference_input_type=FLOAT \
  --std_dev_values=128 \
  --mean_value=128
```
You will need to train a new model I think and figure out how to convert and optimize it to make a similar model to the official deeplabv3_257_mv_gpu.tflite file, but only using default dilation values.
@ilous12 I think the information you need regarding creating a model with dilation set to 1 is in this ticket: https://github.com/tensorflow/tensorflow/issues/26474
There is also an example attached to that ticket of an identical deeplabv3_257_mv_gpu.tflite, but with dilation values set to 1.
Hi @kevmay01, do you have a schedule for supporting convolution layers with dilation values of 2 and 4?
Hi @MatthewARM @kevmay01, I created a model with dilation values of 1 (by setting output_stride=32) and I saw my shape. Nice work.
But I got some problems; see below.

- Tensorflow Lite result (image)
- Armnn result (image)
- Composite image, tflite + armnn (image)

I think the shape of the result is a little different, can you check?
Thanks @ilous12 I see what you mean about the output being different. I'll try to get someone to look at whether this indicates a bug but I'm not sure whether that will be this week or later.
Have you tried with the Arm NN CpuRef backend? That doesn't use Compute Library so it eliminates one possible source of errors.
Hi @MatthewARM, I got results; see below.

- CpuRef (image)
- Gpu (image)

We also have a crash when running CpuAcc on Release 19.02; see below.
```
com.test.sample  #00 pc 00000000004c184c  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (arm_compute::NEScaleKernel::scale_nhwc(arm_compute::Window const&)+792)  <---- This point
com.test.sample  #01 pc 000000000032fa28  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so
com.test.sample  #02 pc 000000000032f4f8  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (arm_compute::CPPScheduler::run_workloads(std::__ndk1::vector<std::__ndk1::function<void (arm_compute::ThreadInfo const&)>, std::__ndk1::allocator<std::__ndk1::function<void (arm_compute::ThreadInfo const&)>>>&)+220)
com.test.sample  #03 pc 000000000032f7c8  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)+348)
com.test.sample  #04 pc 000000000037a4f4  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (arm_compute::NEScale::run()+84)
com.test.sample  #05 pc 00000000002b9d00  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (armnn::NeonResizeBilinearWorkload::Execute() const+304)
com.test.sample  #06 pc 00000000002389f8  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (armnn::LoadedNetwork::Execute()+140)
com.test.sample  #07 pc 0000000000237a70  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (armnn::LoadedNetwork::EnqueueWorkload(std::__ndk1::vector<std::__ndk1::pair<int, armnn::ConstTensor>, std::__ndk1::allocator<std::__ndk1::pair<int, armnn::ConstTensor>>> const&, std::__ndk1::vector<std::__ndk1::pair<int, armnn::Tensor>, std::__ndk1::allocator<std::__ndk1::pair<int, armnn::Tensor>>> const&)+2380)
com.test.sample  #08 pc 0000000000258d1c  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn.so (armnn::Runtime::EnqueueWorkload(int, std::__ndk1::vector<std::__ndk1::pair<int, armnn::ConstTensor>, std::__ndk1::allocator<std::__ndk1::pair<int, armnn::ConstTensor>>> const&, std::__ndk1::vector<std::__ndk1::pair<int, armnn::Tensor>, std::__ndk1::allocator<std::__ndk1::pair<int, armnn::Tensor>>> const&)+408)
com.test.sample  #09 pc 0000000000003c00  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn_mobile_jni.so (armnn::ArmnnEngineContext::inference(float*)+64)
com.test.sample  #10 pc 0000000000002bc8  /data/app/com.test.sample-IsGLQ5XgkmORLc4rtHSI7w==/lib/arm64/libarmnn_mobile_jni.so (Java_com_skt_tnn_ArmnnNativeWrapper_inference+484)
```
Hi @MatthewARM,
We measured the elapsed time of each API call and noticed one that is very slow; see below.

```
parser_ = armnnTfLiteParser::ITfLiteParser::Create();   0 ms
parser_->CreateNetworkFromBinaryFile();                73 ms
parser_->GetNetworkInputBindingInfo();                  0 ms
optNet_ = Optimize();                                  28 ms
runtime_->LoadNetwork();                             2270 ms  <------ so slow
```

Can you check this problem?
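For reference, we took the timings with plain std::chrono around each call; a simplified sketch (the helper is ours, nothing ArmNN-specific):

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Generic helper: time a callable and print the elapsed milliseconds.
template <typename F>
void TimeStage(const char* name, F&& stage)
{
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(stage)();
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}

// Usage in our wrapper:
// TimeStage("LoadNetwork", [&] { runtime_->LoadNetwork(id, std::move(optNet_)); });
```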
Hi @MatthewARM
When I use `optimizerOptions.m_Debug = true` with GpuAcc on 19.02, I get a crash:

```
armnn/src/armnn/LoadedNetwork.cpp:192: const armnn::IWorkloadFactory &armnn::LoadedNetwork::GetWorkloadFactory(const armnn::Layer &) const: assertion "(IWorkloadFactory::IsLayerSupported(layer, {}, reasonIfUnsupported)) && ("Factory does not support layer")" failed
Aborted (core dumped)
```
Thanks.
Hi @MatthewARM, we found a wrong result from the op "ResizeBilinear_2", which performs the upscale in ArmNN:

| input | node | output | status |
|---|---|---|---|
| 1x9x9x320 | AvgPool2D | 1x1x1x320 | same |
| 1x1x1x320 | ResizeBilinear | 1x9x9x256 | same |
| 1x9x9x21 | ResizeBilinear_1 | 1x9x9x21 | same |
| 1x9x9x21 | ResizeBilinear_2 | 1x257x257x21 | wrong |

Can you check the "upscale" behaviour of the ResizeBilinear op? We attached the output files (resizeBilinear.zip); a reference implementation to compare against is sketched below.
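To make the comparison concrete, a minimal reference bilinear resize (NHWC layout, align_corners = false, the TensorFlow default) can be used to check the ResizeBilinear_2 output independently of both runtimes. This is an illustrative sketch, not code from either project:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinear resize of an inH x inW x channels tensor to outH x outW,
// using TensorFlow's align_corners = false convention (scale = in/out).
std::vector<float> ResizeBilinearRef(const std::vector<float>& in,
                                     int inH, int inW, int channels,
                                     int outH, int outW)
{
    std::vector<float> out(static_cast<std::size_t>(outH) * outW * channels);
    const float scaleH = static_cast<float>(inH) / outH;
    const float scaleW = static_cast<float>(inW) / outW;
    for (int y = 0; y < outH; ++y)
    {
        const float inY = y * scaleH;
        const int   y0  = static_cast<int>(std::floor(inY));
        const int   y1  = std::min(y0 + 1, inH - 1);
        const float dy  = inY - y0;
        for (int x = 0; x < outW; ++x)
        {
            const float inX = x * scaleW;
            const int   x0  = static_cast<int>(std::floor(inX));
            const int   x1  = std::min(x0 + 1, inW - 1);
            const float dx  = inX - x0;
            for (int c = 0; c < channels; ++c)
            {
                auto at = [&](int yy, int xx) {
                    return in[(static_cast<std::size_t>(yy) * inW + xx) * channels + c];
                };
                // Interpolate horizontally on the two rows, then vertically.
                const float top    = at(y0, x0) * (1 - dx) + at(y0, x1) * dx;
                const float bottom = at(y1, x0) * (1 - dx) + at(y1, x1) * dx;
                out[(static_cast<std::size_t>(y) * outW + x) * channels + c] =
                    top * (1 - dy) + bottom * dy;
            }
        }
    }
    return out;
}
```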
Hi @ilous12 sorry for the long delay in replying. The only question I know how to answer is this one:
> We measured the elapsed time of each API call and noticed one that is very slow; see below.
>
> ```
> parser_ = armnnTfLiteParser::ITfLiteParser::Create();   0 ms
> parser_->CreateNetworkFromBinaryFile();                73 ms
> parser_->GetNetworkInputBindingInfo();                  0 ms
> optNet_ = Optimize();                                  28 ms
> runtime_->LoadNetwork();                             2270 ms  <------ so slow
> ```
>
> Can you check this problem?
I expect that the 2.3 seconds here is spent compiling OpenCL kernels for the GpuAcc backend. We're looking into various ways to cache those kernels so that it doesn't have to be done every time you run your program, but it's a hard problem.
I'm sorry but I don't yet have an answer for your other problems.
Hi @ilous12 the support for dilation in DepthwiseConvolution has been merged to master so hopefully your original model will now work!
On the 'upscale' issue, do you happen to know which resize method is in use? It should be one of:

```
BILINEAR = 0
NEAREST_NEIGHBOR = 1
BICUBIC = 2
AREA = 3
```
Many thanks, Matthew
Hi, I tried to apply deeplabv3+.
I built Deeplab and got a frozen .pb: [frozen_inference_graph.pb.zip]
When I converted it, I got: [deeplab_257_quantized.tflite.zip]
Finally, I ran it on Android (Samsung Note 8, which supports OpenCL + NEON) with the code below,
and I got the error below. I have read the list of supported ops. Is that the problem?