facebookarchive / AICamera

Demonstration of using Caffe2 inside an Android application.

Using PyTorch 1.0 and app crashed with the error "android A/libc Fatal signal 6 (SIGABRT), code -6" #68

Open cedrickchee opened 5 years ago

cedrickchee commented 5 years ago

Hi,

I am experiencing a similar error to the one in issue #54, but the underlying problem is different. I have looked at issue #17 as well.

In my case, I am using the PyTorch 1.0 preview and the Python interface to the Caffe2 implementation of ONNX (not the deprecated ONNX-Caffe2).

My system environment:

If I run the app with the two .pb files, squeeze_init_net.pb and squeeze_predict_net.pb, that originally shipped in the AICamera project's assets directory, it runs fine.

If I replace the two .pb files with the ones generated by my own code, the app crashes immediately when I run it.

Here's my code for that step in this Jupyter notebook. It handles exporting the same SqueezeNet model from torchvision to run on Android devices.

Here's the log from adb logcat:

11-06 22:37:45.009 31778-31778/? I/zygote: Late-enabling -Xcheck:jni
11-06 22:37:45.490 31778-31808/facebook.f8demo E/F8DEMO: Attempting to load protobuf netdefs...
11-06 22:37:45.580 31778-31812/facebook.f8demo D/OpenGLRenderer: HWUI GL Pipeline
11-06 22:37:45.657 31778-31808/facebook.f8demo E/F8DEMO: done.
11-06 22:37:45.657 31778-31808/facebook.f8demo E/F8DEMO: Instantiating predictor...
11-06 22:37:45.673 31778-31812/facebook.f8demo I/Adreno: QUALCOMM build                   : 8e59954, I0be83d0d26
                                                         Build Date                       : 09/22/17
                                                         OpenGL ES Shader Compiler Version: EV031.21.02.00
                                                         Local Branch                     : O17A
                                                         Remote Branch                    : 
                                                         Remote Branch                    : 
                                                         Reconstruct Branch               : 
11-06 22:37:45.676 31778-31812/facebook.f8demo D/vndksupport: Loading /vendor/lib/hw/gralloc.msm8994.so from current namespace instead of sphal namespace.
11-06 22:37:45.745 31778-31812/facebook.f8demo I/Adreno: PFP: 0x00000000, ME: 0x00000000

                                                         --------- beginning of crash
11-06 22:37:45.755 31778-31808/facebook.f8demo A/libc: Fatal signal 6 (SIGABRT), code -6 in tid 31808 (AsyncTask #1), pid 31778 (facebook.f8demo)
11-06 22:37:45.760 31778-31812/facebook.f8demo I/zygote: android::hardware::configstore::V1_0::ISurfaceFlingerConfigs::hasWideColorDisplay retrieved: 0
11-06 22:37:45.761 31778-31812/facebook.f8demo I/OpenGLRenderer: Initialized EGL, version 1.4
11-06 22:37:45.761 31778-31812/facebook.f8demo D/OpenGLRenderer: Swap behavior 2
11-06 22:37:45.786 31778-31778/facebook.f8demo I/CameraManagerGlobal: Connecting to camera service

By running the app in debug mode, when the crash happens, the error points to the line in native-lib.cpp that instantiates caffe2::Predictor:

extern "C"
void
Java_facebook_f8demo_ClassifyCamera_initCaffe2(
        JNIEnv* env,
        jobject /* this */,
        jobject assetManager) {
    AAssetManager *mgr = AAssetManager_fromJava(env, assetManager);
    alog("Attempting to load protobuf netdefs...");
    loadToNetDef(mgr, &_initNet,   "squeeze_init_net.pb");
    loadToNetDef(mgr, &_predictNet,"squeeze_predict_net.pb");
    alog("done.");
    alog("Instantiating predictor...");
    _predictor = new caffe2::Predictor(_initNet, _predictNet); // crash happens here
    alog("done.");
}

Can someone please help me? I have been struggling with this issue for almost two days now. I understand that you are really busy (I think we all are), but I hope you can at least respond with some direction I can take or suggestions on things I should try.

Thank you in advance.

RuABraun commented 5 years ago

I have the same issue.

cedrickchee commented 5 years ago

Hi @Nimitz14 Please see this Twitter thread I started to get some context on the problem. Peter also chipped in on the conversation. Peter, who did much of the work on the new LibTorch C++ frontend and moderates the official PyTorch forums, wrote a great tutorial on loading a PyTorch model in C++.

He pointed us to this direction:

Thomas Viehmann is working on a direct C++ LibTorch port for Android devices. Looks promising and might make the development on mobile smoother!

In Thomas's article, I think the most important gist is:

I believe that the C++ tutorial shows how things should work: you have ATen and torch in C++ - they feel just like PyTorch on Python - load your model and are ready to go! No cumbersome ONNX, no learning Caffe2 enough to integrate it to your mobile app. Just PyTorch and libtorch!

There are currently 2 approaches to ship a neural network to Android devices:

  1. Update the AICamera demo to work with Caffe2 from PyTorch/master
  2. Compile and get LibTorch working on Android (be it arm/arm64, x86, etc.)

I think they are suggesting we take the direction set out in approach 2 moving forward. I am not too sure, though. If I had to guess, the state of things in this area should become clearer once PyTorch drops the preview tag from 1.0.

RuABraun commented 5 years ago

Thank you for the additional info @cedrickchee ! I don't have the time to wait until someone manages to get libtorch to work on Android though, so I guess the only hope is a) to try other methods of converting from ONNX to Caffe2, or b) TensorFlow.

cedrickchee commented 5 years ago

Yeah, I guess those are the best options for now in your circumstances.

I just watched this series of interviews with Soumith Chintala, the co-creator of PyTorch, which is part of the new Udacity course by FAIR, "Intro to Deep Learning with PyTorch".

Hybrid Frontend And JIT Compiler (video)

If I interpret this correctly, I think I finally get a sense of how the PyTorch core team thinks about production deployment for 1.0, from the creator himself:

the shortcoming of ONNX is that it is a standard, and the standard hasn't developed fast enough to cover everything, which means not all PyTorch models can be exported; more complicated models were turning out not to be ONNX-exportable. That is why they wanted a robust solution, and that's the 1.0 story: instead of separate steps of exporting and then importing into a different framework, they have squished all those steps into one thing you can do in the hybrid frontend, and that does sound ideal.

I just want to put this out there so that, in case anyone bumps into a similar issue and is left in the dark, this might be useful. Thank you for your attention.

ptrblck commented 5 years ago

Hi @cedrickchee, I would like to correct that: I'm not Peter Goldsborough from the PyTorch core dev team, who did the awesome work on libtorch and created the tutorials. I'm just a random guy moderating the forum (together with other great guys) and working with/on PyTorch in my spare time. ;)

Best, ptrblck

cedrickchee commented 5 years ago

Hi @ptrblck,

Oops, sorry. I stand corrected.

Cheers, Cedric