ApolloAuto / apollo

An open autonomous driving platform
Apache License 2.0

libtorch model load failed #14894

Open FlyingAnt2018 opened 1 year ago

FlyingAnt2018 commented 1 year ago

System information

Steps to reproduce the issue:

Supporting materials (screenshots, command lines, code/script snippets):

Snippet 1. Code to export the JIT model

import torch
import torchvision
# An instance of your model.
model = torchvision.models.resnet18()
# An example input you would normally provide to your model's forward() method.
example = torch.rand(1, 3, 224, 224)
# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, example)  # or: torch.jit.script(model)
output = traced_script_module(torch.ones(1, 3, 224, 224))
print(output[0, :5])
traced_script_module.save("traced_resnet_model.pt")
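Before moving to C++, it can help to sanity-check the exported file with the standard library alone: `torch.jit.save` writes a ZIP archive, so a file that is not a valid ZIP (a Git LFS pointer, an HTML error page from a broken download, a truncated copy) will make `torch::jit::load` complain about the data format. This is only a rough sketch; the helper name `is_torchscript_archive` is mine, not part of any library:

```python
import sys
import zipfile


def is_torchscript_archive(path):
    """Rough check: torch.jit.save produces a ZIP archive whose members
    include a data.pkl (plus, typically, constants.pkl and a code/ tree)."""
    try:
        if not zipfile.is_zipfile(path):
            return False
        with zipfile.ZipFile(path) as zf:
            return any(n.endswith("data.pkl") for n in zf.namelist())
    except OSError:
        # Missing or unreadable file is just as much a "bad format" case.
        return False


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "traced_resnet_model.pt"
    print(path, "looks like a TorchScript archive:", is_torchscript_archive(path))
```

Running this on the file in both environments (the standalone demo and the Apollo tree) would quickly rule out a corrupted or substituted model file.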

Snippet 2. Code to run the C++ example

#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <memory>
#include <cstring>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <torch/script.h>
#include <numeric>
#include <algorithm>
#include <fstream>

using namespace std;
bool fileExists(const std::string& filename){
    std::ifstream file(filename);
    return file.good();
}
int test(const std::string& model_path){
    if(!fileExists(model_path)){
        std::cout << " file not found " << std::endl;
        return -1;
    }
    torch::Device device_(torch::kCPU);
    torch::jit::script::Module module;// = torch::jit::load(model_path);
    if (torch::cuda::is_available()) {
        std::cout << "CUDA is available" << std::endl;
        device_ = torch::Device(torch::kCUDA);
    }
    try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        module = torch::jit::load(model_path, device_);
    }
    catch (const c10::Error& e) {
        std::cout<< e.what() << std::endl;
        std::cerr << "error loading the model\n";
        return -1;
    }
    // Create a vector of inputs.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));

    // Execute the model and turn its output into a tensor.
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << '\n';
    return 0;
}
int main(int argc, char* argv[])
{
    std::string model_path("traced_resnet_model2.pt");
    std::cout << "model path= " << model_path << std::endl;
    test(model_path);
    return 0;
}

Snippet 3. CMakeLists.txt

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(custom_ops)
set(CMAKE_CXX_STANDARD 14)

#link_directories(/usr/local/libtorch_naive/lib)

link_directories(/usr/local/libtorch/lib)
include_directories(/usr/local/libtorch/include)
include_directories(/usr/local/libtorch/include/torch/csrc/api/include)

add_executable(example-app test_resnet.cpp)
target_link_libraries(example-app
torch
torch_cpu
torch_python
torch_global_deps
c10
)
set_property(TARGET example-app PROPERTY CXX_STANDARD 14)
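As an aside, the manual `link_directories`/`include_directories` approach above bypasses the compile flags libtorch was built with (notably the `_GLIBCXX_USE_CXX11_ABI` setting), which can yield binaries that link but misbehave at runtime. A sketch of the `find_package`-based setup recommended by the PyTorch C++ documentation, assuming libtorch is unpacked at /usr/local/libtorch:

```cmake
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(custom_ops)

# find_package(Torch) exports the include paths, the library list
# (TORCH_LIBRARIES), and the compile flags (TORCH_CXX_FLAGS) that the
# libtorch binaries were actually built with.
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_executable(example-app test_resnet.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 14)
```

Configure with `cmake -DCMAKE_PREFIX_PATH=/usr/local/libtorch ..` so CMake can locate TorchConfig.cmake.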

When I copy snippet 2 into the Apollo prediction module, an error occurs at the model-load stage reporting that the data format is not supported. But I have used the same libtorch library and the same JIT model in both the Apollo environment and the standalone snippet 2 build, so I guess this may be caused by an Apollo setting that affects the libtorch runtime environment. I need your help, please. A debug image with details is here: https://discuss.pytorch.org/uploads/default/original/3X/c/b/cb7278cd55cf4c584d1d6c94ab6172864a36900c.jpeg The left half of the image shows the variable state in runtime debug mode, and the right half shows the snippet 2 demo running, whose `module` variable is instantiated successfully.

daohu527 commented 1 year ago

Are you sure it is caused by torch::jit::load? The prediction module also uses torch::jit::load, so I think this may be normal.

FlyingAnt2018 commented 1 year ago

Are you sure it is caused by torch::jit::load? The prediction module also uses torch::jit::load, so I think this may be normal.

My Apollo repo has been changed to build the full system with CMake. The change is functionally equivalent, yet all models fail to load. Maybe the bug is introduced by the CMake setup.

daohu527 commented 1 year ago

We can't give much advice on this issue. I think you should start with the error message; Apollo currently has no plans to support CMake.

FlyingAnt2018 commented 1 year ago

Thanks for your kindness. I am trying to align with the Bazel build to locate the error.


FlyingAnt2018 commented 1 year ago

We can't give much advice on this issue. I think you should start with the error message; Apollo currently has no plans to support CMake.

Hi. I wonder if there is any special trick when exporting the JIT version of a torch model. I successfully ran the official Apollo code; when I debugged the prediction module I found that all existing models loaded successfully, and the instance "torch_vehiclemodel" there looks like picture 1 below. But my JIT model's C++ instance (refer to the official implementation here) looks like picture 2 below.

In the two pictures, the official Apollo model instance's member variable "torch_vehiclemodel.ivalue.target.slots" is a list type, but my resnet JIT model instance's member variable "slots_" in picture 2 looks like a std::_Vector_base type. I guess there are special operations when exporting a pytorch model to the JIT version.
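Since the comparison here is between a model that loads and one that doesn't, one low-level check that needs no debugger is to diff the archive layouts of the two .pt files; both filenames below are placeholders for an Apollo model that loads fine and the resnet export from snippet 1. A minimal sketch using only the stdlib zipfile module (the helper `archive_layout` is mine):

```python
import zipfile


def archive_layout(path):
    """Return the sorted member names of a TorchScript ZIP archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    # Placeholder names: substitute real paths to the two models.
    for p in ("apollo_model.pt", "traced_resnet_model.pt"):
        try:
            members = archive_layout(p)
        except OSError:
            print(p, "-> not found")
            continue
        print(p)
        for name in members:
            print("  ", name)
```

A model saved by a very different PyTorch version, or by plain `torch.save` instead of `torch.jit.save`, shows up immediately as a different member list.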