Closed bokenator closed 1 year ago
I am following the instructions to convert a PyTorch model weight into a TorchScript weight
First, note that we would now recommend using safetensors to export weights from the Python side, as it's much simpler; see this export script.
One thing that can help debug this kind of issue is inspecting the content of the weight file via `cargo run --example tensor-tools ls resnet18.ot`. Looking at your file, it indeed seems that the batch-norm running mean and variance are not included. I'm not sure where your resnet Python definition is coming from; is it possible that it's set up with `track_running_stats=False` or something like this (python doc)?
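The `tensor-tools ls` check above can also be done programmatically. A minimal sketch (the helper name and the example tensor names are illustrative, not taken from tensor-tools or an actual weight file) that flags batch-norm layers whose running statistics are missing from an exported file:

```python
def check_bn_stats(names, bn_prefixes):
    """For each batch-norm layer prefix, list the running-stat tensors
    that are absent from the exported tensor names."""
    names = set(names)
    missing = {}
    for prefix in bn_prefixes:
        # A properly exported batch-norm layer carries both running stats.
        absent = [stat for stat in ("running_mean", "running_var")
                  if f"{prefix}.{stat}" not in names]
        if absent:
            missing[prefix] = absent
    return missing


# Toy example: the exported file has the BN weight/bias but no running stats,
# which matches the symptom described in this issue.
exported = ["conv1.weight", "bn1.weight", "bn1.bias"]
print(check_bn_stats(exported, ["bn1"]))
```

Feeding it the names printed by `tensor-tools ls` would confirm at a glance which layers lost their statistics during conversion.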
I played around with it some more. The `running_mean` and `running_var` are definitely loaded by the model, because I can print their values out. It seems that `torch.jit.script` is either not converting them into the TorchScript representation, or `torch_script_module.save` is not saving them into the `.ot` file.
The resnet definition is straight from `torchvision.models`. Loading the weights manually with `load_state_dict` and instantiating the model directly with `models.resnet18(pretrained=True)` yield the same result.
I also tried exporting using safetensors, but I'm seeing the following errors when running `tensor-tools` against the output files:
```
Error: Internal torch error: PytorchStreamReader failed reading zip archive: failed finding central directory
Exception raised from valid at ../caffe2/serialize/inline_container.cc:184 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7efc4ec5a6bb in /opt/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7efc4ec555ef in /opt/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x3ca (0x7efc529571fa in /opt/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::init() + 0xad (0x7efc52957a0d in /opt/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x13a (0x7efc5295b1ea in /opt/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) + 0x28d (0x7efc53b1955d in /opt/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) + 0x92 (0x7efc53b199f2 in /opt/libtorch/lib/libtorch_cpu.so)
frame #7: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) + 0xd1 (0x7efc53b19b21 in /opt/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xddd59 (0x559de2652d59 in target/debug/examples/tensor-tools)
frame #9: <unknown function> + 0x6917f (0x559de25de17f in target/debug/examples/tensor-tools)
frame #10: <unknown function> + 0x64cd1 (0x559de25d9cd1 in target/debug/examples/tensor-tools)
frame #11: <unknown function> + 0x596cb (0x559de25ce6cb in target/debug/examples/tensor-tools)
frame #12: <unknown function> + 0x5cd6e (0x559de25d1d6e in target/debug/examples/tensor-tools)
frame #13: <unknown function> + 0x5cd31 (0x559de25d1d31 in target/debug/examples/tensor-tools)
frame #14: <unknown function> + 0x37643c (0x559de28eb43c in target/debug/examples/tensor-tools)
frame #15: <unknown function> + 0x5cd0a (0x559de25d1d0a in target/debug/examples/tensor-tools)
frame #16: <unknown function> + 0x666be (0x559de25db6be in target/debug/examples/tensor-tools)
frame #17: <unknown function> + 0x29d90 (0x7efc4ea01d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: __libc_start_main + 0x80 (0x7efc4ea01e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: <unknown function> + 0x2f545 (0x559de25a4545 in target/debug/examples/tensor-tools)
```
Did you ensure that the safetensors file is named with a `.safetensors` suffix? That suffix is what triggers using safetensors for decoding (a bit sad that we have this kind of implicit magic, but anyway).
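That implicit dispatch can be pictured as extension-based format selection. The sketch below is hypothetical (the function name and format labels are mine, not tensor-tools' actual code), but it illustrates why a safetensors file saved without the `.safetensors` suffix ends up being parsed as a TorchScript zip archive and fails with "failed finding central directory":

```python
from pathlib import Path

def guess_weight_format(path):
    """Pick a decoder from the file extension alone, mirroring the
    suffix-based detection described above (illustrative only)."""
    suffix = Path(path).suffix
    if suffix == ".safetensors":
        return "safetensors"
    if suffix in (".ot", ".pt"):
        # TorchScript archives are zip files; reading a safetensors file
        # through this path raises the PytorchStreamReader error.
        return "torchscript-zip"
    return "unknown"


print(guess_weight_format("resnet18.safetensors"))
print(guess_weight_format("resnet18.ot"))
```

So renaming the exported file to end in `.safetensors` is enough to route it to the right decoder.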
That's the magic we needed!
I will open a PR over the weekend to update the tutorial for weight conversion.
Thanks so much for your help!
I am following the instructions to convert a PyTorch model weight into a TorchScript weight, but when I try to load the weights, I get the following error:
The error message may point at a different layer each time it's run, but it always seems to be associated with a batch-norm layer. For example, I also get these error messages:
or
The model definition itself seems to be correct, because I'm able to load the `resnet18.ot` file from the release files and run inference. However, the weights that I tried to convert myself are not working. Here's the Python function I wrote to do the conversion; I have attempted both the trace method and the annotation method:
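(The function itself is not reproduced here. For context, a typical trace-based export looks roughly like the sketch below; this is not the author's actual function, and it uses a tiny stand-in module with a batch-norm layer rather than torchvision's resnet18 so it stays self-contained.)

```python
import torch
import torch.nn as nn

# Toy stand-in for resnet18: the one property that matters for this issue
# is the presence of a BatchNorm2d layer with running statistics.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)
model.eval()  # freeze batch-norm statistics for tracing

# Trace with a dummy input and save the TorchScript module.
example = torch.rand(1, 3, 32, 32)
traced = torch.jit.trace(model, example)
traced.save("model.ot")

# The running stats are buffers, so they should survive the round trip.
reloaded = torch.jit.load("model.ot")
names = [name for name, _ in reloaded.named_buffers()]
print(names)
```

Checking `named_buffers()` on the reloaded module is a quick way to confirm whether `running_mean` and `running_var` made it into the saved file.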
For reference, you can download the problematic weights at https://ml47.s3.amazonaws.com/resnet18.ot.
Any suggestions would be greatly appreciated.