Cambridge-ICCS / FTorch

A library for directly calling PyTorch ML models from Fortran.
https://cambridge-iccs.github.io/FTorch/
MIT License

Potential pytorch incompatibility #37

Open ElliottKasoar opened 1 year ago

ElliottKasoar commented 1 year ago

This is not an issue I've encountered myself, but for a user who follows the FTorch build instructions, the installed libtorch/pytorch version may make FTorch incompatible with the model saved in the examples, because the examples pip-install torch into a (new) virtual environment.

This would only lead to errors if breaking changes were made to the TorchScript format between the versions, and in many cases the same pip-installed torch would be used anyway.
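One way to make such a mismatch diagnosable after the fact would be to record the exporting torch version in a sidecar file next to the saved model. A minimal sketch in plain Python; the helper names and the `.version.json` sidecar convention are hypothetical, not part of FTorch:

```python
import json
from pathlib import Path


def write_version_stamp(model_path: str, torch_version: str) -> None:
    """Record the torch version used to export a TorchScript model.

    The stamp is written next to the saved model (e.g. 'model.version.json'
    for 'model.pt') so that a mismatch with the LibTorch used at load time
    can be diagnosed later.
    """
    stamp = Path(model_path).with_suffix(".version.json")
    stamp.write_text(json.dumps({"torch_version": torch_version}))


def read_version_stamp(model_path: str) -> str:
    """Return the torch version recorded when the model was exported."""
    stamp = Path(model_path).with_suffix(".version.json")
    return json.loads(stamp.read_text())["torch_version"]
```

At export time one would pass `torch.__version__` as the second argument; the recorded value can then be compared against the LibTorch build that FTorch links to.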

jatkinson1000 commented 1 year ago

Mmmmm this is a good point - I can see it being an issue if someone uses newer features of PyTorch in their model, but then runs FTorch linked against an older version of LibTorch without those features.

I can't immediately think of an easy way around this other than recommending that users ensure that their LibTorch version is at least as new as the one their model was built with.
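That recommendation amounts to a simple ordering check on version strings. A minimal sketch (these helpers are illustrative, not an FTorch API, and assume plain 'major.minor.patch' strings with an optional local suffix such as '+cpu'):

```python
def version_tuple(version: str) -> tuple:
    """Parse a release version string such as '1.11.0+cpu' into (1, 11, 0).

    Any local build suffix after '+' (e.g. '+cpu', '+cu117') is dropped.
    Pre-release strings such as '2.1.0a0' are not handled.
    """
    return tuple(int(part) for part in version.split("+")[0].split("."))


def libtorch_new_enough(libtorch_version: str, model_torch_version: str) -> bool:
    """True if LibTorch is at least as new as the torch that saved the model."""
    return version_tuple(libtorch_version) >= version_tuple(model_torch_version)
```

For example, `libtorch_new_enough("1.9.0", "2.0.1")` is False, consistent with the failures reported in this thread when older LibTorch builds load a model saved by a newer torch.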

I don't think we need to change anything code-wise, but it would be interesting to know what the error raised would be so that we can recognise this in future should users come across it.

ElliottKasoar commented 12 months ago

In terms of LibTorch, I don't think users can go too far back, as I get errors of the following form when trying to build FTorch:

/home/ek/ICCS/fortran-pytorch-lib/fortran-pytorch-lib/ctorch.cpp:235:18: error: ‘synchronize’ is not a member of ‘torch::cuda’
  235 |     torch::cuda::synchronize();

when running make for versions <= 1.7 (which is probably worth noting in itself).

For versions between 1.8 and 1.10, I can build FTorch successfully, but encounter errors when going through the example of the form:

 ./resnet_infer_fortran ../saved_resnet18_model_cpu.pt
[ERROR]: terminate called after throwing an instance of 'c10::Error'
  what():  isTuple()INTERNAL ASSERT FAILED at "../aten/src/ATen/core/ivalue_inl.h":1306, please report a bug to PyTorch. Expected Tuple but got String
Exception raised from toTuple at ../aten/src/ATen/core/ivalue_inl.h:1306 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd324cac302 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd324ca8c9b in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7fd324ca918e in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libc10.so)
frame #3: <unknown function> + 0x3877287 (0x7fd315c50287 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x3878325 (0x7fd315c51325 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::SourceRange::highlight(std::ostream&) const + 0x36 (0x7fd313327e06 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::ErrorReport::what() const + 0x2c5 (0x7fd313308b85 in /home/ek/ICCS/libtorch-1.9/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4664 (0x7fd3251d9664 in /home/ek/lib/test/lib/libftorch.so)
frame #8: __ftorch_MOD_torch_module_load + 0x9 (0x7fd3251dcd99 in /home/ek/lib/test/lib/libftorch.so)
frame #9: <unknown function> + 0x17d8 (0x56364eecb7d8 in ./resnet_infer_fortran)
frame #10: <unknown function> + 0x117f (0x56364eecb17f in ./resnet_infer_fortran)
frame #11: __libc_start_main + 0xf3 (0x7fd324d27083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11be (0x56364eecb1be in ./resnet_infer_fortran)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fd324f15d4a
#1  0x7fd324f14ee5
#2  0x7fd324d4608f
#3  0x7fd324d4600b
#4  0x7fd324d25858
#5  0x7fd3122908d0
#6  0x7fd31229c37b
#7  0x7fd31229b358
#8  0x7fd31229bd10
#9  0x7fd3121e7bfe
#10  0x7fd3121e85b9
#11  0x7fd315c4ff49
#12  0x7fd315c51324
#13  0x7fd313327e05
#14  0x7fd313308b84
#15  0x7fd3251d9663
#16  0x7fd3251dcd98
#17  0x56364eecb7d7
#18  0x56364eecb17e
#19  0x7fd324d27082
#20  0x56364eecb1bd
#21  0xffffffffffffffff
Aborted

I think versions 1.11+ work ok all the way through.

(This was tested against torch==2.0.1 installed with pip, Python 3.9.18.)

jatkinson1000 commented 12 months ago

Interesting. The first error is CUDA-related, which suggests that torch is trying to use some GPU routines somewhere, even though you used the CPU-only binary(?). Looking at the line referenced in our code, it is annotated with a FIXME. I'm not clear why the run thinks that the out pointer is_cuda, however!

Perhaps this is just an out-of-date issue and we should require libtorch >= 1.8.


On the latter, the advice here seems to be 'use the latest' -_- A similar issue raises the possibility of CPU/GPU incompatibility.

Perhaps most relevant, another suggestion is that it may be an issue when the model is saved to TorchScript with one version and then run from Fortran with another. This is something that could definitely be tested and would be useful to know; if so, it would need to go into the README, and perhaps mean the 'preferred' approach is to link FTorch against a venv-installed LibTorch.

jatkinson1000 commented 5 months ago

This came up when I was doing work that led to #100.

It should be documented somewhere that the LibTorch and PyTorch versions should match.

jatkinson1000 commented 5 months ago

This should be added to the troubleshooting and/or FAQ documentation.

jatkinson1000 commented 2 months ago

After discussion with @TomMelt, we should put a note in the troubleshooting documentation asking users to use consistent versions, or pointing them to where to check this if they run into issues.

We will tackle this as a small issue in an upcoming hackathon.