dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Linux native binary portability issues #218

Closed dsyme closed 3 years ago

dsyme commented 3 years ago

This is a tracking issue for Linux binary compatibility issues in the native code component.

Currently we build it on Ubuntu 18.04.


Original issue:

The libLibTorchSharp.so we create has a GLIBC_2.30 dependency. I don't fully understand why; it could be as simple as the fact that we build on an Ubuntu 20.04 machine. Either way, it would be good to control the dependency better.

By default, Ubuntu 18.04 only has GLIBC 2.27.

On an Ubuntu 18.04 Colab machine:

ldd /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so

gnu/libpthread.so.0: version `GLIBC_2.30' not found (required by /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so)
/root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /root/.nuget/packages/torchsharp/0.91.52458/runtimes/linux-x64/native/libLibTorchSharp.so)
    linux-vdso.so.1 (0x00007fffcdd83000)
    /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 (0x00007f80bad9d000)
    libtorch.so => not found
    libc10.so => not found
    libtorch_cpu.so => not found
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f80bab7e000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f80ba7f5000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f80ba457000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f80ba23f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f80b9e4e000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f80bb00d000)
    libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f80b9c33000)
    liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f80b9a0d000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f80b9809000)
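The `ldd` error only names the versions it fails to find. To see the newest glibc or libstdc++ symbol version a library actually requires, one option is to read its dynamic symbol table with `objdump -T` (GNU binutils) and keep the highest version per prefix. This is a minimal sketch, assuming bash and GNU coreutils (`sort -V`); `highest_version` is a hypothetical helper, not part of TorchSharp's tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the highest symbol version with the given prefix
# (e.g. GLIBC_ or GLIBCXX_) found in text on stdin, such as `objdump -T` output.
highest_version() {
  local prefix="$1"
  # Keep only numeric version tags (names like GLIBC_PRIVATE are skipped),
  # then use GNU version sort so that 2.2.5 < 2.17 < 2.30.
  grep -oE "${prefix}[0-9]+(\.[0-9]+)*" |
    sort -u -V |
    tail -n 1
}

# Usage (library path is an example):
# objdump -T libLibTorchSharp.so | highest_version GLIBC_
# objdump -T libLibTorchSharp.so | highest_version GLIBCXX_
```

For the build reported above, the `GLIBCXX_` query would be expected to print at least GLIBCXX_3.4.26, since that is one of the versions the Colab machine could not supply.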
dsyme commented 3 years ago

We are now building on Ubuntu 18.04. I will leave this open because I expect problems remain; our build should probably pin down the libc and C++ runtime (libstdc++) versions.

dsyme commented 3 years ago

We still have this issue:

/root/.nuget/packages/torchsharp/0.91.52475/runtimes/linux-x64/native/libLibTorchSharp.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /root/.nuget/packages/torchsharp/0.91.52475/runtimes/linux-x64/native/libLibTorchSharp.so)

I'm looking at https://stackoverflow.com/questions/46809303/how-to-static-linking-to-glibc-in-cmake as a possible solution.

dsyme commented 3 years ago

From a chat with @migueldeicaza:

[22:22] Don Syme: So libtorch is a single set of Linux binaries, and we want to add one binary to it using the same dependency assumptions they use.

On the whole that's what we seem to be doing: we're picking up the CMake settings from the LibTorch distribution, and everything seems generally OK. However, the way we're compiling on Ubuntu 18.04 (with a few updated packages, I think) somehow causes problems when the binary runs on other Ubuntu 18.04 machines. PyTorch says this:

PyTorch is supported on Linux distributions that use glibc >= v2.17, which include the following:

    Arch Linux, minimum version 2012-07-15
    CentOS, minimum version 7.3-1611
    Debian, minimum version 8.0
    Fedora, minimum version 24
    Mint, minimum version 14
    OpenSUSE, minimum version 42.1
    PCLinuxOS, minimum version 2014.7
    Slackware, minimum version 14.2
    Ubuntu, minimum version 13.04
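One way to hold the native build to the same glibc >= 2.17 baseline PyTorch documents would be a CI step that fails when the highest `GLIBC_` version the library requires is newer than 2.17. This is a hedged sketch, not something the issue describes: `version_le` and `check_glibc_baseline` are hypothetical names, and the comparison assumes bash and GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Succeed if dotted version $1 <= $2, using GNU sort's version ordering.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical CI guard: fail when a library's highest required GLIBC_
# version is newer than the baseline (default 2.17, the minimum glibc
# PyTorch states it supports).
check_glibc_baseline() {
  local required="$1"          # e.g. 2.30, read off `objdump -T` output
  local baseline="${2:-2.17}"
  if version_le "$required" "$baseline"; then
    echo "ok: GLIBC_${required} <= ${baseline}"
  else
    echo "too new: GLIBC_${required} > ${baseline}" >&2
    return 1
  fi
}
```

The `required` value would come from inspecting the built libLibTorchSharp.so. For the build in the original report it would have been 2.30, which this guard rejects.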

My specific problem is that binaries built with Ubuntu 18.04 on Azure CI don't run on the Ubuntu 18.04 containers used by Google Colab.

I determined the approximate reason. The Ubuntu 18.04 CI machines in Azure must have some updated packages; for example, they have an updated /usr/lib/x86_64-linux-gnu/libstdc++.so.6, and running

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep LIBCXX

gives

...
GLIBCXX_3.4.24
GLIBCXX_3.4.25 
GLIBCXX_3.4.26 
GLIBCXX_3.4.27 
GLIBCXX_3.4.28 

Whereas the Colab containers only report:

...
GLIBCXX_3.4.24
GLIBCXX_3.4.25 

I guess I should be using a more controlled Ubuntu 18.04 container for the build.
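The mismatch comes down to a membership test: a library that needs GLIBCXX_3.4.26 loads only if the host's libstdc++.so.6 defines that exact version tag. A minimal sketch of the check, assuming bash and awk; `provides_version` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the version list on stdin (for example,
# `strings libstdc++.so.6` output) contains the exact tag given in $1.
provides_version() {
  # Compare whole first fields so GLIBCXX_3.4.2 does not match GLIBCXX_3.4.26
  # and stray trailing whitespace is ignored.
  awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Usage:
# strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | provides_version GLIBCXX_3.4.26
```

Given the listings above, this check would fail on the Colab containers (which stop at GLIBCXX_3.4.25) and succeed on the Azure CI machines.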

dsyme commented 3 years ago

I have switched to an Ubuntu 16.04 container from the Microsoft Container Registry (MCR); the list of available containers is here:

https://mcrflowprodcentralus.data.mcr.microsoft.com/mcrprod/dotnet-buildtools/prereqs?P1=1616630425&P2=1&P3=1&P4=xAN4nwxX9ps%2BMi75FMzu0iGuhA7luhLsZKUGf0Q9fFU%3D&se=2021-03-25T00%3A00%3A25Z&sig=j1uhQmj8EAbZqaSyGS%2Fwz0ETxwrGVhN3WFwX4OpNz4w%3D&sp=r&sr=b&sv=2015-02-21

dsyme commented 3 years ago

#220 and related work, pushed directly to "master", fix this problem.