dotnet / iot

This repo includes .NET Core implementations for various IoT boards, chips, displays and PCBs.
MIT License
2.15k stars 581 forks source link

Cargo causes GpioController to hang on Raspberry Pi 5 Bookworm #2332

Closed skipfire closed 3 weeks ago

skipfire commented 1 month ago

Describe the bug

On a Raspberry Pi 5 with a .NET project that uses GpioController, new GpioController() hangs after installing cargo. The same image does not have a problem on other versions of Pi hardware. I found removing llvm-14-tools restores the functionality, but installing that package by itself does not cause the problem.

Steps to reproduce

# Install .NET
wget https://dot.net/v1/dotnet-install.sh -O dotnet-install.sh
bash ./dotnet-install.sh
echo 'export PATH=$PATH:$HOME/.dotnet' >> ~/.bashrc
echo 'export DOTNET_ROOT=$HOME/.dotnet' >> ~/.bashrc
source ~/.bashrc

# Make test project
dotnet new console -n cargotest
cd cargotest
dotnet add package System.Device.Gpio --version 3.2.0
nano Program.cs

Put the following content into Program.cs

using System.Device.Gpio;
Console.WriteLine("before new");
var controller = new GpioController(PinNumberingScheme.Logical);
Console.WriteLine("after new");
controller.Dispose();
Console.WriteLine("exiting");

Test the behaviors

# do the testing
# it should work first try
dotnet run .
sudo apt install -y cargo
# it should fail on this try, cancel once you see it is not moving forward
dotnet run .
sudo apt remove -y llvm-14-tools
# it should work on this last try
dotnet run .

Expected behavior

new GpioController() should not hang with cargo installed (or any non-dotnet package), in the example all 3 Console.WriteLine calls should execute.

Actual behavior

Only the first Console.WriteLine before new GpioController() executes and then it hangs without continuing or error.

Versions used

Runtime Environment: OS Name: debian OS Version: 12 OS Platform: Linux RID: linux-arm64 Base Path: /home/genmonpi/.dotnet/sdk/8.0.303/

.NET workloads installed: There are no installed workloads to display.

Host: Version: 8.0.7 Architecture: arm64 Commit: 2aade6beb0

.NET SDKs installed: 8.0.303 [/home/genmonpi/.dotnet/sdk]

.NET runtimes installed: Microsoft.AspNetCore.App 8.0.7 [/home/genmonpi/.dotnet/shared/Microsoft.AspNetCore.App] Microsoft.NETCore.App 8.0.7 [/home/genmonpi/.dotnet/shared/Microsoft.NETCore.App]

Other architectures found: None

Environment variables: DOTNET_ROOT [/home/genmonpi/.dotnet]

global.json file: Not found

Learn more: https://aka.ms/dotnet/info

Download .NET: https://aka.ms/dotnet/download

krwq commented 1 month ago

[Triage] Can you try explicitly using libgpiod (new GpioController(PinNumberingScheme.Logical, new LibGpiodDriver())) and sysfs driver (new GpioController(PinNumberingScheme.Logical, new SysFsDriver())) and see if that repros for both? That seems very weird - another thing you can try is reinstalling dotnet and see if maybe that helps - please keep us updated - it would be good to figure out what's causing the hang

skipfire commented 1 month ago

I can try the libgpiod later today. I have tried on multiple fresh installs with dotnet both before and after the cargo install and it made no difference so this isn't a single install issue. The repro steps I provided I built with a fresh image to make it as simple as possible for someone else to be able to reproduce.

skipfire commented 1 month ago

@krwq gave it a shot and LibGpiodDriver failed the same as no driver, however SysFsDriver did pass the constructor. I haven't tried it with any of the rest of the functionality I would need, but the constructor didn't hang up.

Note for anyone trying to replicate, add using System.Device.Gpio.Drivers; to the initial repro code.

pgrawehr commented 1 month ago

Try with new LibGpiodDriver(4), that's the correct setting for Rpi5.

skipfire commented 1 month ago

@pgrawehr LibGpioDriver(4) fails as well.

skipfire commented 1 month ago

@pgrawehr @krwq I have found the problem, there are a pair of circular symbolic links in an installed library and it's getting stuck in an infinite loop in LibGpiodDriverFactory.GetFiles where it callsdirectoriesToProcess.Push(subdirectory); as part of looking for libgpiod.so.*. I'm sure that ignoring all symbolic links would cause other problems and DirectoryInfo.LinkTarget is not available with netstandard2.0. With .NET 8.0 both were showing a LinkTarget of "..".

sudo find -L /lib -type l
find: File system loop detected; ‘/lib/llvm-14/build/Release’ is part of the same file system loop as ‘/lib/llvm-14’.
find: File system loop detected; ‘/lib/llvm-14/build/Debug+Asserts’ is part of the same file system loop as ‘/lib/llvm-14’.

Any suggestions?

pgrawehr commented 1 month ago

Does that also happen if you force it to Libgpiod v1? There should be a special constructor for this.

skipfire commented 1 month ago

Can you provide a bit of code of what you want me to try because from what I see this is before the version matters? This error comes from the first line in LibGpiodDriver when it needs LibGpiodDriverFactory.Instance in the call LibGpiodDriverFactory.VersionedLibgpiodDriver versionedLibgpiodDriver = LibGpiodDriverFactory.Instance.Create(gpioChip);.

pgrawehr commented 1 month ago

@skipfire Sorry, I was on my phone only. I was thinking of using the public LibGpiodDriver(int gpioChip, LibGpiodDriverVersion driverVersion) constructor, but it seems this is calling the problematic factory constructor as well, thus getting the same issue. Looks like you need to either use SysFs or manually remove that circular link until this issue is resolved (which might not be so easy). Thanks for finding that problem.

joperezr commented 1 month ago

FYI @huesla as this is a bug of on the code that is trying to find the right version of libgpiod from different folders.

@skipfire Thank you so much for the great report and for following up and finding the culprit. We discussed during triage with @raffaeler and @pgrawehr and we think that this code shouldn't be there to begin with, as ideally we should be relying on NativeLibrary probing paths as opposed to us doing the probing ourselves. We will consider replacing this code in the next major version. In the meantime, a potential workaround would be to add a HashSet to mark all of the directories that have been visited (or pushed to directoriesToProcess) and to do a quick check before each time we are going to push to ensure that this directory is not in the visited (in order to prevent loops). This would work as a temporary mitigation of the issue. @skipfire (or @huesla) does that solution sound good to you? If so, would you be willing to contribute a PR with it? Thanks again for all of your help with this!

skipfire commented 1 month ago

@joperezr I am willing to contribute and I tried your suggestion already, but the directory keeps extending so it isn't seen as the same. Given /lib/myfolder/mysubfolder -> /lib/myfolder Will will get /lib/myfolder/mysubfolder/myfolder/mysubfolder/myfolder etc and I could not find a way to determine that I had already visited that folder with .NETStandard 2.0 compatibility. If .NET Standard 2.0 compatibility wasn't a factor then with .NET 5+ the LinkTarget is available and would make it possible.

raffaeler commented 1 month ago

@skipfire My guess is that the path is always computed as relative instead of keeping a root path and adding the relative path for each visited library. Since you are already debugging it, you should be able to print a debug trace for each symbolic link and see how the path is resolved. If it sounds confusing, you may probably post here a verbose log of the path being visited, the symbolic link and the new path being constructed.

Thanks!

skipfire commented 1 month ago

@raffaeler I understand that, however the symbolic link detail is not available in .NET Standard 2.0 so the data is not available to the runtime. Trying to access the symbolic data generates a compile exception because of the .NET Standard 2.0 build requirement of the project.

raffaeler commented 1 month ago

@skipfire I understand that, but you can still use reflection (or dyanamic) for debugging purposes ...

skipfire commented 1 month ago

Which I did in a test harness and it keeps appending folders with no visible reference to the real path.

skipfire commented 1 month ago

Here is one example, /lib/llvm-14/build/Release/ is /lib/llvm-14/

/lib/llvm-14/build $ ls -al
total 16
drwxr-xr-x 4 root root 4096 Jul 19 21:01 .
drwxr-xr-x 7 root root 4096 Jul 19 21:01 ..
lrwxrwxrwx 1 root root    2 Feb 17  2023 Debug+Asserts -> ..
lrwxrwxrwx 1 root root   10 Feb 17  2023 include -> ../include
lrwxrwxrwx 1 root root    6 Feb 17  2023 lib -> ../lib
lrwxrwxrwx 1 root root    2 Feb 17  2023 Release -> ..
lrwxrwxrwx 1 root root    8 Feb 17  2023 share -> ../share
drwxr-xr-x 2 root root 4096 Feb 17  2023 unittests
drwxr-xr-x 3 root root 4096 Jul 19 20:58 utils

DirectoryInfo for /build/llvm-14/build/Release with .NET 8.0 provides some data, but I cannot find any property, even in the debug view, that says /lib/llvm-14/.

{/lib/llvm-14/build/Release}
    Attributes: ReadOnly | Directory | ReparsePoint
    CreationTime: {17/02/2023 05:57:29}
    CreationTimeUtc: {17/02/2023 11:57:29}
    Exists: true
    Extension: ""
    FullName: "/lib/llvm-14/build/Release"
    FullPath: "/lib/llvm-14/build/Release"
    LastAccessTime: {19/07/2024 21:00:29}
    LastAccessTimeUtc: {20/07/2024 02:00:29}
    LastWriteTime: {17/02/2023 05:57:29}
    LastWriteTimeUtc: {17/02/2023 11:57:29}
    LinkTarget: ".."
    Name: "Release"
    OriginalPath: "/lib/llvm-14/build/Release"
    Parent: {/lib/llvm-14/build}
    Root: {/}
    UnixFileMode: OtherExecute | OtherRead | GroupExecute | GroupRead | UserExecute | UserWrite | UserRead
raffaeler commented 1 month ago

@skipfire A quick workaround could be to declare the pinvoke for realpath in libc which return the full canonicalized path name

raffaeler commented 1 month ago

The realpath declaration is the following: https://github.com/dotnet/runtime/blob/4fbd498cc25ca9ca2ca6cac4764866bf08d1988f/src/libraries/Common/src/Interop/Unix/System.Native/Interop.RealPath.cs#L4

pgrawehr commented 1 month ago

I'm confused, as I cannot reproduce the issue.

pi@raspberrypi:/lib $ sudo find -L /lib -type l
/lib/aarch64-linux-gnu/qt-default/qtchooser/default.conf
find: Dateisystemschleife erkannt; ‘/lib/chromium-browser/libs’ ist ein Teil der gleichen Schleife wie ‘/lib/chromium-browser’.

So there's a loop and I've added traces to see how the path builds up, but it appears to abort automatically, maybe after the length reaches a certain limit.

skipfire commented 1 month ago

@pgrawehr Is that after installing cargo and running on a Pi 5 with Bookworm? This does not happen on other Pi models or before Bookworm.

pgrawehr commented 1 month ago

In my case, the loop is apparently generated by chromium, but this shouldn't make a difference, should it? I'm running on a Pi4 with Bookworm, which also should behave the same here.

skipfire commented 1 month ago

The Chromium loop wasn't breaking for me either even though it existed. The problem is specific to a Pi 5, I tried on a Pi 3A+, 3B, 3B+, Pi 4, Zero 2, and Pi 5 and Pi 5 was the only spot it didn't work.

pgrawehr commented 1 month ago

That is really confusing then (and makes no sense). Can we just limit the recursion depth to maybe 6 or so levels? The files we're looking for shouldn't be so deeply nested. As noted above, we just need a quick-and-dirty hack for now and we should remove this entire search method later.

huesla commented 1 month ago

Hi, I have just read this topic and hope to find some time on the weekend to look into this. Since you are that deep into the code, you might just skip the factory and explicitly instantiate the desired versioned driver temporarily until there is a fix. I hope this piece of code gets replaced soon...

huesla commented 1 month ago

Are there any valid reasons though for symlink loops to exist?

skipfire commented 1 month ago

@huesla specifying a driver doesn't skip the problem code, it's in the factory to get the right chipset which is called even if you pass a specific driver.

huesla commented 1 month ago

Sorry I meant in the constructor of LibGpioDriver.cs Instead of calling the factory, directly instantiate the versioned driver (depending on the libgpiod version installed on your system)

For example

... _driver = new LibGpiodV1Driver(chipNumber); ...

skipfire commented 1 month ago

@huesla that was one of the recommendations that was tried and it doesn't avoid the code. The problem is in GetFiles which is called by GetInstalledLibraries which is called by LibGpiodDriverFactory which is called by Instance which is called by LibGpiodDriver(int gpioChip) regardless of chip version.

huesla commented 1 month ago

I think the previous recommendation was to pass the LibGpiodDriverVersion enum value to:

public LibGpiodDriver(int gpioChip, LibGpiodDriverVersion driverVersion)
{
    LibGpiodDriverFactory.VersionedLibgpiodDriver versionedLibgpiodDriver = LibGpiodDriverFactory.Instance.Create(gpioChip, driverVersion);
    _driver = versionedLibgpiodDriver.LibGpiodDriver;
    Version = versionedLibgpiodDriver.DriverVersion;
}

which would call the problematic factory, as you correctly stated.

What I meant was for example:

public LibGpiodDriver(int gpioChip, LibGpiodDriverVersion driverVersion)
{
    // LibGpiodDriverFactory.VersionedLibgpiodDriver versionedLibgpiodDriver = LibGpiodDriverFactory.Instance.Create(gpioChip, driverVersion);
    _driver = new LibGpiodV1Driver(gpioChip);
    Version = LibGpiodDriverVersion.V1;
}

This entirely skips the factory, which is only needed to automatically choose the appropriate LibGpiodDriver version depending on the system. @joperezr please contact me if you need support in switching to NativeLibrary or so...

pgrawehr commented 1 month ago

Are there any valid reasons though for symlink loops to exist?

@huesla I can't think of any, but unfortunately, that's how it is. These loops are created by external packages, even well-known ones such as the chromium browser. We can't change when some third-party packages do stupid things.

skipfire commented 1 month ago

Posted PR #2335 to address this.