ewagner12 / all-ways-egpu

Configure eGPU as primary under Linux Wayland desktops
MIT License

eGPU not enabled after running script #6

Open Faraclas opened 1 year ago

Faraclas commented 1 year ago

I have an RTX 3060 Ti inside a Razer Core X enclosure. This works great when (dual-)booting into Windows; however, there is no display in Linux [Gentoo, systemd, GNOME, Wayland].

I took a look at the service status:

○ all-ways-egpu.service - Configure eGPU as primary under Wayland desktops
     Loaded: loaded (/etc/systemd/system/all-ways-egpu.service; enabled; preset: disabled)
     Active: inactive (dead) since Fri 2022-12-30 10:27:22 EST; 39s ago
   Duration: 75ms
    Process: 1000 ExecStart=all-ways-egpu boot (code=exited, status=0/SUCCESS)
   Main PID: 1000 (code=exited, status=0/SUCCESS)
        CPU: 29ms

Dec 30 10:27:22 gentoo systemd[1]: Started Configure eGPU as primary under Wayland desktops.
Dec 30 10:27:22 gentoo all-ways-egpu[1011]: find: ‘/sys/class/drm/card[0-9]*/card[0-9]*-*/../device/driver’: No such file or directory
Dec 30 10:27:22 gentoo all-ways-egpu[1000]: No eGPU detected
Dec 30 10:27:22 gentoo all-ways-egpu[1014]: /usr/bin/all-ways-egpu: line 199: echo: write error: No such device
Dec 30 10:27:22 gentoo all-ways-egpu[1014]: /usr/bin/all-ways-egpu: line 205: /sys/bus/pci/drivers/i915/unbind: Permission denied
Dec 30 10:27:22 gentoo all-ways-egpu[1014]: /usr/bin/all-ways-egpu: line 206: /sys/bus/pci/devices/0000:0000:00:02.0/remove: No such file or directory
Dec 30 10:27:22 gentoo all-ways-egpu[1014]: /usr/bin/all-ways-egpu: line 212: echo: write error: No such device
Dec 30 10:27:22 gentoo systemd[1]: all-ways-egpu.service: Deactivated successfully.

The first error comes from find ("No such file or directory"). I verified that the following files exist:

elias@gentoo ~ $ ls /sys/class/drm/card0/card0-DP-3/device/device/driver/module/drivers/
pci:i915
elias@gentoo ~ $ ls /sys/class/drm/card1/card1-DP-5/device/device/driver/module/drivers/
pci:nvidia  pci:nvidia-nvswitch
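To cross-check which kernel driver each DRM card is bound to without depending on connector names such as card0-DP-3, something like the following sketch may help. The function name `list_drm_drivers` and its optional directory argument are hypothetical and are not part of all-ways-egpu; the sketch only reads sysfs and changes nothing.

```sh
#!/bin/sh
# Print "<card> <driver>" for every DRM card that has a bound driver.
# The optional argument (default /sys/class/drm) exists only so the
# function can be pointed at a different directory.
list_drm_drivers() {
    drm_dir="${1:-/sys/class/drm}"
    for card in "$drm_dir"/card[0-9]*; do
        name="${card##*/}"
        case "$name" in
            *-*) continue ;;   # skip connector entries such as card0-DP-3
        esac
        # The driver entry is a symlink to the bound driver's directory.
        [ -L "$card/device/driver" ] || continue
        printf '%s %s\n' "$name" "$(basename "$(readlink "$card/device/driver")")"
    done
}
```

Calling `list_drm_drivers` with no argument reads the live system. On the machine described above it would be expected to report i915 for one card and nvidia for the other, though that is not confirmed here.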

I am happy to help debug etc. to get this working. Thank you for your scripts!

ewagner12 commented 1 year ago

Hi, I'm on holiday away from my eGPU until January 3rd, so I'll be able to test this more after that. Just from your description, it seems like a similar issue to #5, where the glob is not expanding properly. Are you using bash or a different shell?

Faraclas commented 1 year ago

Yes, this looks similar. I am using bash.


Faraclas commented 1 year ago

After reading issue #5, I took a look at my script, and it seems that change is already included.

Line 229:

```sh
EGPU_DETECT=0

for CARD in $(lspci -d ::0300 | cut -c -7); do
    set -- /sys/bus/pci/devices/0000:"$CARD"
    for BOOT_VGA_PATH in "$@"; do
        if grep -q "$CARD" < "$USER_IDS_DIR"/egpu-bus-ids; then
            echo "$BOOT_VGA_PATH"  | tee -a "$USER_IDS_DIR"/bind-paths
            mount -n --bind -o ro "$USER_IDS_DIR"/1  "$BOOT_VGA_PATH"/boot_vga
            EGPU_DETECT=1
        else
            if grep -q "1" < "${BOOT_VGA_PATH}"/boot_vga; then
                echo "$BOOT_VGA_PATH"  | tee -a "$USER_IDS_DIR"/bind-paths
                mount -n --bind -o ro "$USER_IDS_DIR"/0 "$BOOT_VGA_PATH"/boot_vga
            fi
        fi
    done
done
```
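For anyone following along: the loop above keys off each device's `boot_vga` attribute in sysfs and bind-mounts a replacement file over it. A minimal read-only sketch for checking the current values, for example before and after the service runs, could look like this. The function name `print_boot_vga` and its optional directory argument are made up for illustration and are not part of all-ways-egpu.

```sh
#!/bin/sh
# Print "<pci-address> <boot_vga value>" for every PCI device that
# exposes a boot_vga attribute. The optional argument (default
# /sys/bus/pci/devices) exists only so another directory can be used.
print_boot_vga() {
    pci_dir="${1:-/sys/bus/pci/devices}"
    for attr in "$pci_dir"/*/boot_vga; do
        [ -r "$attr" ] || continue          # also skips an unmatched glob
        dev="${attr%/boot_vga}"             # strip the attribute name
        printf '%s %s\n' "${dev##*/}" "$(cat "$attr")"
    done
}
```

A value of 1 marks the device the kernel currently treats as the boot VGA device. If the bind mounts in the loop took effect, the eGPU's entry would be expected to read 1 and the iGPU's 0.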

ewagner12 commented 1 year ago

Ok so a couple of things here.

First off, I noticed that the script is trying to find the file "/sys/bus/pci/devices/0000:0000:00:02.0/remove", which fails because there's an extra "0000:". Did you use the guided setup, or did you enter the bus IDs manually? If you enter the IDs manually, they should be in a form like "00:02.0".

Second, a note on how this script works: Method 2 is the recommended method, and if it works for you, you don't need to set up the internal bus IDs to remove. Did you try just setting the eGPU as primary with Method 2 and not entering any internal GPU IDs to remove?

Lastly, I also looked at the other errors in your log, and I believe I have worked out what causes them in this part of the script. I'm testing the fixes on my end to make sure they work correctly, and I'll let you know when I push them to the GitHub repo.

Hopefully these changes together will fix this issue.
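To illustrate the extra "0000:" problem: a bus ID that already carries a PCI domain must not get a second one prepended when the sysfs path is built. The sketch below shows one way to accept both forms. The function name `sysfs_pci_path` is hypothetical, and this is not necessarily how the script handles it.

```sh
#!/bin/sh
# Build the sysfs device path for a PCI bus ID given either as
# "bb:dd.f" (no domain) or "dddd:bb:dd.f" (with domain).
sysfs_pci_path() {
    slot="$1"
    case "$slot" in
        *:*:*) ;;                    # two colons: domain already present
        *)     slot="0000:$slot" ;;  # one colon: assume domain 0000
    esac
    printf '/sys/bus/pci/devices/%s\n' "$slot"
}
```

With this, both `sysfs_pci_path 00:02.0` and `sysfs_pci_path 0000:00:02.0` yield `/sys/bus/pci/devices/0000:00:02.0`.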

Faraclas commented 1 year ago

Thank you so much for your quick replies. I will address your questions inline:

"First off, I noticed that the script is trying to find the file "/sys/bus/pci/devices/0000:0000:00:02.0/remove" which isn't working because there's an extra "0000:". Did you use the guided setup or did you manually enter the bus IDs? If you manually enter the ids they should be in a form like "00:02.0"."

"Second just a note on how this script works, Method 2 is the recommended method and if it works for you, you don't need to setup the internal bus ids to remove. Did you try just setting the eGPU as primary with method 2 and not entering any internal gpu ids to remove?"

"Lastly, I also took a look at the other issues you're seeing here and I believe I worked out the issues causing the output you're seeing with this part of the script. I'm in the process of testing these on my end to make sure they work correctly and I'll let you know when I push these changes to the github repo."

I look forward to these changes in the script.

One other question that I have for you is how systemd is setting up the services to run. I noticed that after I logged in (to the eGPU not working) there was a pop-up to enter the root password to allow a user service to run. I am wondering if this could also be a potential cause. From my (very limited) understanding, once a user is logged in and the (internal) display is up, it's too late to "do stuff".


ewagner12 commented 1 year ago

You're correct that you don't need to do anything to undo Method 2, and what you describe is the right thing to try once the changes are pushed.

One reason Method 2 didn't work could be that the guided setup was giving it the wrong IDs. To help debug this, could you post the output of lspci?

The systemd prompt is expected when you log in if you said yes to both prompts during setup. There are two different systemd services: one runs before the display manager starts and is supposed to remove the iGPU, and one runs after login and can restart the iGPU. With the GNOME Wayland desktop, that lets you get a picture on the laptop screen while still keeping the eGPU as primary.
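To check whether the pre-login service really is ordered ahead of the display manager, the ordering properties of the unit can be inspected. The sketch below assumes the unit declares `Before=display-manager.service`, which this thread does not confirm. The function name `runs_before_display_manager` is hypothetical.

```sh
#!/bin/sh
# Succeeds if the text on stdin (the output of
# `systemctl show -p Before <unit>`) lists display-manager.service.
# Assumption: the unit orders itself with Before=display-manager.service.
runs_before_display_manager() {
    grep -q '^Before=.*display-manager\.service'
}

# Example (run manually on the affected system):
#   systemctl show -p Before all-ways-egpu.service | runs_before_display_manager \
#       && echo "ordered before the display manager"
```

If the check fails, the unit may still be ordered some other way, so `systemctl cat all-ways-egpu.service` is worth reading directly.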

Faraclas commented 1 year ago

$ lspci
0000:00:00.0 Host bridge: Intel Corporation 11th Gen Core Processor Host Bridge/DRAM Registers (rev 01)
0000:00:02.0 VGA compatible controller: Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics] (rev 01)
0000:00:04.0 Signal processing controller: Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant (rev 01)
0000:00:06.0 System peripheral: Intel Corporation RST VMD Managed Controller
0000:00:07.0 PCI bridge: Intel Corporation Tiger Lake-LP Thunderbolt 4 PCI Express Root Port #0 (rev 01)
0000:00:07.2 PCI bridge: Intel Corporation Tiger Lake-LP Thunderbolt 4 PCI Express Root Port #2 (rev 01)
0000:00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator module (rev 01)
0000:00:0a.0 Signal processing controller: Intel Corporation Tigerlake Telemetry Aggregator Driver (rev 01)
0000:00:0d.0 USB controller: Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller (rev 01)
0000:00:0d.2 USB controller: Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0 (rev 01)
0000:00:0d.3 USB controller: Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #1 (rev 01)
0000:00:0e.0 RAID bus controller: Intel Corporation Volume Management Device NVMe RAID Controller
0000:00:12.0 Serial controller: Intel Corporation Tiger Lake-LP Integrated Sensor Hub (rev 20)
0000:00:14.0 USB controller: Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller (rev 20)
0000:00:14.2 RAM memory: Intel Corporation Tiger Lake-LP Shared SRAM (rev 20)
0000:00:15.0 Serial bus controller: Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0 (rev 20)
0000:00:15.1 Serial bus controller: Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1 (rev 20)
0000:00:16.0 Communication controller: Intel Corporation Tiger Lake-LP Management Engine Interface (rev 20)
0000:00:19.0 Serial bus controller: Intel Corporation Tiger Lake-LP Serial IO I2C Controller #4 (rev 20)
0000:00:19.1 Serial bus controller: Intel Corporation Tiger Lake-LP Serial IO I2C Controller #5 (rev 20)
0000:00:1c.0 PCI bridge: Intel Corporation Device a0b8 (rev 20)
0000:00:1d.0 PCI bridge: Intel Corporation Device a0b3 (rev 20)
0000:00:1e.0 Communication controller: Intel Corporation Tiger Lake-LP Serial IO UART Controller #0 (rev 20)
0000:00:1f.0 ISA bridge: Intel Corporation Tiger Lake-LP LPC Controller (rev 20)
0000:00:1f.3 Multimedia audio controller: Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller (rev 20)
0000:00:1f.4 SMBus: Intel Corporation Tiger Lake-LP SMBus Controller (rev 20)
0000:00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-LP SPI Controller (rev 20)
0000:39:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:3a:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:3a:01.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:3a:02.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:3a:03.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:3a:04.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 02)
0000:4d:00.0 PCI bridge: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] (rev 02)
0000:4e:01.0 PCI bridge: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] (rev 02)
0000:4f:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
0000:4f:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
0000:71:00.0 Network controller: Qualcomm QCA6390 Wireless Network Adapter (rev 01)
0000:72:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5260 PCI Express Card Reader (rev 01)
10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller (rev 01)
10000:e1:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
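Note that this lspci output includes the PCI domain prefix (0000: and 10000:), so the first seven characters of a line are not a complete bus:device.function address here. A hedged sketch of pulling out the VGA slot addresses by field rather than by character position follows. The function name `vga_slots` is hypothetical, and this is not presented as the change made in the repo.

```sh
#!/bin/sh
# Read lspci-style lines on stdin and print the slot address (the first
# whitespace-separated field) of each VGA controller, whether or not
# the address carries a domain prefix.
vga_slots() {
    awk '/VGA compatible controller/ { print $1 }'
}

# Example (run manually): lspci | vga_slots
```

Fed the listing above, this prints 0000:00:02.0 and 0000:4f:00.0, one per line.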


ewagner12 commented 1 year ago

@Faraclas The changes were just pushed in commit 7de0f7c. If you could download the latest GitHub version, re-run the setup, and see whether you have any issues, that would be great!

Faraclas commented 1 year ago

@ewagner12 Thank you for the changes. I cloned the repo, ran the install command, and then used the guided setup. When I tried to boot (with the eGPU connected), I got stuck in the boot screen and never made it to the GDM login. I was able to boot into the system with the eGPU powered off.

ewagner12 commented 1 year ago

Ok, can you post the output of the all-ways-egpu status command?

Faraclas commented 1 year ago

I can try again. However, there are a few things I found out on my system that might make a difference.

ewagner12 commented 1 year ago

If I had to guess, I would say that the iGPU is being removed correctly, but the NVIDIA card is not being picked up by X/Wayland for whatever reason. If that's the case, here are some things I would try based on my experience with this:

ewagner12 commented 1 year ago

FYI I just pushed commit 618fd62 which improves Method 1 removal reliability and sometimes prevents black screens at least on my end. So you may want to try the latest git again and see if anything changes for you.