Frogging-Family / linux-tkg

linux-tkg custom kernels
GNU General Public License v2.0

`eevdf` Scheduler Somehow Causes Game Launch Crashes with Vanilla Proton #930

Closed: ThisNekoGuy closed this issue 4 months ago

ThisNekoGuy commented 5 months ago

Describe the bug

I found(?) that using the default eevdf scheduler breaks compatibility with Proton somehow via D3D11(?) Wine errors:

I've tested this by compiling the same kernel with exactly the same configuration, the only difference being a switch to pds as a working comparison. I have no idea why eevdf specifically causes the issue, but I figured it was worth mentioning since it's the only option available for at least 6.9 over here right now. (6.9 is where I originally started having this issue; I downgraded to 6.8 so that I could compare 6.8.9-eevdf against 6.8.9-pds.)

Tk-Glitch commented 5 months ago

Sounds like https://gitlab.freedesktop.org/drm/amd/-/issues/3343

ThisNekoGuy commented 4 months ago

That issue says that I would need Above 4G Decoding and Resizable Bar disabled though, and I have them enabled in my BIOS, so... how...? 🤔

ptr1337 commented 4 months ago

> That issue says that I would need Above 4G Decoding and Resizable Bar disabled though, and I have them enabled in my BIOS, so... how...? 🤔

That does not seem to be correct; it is just one way to reproduce the issue. The best way would be either to revert the commit or to test whether the issue is present on 6.8.8.

ThisNekoGuy commented 4 months ago

I made a patch that reverts the commit for 6.8, but it immediately fails.

git format-patch -1 a6ff969fe9cbf369e3cd0ac54261fec1122682ec --stdout > ~/linux-tkg/0000-revert-commit-a6ff969fe9cbf369e3cd0ac54261fec1122682ec.mypatch

 -> ######################################################
 -> 
 -> Applying your own linux-6.8 patch /home/neko-san/linux-tkg/linux68-tkg-userpatches/0000-revert-commit-a6ff969fe9cbf369e3cd0ac54261fec1122682ec.mypatch
 -> 
 -> ######################################################
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
Hunk #1 succeeded at 620 (offset 3 lines).
Hunk #2 FAILED at 1271.
Hunk #3 succeeded at 1380 (offset -8 lines).
Hunk #4 succeeded at 1404 (offset -7 lines).
Hunk #5 succeeded at 1568 (offset -7 lines).
Hunk #6 succeeded at 1577 (offset -7 lines).
1 out of 6 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_object.c.rej
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
Hunk #1 succeeded at 244 (offset -6 lines).
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
Hunk #1 succeeded at 137 with fuzz 1 (offset 4 lines).
Hunk #2 succeeded at 408 (offset 5 lines).
Hunk #3 succeeded at 549 (offset 5 lines).
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
 -> exit cleanup done

ptr1337 commented 4 months ago

https://github.com/CachyOS/kernel-patches/commit/19dd3a1f0aaa0deb61964e9d88a361804e3c6a24

Here you can find an adjusted one.

ThisNekoGuy commented 4 months ago

@ptr1337 That patch also fails.

6.8.8:

patching file Documentation/ABI/testing/sysfs-driver-hid-asus
patching file arch/Kconfig
Reversed (or previously applied) patch detected!  Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file arch/Kconfig.rej
patching file drivers/hid/Makefile
patching file drivers/hid/hid-asus-core.c (renamed from drivers/hid/hid-asus.c)
patching file drivers/hid/hid-asus-rog.c
patching file drivers/hid/hid-asus-rog.h
patching file drivers/hid/hid-asus.h
patching file drivers/hid/hid-ids.h
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
Reversed (or previously applied) patch detected!  Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c.rej
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
Reversed (or previously applied) patch detected!  Skipping patch.
6 out of 6 hunks ignored -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_object.c.rej
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
Reversed (or previously applied) patch detected!  Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_object.h.rej
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
Reversed (or previously applied) patch detected!  Skipping patch.
3 out of 3 hunks ignored -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c.rej
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
Reversed (or previously applied) patch detected!  Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h.rej

6.8.9:

patching file Documentation/ABI/testing/sysfs-driver-hid-asus
patching file arch/Kconfig
Reversed (or previously applied) patch detected!  Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file arch/Kconfig.rej
patching file drivers/hid/Makefile
patching file drivers/hid/hid-asus-core.c (renamed from drivers/hid/hid-asus.c)
patching file drivers/hid/hid-asus-rog.c
patching file drivers/hid/hid-asus-rog.h
patching file drivers/hid/hid-asus.h
patching file drivers/hid/hid-ids.h
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
Hunk #1 succeeded at 620 (offset 1 line).
Hunk #2 succeeded at 1275 (offset 1 line).
Hunk #3 succeeded at 1392 (offset 1 line).
Hunk #4 succeeded at 1419 (offset 2 lines).
Hunk #5 succeeded at 1583 (offset 2 lines).
Hunk #6 succeeded at 1591 (offset 2 lines).
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
Hunk #1 succeeded at 137 with fuzz 1 (offset 4 lines).
Hunk #2 succeeded at 408 (offset 5 lines).
Hunk #3 succeeded at 534 (offset 5 lines).
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
 -> exit cleanup done

ptr1337 commented 4 months ago

I am talking about just the diff, not the complete patch:

From 4dd993adabd5f3cc200c11385f269640bee6936e Mon Sep 17 00:00:00 2001
From: Peter Jung <admin@ptr1337.dev>
Date: Sun, 5 May 2024 11:28:43 +0200
Subject: [PATCH] revert drm/amdgpu: fix visible VRAM handling during faults

a6ff969fe9cbf369e3cd0ac54261fec1122682ec

Signed-off-by: Peter Jung <admin@ptr1337.dev>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 22 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 22 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 61 ++++++++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    |  3 --
 5 files changed, 57 insertions(+), 53 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..0a4b09709cfb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -819,7 +819,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)

    p->bytes_moved += ctx.bytes_moved;
    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-       amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+       amdgpu_bo_in_cpu_visible_vram(bo))
        p->bytes_moved_vis += ctx.bytes_moved;

    if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index ce733e3cb35d..dd90241248a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -619,7 +619,8 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
        return r;

    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-       amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+       bo->tbo.resource->mem_type == TTM_PL_VRAM &&
+       amdgpu_bo_in_cpu_visible_vram(bo))
        amdgpu_cs_report_moved_bytes(adev, ctx.bytes_moved,
                         ctx.bytes_moved);
    else
@@ -1273,25 +1274,23 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo, bool evict)
 void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
              struct amdgpu_mem_stats *stats)
 {
-   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-   struct ttm_resource *res = bo->tbo.resource;
    uint64_t size = amdgpu_bo_size(bo);
    struct drm_gem_object *obj;
    unsigned int domain;
    bool shared;

    /* Abort if the BO doesn't currently have a backing store */
-   if (!res)
+   if (!bo->tbo.resource)
        return;

    obj = &bo->tbo.base;
    shared = drm_gem_object_is_shared_for_memory_stats(obj);

-   domain = amdgpu_mem_type_to_domain(res->mem_type);
+   domain = amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
    switch (domain) {
    case AMDGPU_GEM_DOMAIN_VRAM:
        stats->vram += size;
-       if (amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+       if (amdgpu_bo_in_cpu_visible_vram(bo))
            stats->visible_vram += size;
        if (shared)
            stats->vram_shared += size;
@@ -1392,7 +1391,10 @@ vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
    /* Remember that this BO was accessed by the CPU */
    abo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;

-   if (amdgpu_res_cpu_visible(adev, bo->resource))
+   if (bo->resource->mem_type != TTM_PL_VRAM)
+       return 0;
+
+   if (amdgpu_bo_in_cpu_visible_vram(abo))
        return 0;

    /* Can't move a pinned BO to visible VRAM */
@@ -1415,7 +1417,7 @@ vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)

    /* this should never happen */
    if (bo->resource->mem_type == TTM_PL_VRAM &&
-       !amdgpu_res_cpu_visible(adev, bo->resource))
+       !amdgpu_bo_in_cpu_visible_vram(abo))
        return VM_FAULT_SIGBUS;

    ttm_bo_move_to_lru_tail_unlocked(bo);
@@ -1579,7 +1581,6 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
  */
 u64 amdgpu_bo_print_info(int id, struct amdgpu_bo *bo, struct seq_file *m)
 {
-   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
    struct dma_buf_attachment *attachment;
    struct dma_buf *dma_buf;
    const char *placement;
@@ -1588,11 +1589,10 @@ u64 amdgpu_bo_print_info(int id, struct amdgpu_bo *bo, struct seq_file *m)

    if (dma_resv_trylock(bo->tbo.base.resv)) {
        unsigned int domain;
-
        domain = amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
        switch (domain) {
        case AMDGPU_GEM_DOMAIN_VRAM:
-           if (amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+           if (amdgpu_bo_in_cpu_visible_vram(bo))
                placement = "VRAM VISIBLE";
            else
                placement = "VRAM";
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index fa03d9e4874c..be679c42b0b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -250,6 +250,28 @@ static inline u64 amdgpu_bo_mmap_offset(struct amdgpu_bo *bo)
    return drm_vma_node_offset_addr(&bo->tbo.base.vma_node);
 }

+/**
+ * amdgpu_bo_in_cpu_visible_vram - check if BO is (partly) in visible VRAM
+ */
+static inline bool amdgpu_bo_in_cpu_visible_vram(struct amdgpu_bo *bo)
+{
+   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   struct amdgpu_res_cursor cursor;
+
+   if (!bo->tbo.resource || bo->tbo.resource->mem_type != TTM_PL_VRAM)
+       return false;
+
+   amdgpu_res_first(bo->tbo.resource, 0, amdgpu_bo_size(bo), &cursor);
+   while (cursor.remaining) {
+       if (cursor.start < adev->gmc.visible_vram_size)
+           return true;
+
+       amdgpu_res_next(&cursor, cursor.size);
+   }
+
+   return false;
+}
+
 /**
  * amdgpu_bo_explicit_sync - return whether the bo is explicitly synced
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 1d71729e3f6b..6417cb76ccd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -133,7 +133,7 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,

        } else if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
               !(abo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) &&
-              amdgpu_res_cpu_visible(adev, bo->resource)) {
+              amdgpu_bo_in_cpu_visible_vram(abo)) {

            /* Try evicting to the CPU inaccessible part of VRAM
             * first, but only set GTT as busy placement, so this
@@ -403,55 +403,40 @@ static int amdgpu_move_blit(struct ttm_buffer_object *bo,
    return r;
 }

-/**
- * amdgpu_res_cpu_visible - Check that resource can be accessed by CPU
- * @adev: amdgpu device
- * @res: the resource to check
+/*
+ * amdgpu_mem_visible - Check that memory can be accessed by ttm_bo_move_memcpy
  *
- * Returns: true if the full resource is CPU visible, false otherwise.
+ * Called by amdgpu_bo_move()
  */
-bool amdgpu_res_cpu_visible(struct amdgpu_device *adev,
-               struct ttm_resource *res)
+static bool amdgpu_mem_visible(struct amdgpu_device *adev,
+                  struct ttm_resource *mem)
 {
+   u64 mem_size = (u64)mem->size;
    struct amdgpu_res_cursor cursor;
+   u64 end;

-   if (!res)
-       return false;
-
-   if (res->mem_type == TTM_PL_SYSTEM || res->mem_type == TTM_PL_TT ||
-       res->mem_type == AMDGPU_PL_PREEMPT)
+   if (mem->mem_type == TTM_PL_SYSTEM ||
+       mem->mem_type == TTM_PL_TT)
        return true;
-
-   if (res->mem_type != TTM_PL_VRAM)
+   if (mem->mem_type != TTM_PL_VRAM)
        return false;

-   amdgpu_res_first(res, 0, res->size, &cursor);
+   amdgpu_res_first(mem, 0, mem_size, &cursor);
+   end = cursor.start + cursor.size;
    while (cursor.remaining) {
-       if ((cursor.start + cursor.size) >= adev->gmc.visible_vram_size)
-           return false;
        amdgpu_res_next(&cursor, cursor.size);
-   }

-   return true;
-}
+       if (!cursor.remaining)
+           break;

-/*
- * amdgpu_res_copyable - Check that memory can be accessed by ttm_bo_move_memcpy
- *
- * Called by amdgpu_bo_move()
- */
-static bool amdgpu_res_copyable(struct amdgpu_device *adev,
-               struct ttm_resource *mem)
-{
-   if (!amdgpu_res_cpu_visible(adev, mem))
-       return false;
+       /* ttm_resource_ioremap only supports contiguous memory */
+       if (end != cursor.start)
+           return false;

-   /* ttm_resource_ioremap only supports contiguous memory */
-   if (mem->mem_type == TTM_PL_VRAM &&
-       !(mem->placement & TTM_PL_FLAG_CONTIGUOUS))
-       return false;
+       end = cursor.start + cursor.size;
+   }

-   return true;
+   return end <= adev->gmc.visible_vram_size;
 }

 /*
@@ -544,8 +529,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,

    if (r) {
        /* Check that all memory is CPU accessible */
-       if (!amdgpu_res_copyable(adev, old_mem) ||
-           !amdgpu_res_copyable(adev, new_mem)) {
+       if (!amdgpu_mem_visible(adev, old_mem) ||
+           !amdgpu_mem_visible(adev, new_mem)) {
            pr_err("Move buffer fallback to memcpy unavailable\n");
            return r;
        }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 32cf6b6f6efd..65ec82141a8e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -139,9 +139,6 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
 int amdgpu_vram_mgr_query_page_status(struct amdgpu_vram_mgr *mgr,
                      uint64_t start);

-bool amdgpu_res_cpu_visible(struct amdgpu_device *adev,
-               struct ttm_resource *res);
-
 int amdgpu_ttm_init(struct amdgpu_device *adev);
 void amdgpu_ttm_fini(struct amdgpu_device *adev);
 void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev,
-- 
2.45.0

ThisNekoGuy commented 4 months ago

That specific diff patch does fix the issue with the game launches; I just tested 👍
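
For anyone reading the diff above, here is a rough userspace sketch of how the two visibility checks differ. It is only a model: `struct vram_segment`, `partly_visible` and `fully_visible_strict` are hypothetical names standing in for the `amdgpu_res_cursor` walk, the restored `amdgpu_bo_in_cpu_visible_vram()` and the VRAM branch of the removed `amdgpu_res_cpu_visible()`. It does not claim to identify the root cause; it just shows that the two checks can disagree for the same buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one segment visited by amdgpu_res_cursor. */
struct vram_segment {
    uint64_t start; /* byte offset into VRAM */
    uint64_t size;  /* length in bytes */
};

/*
 * Models the restored amdgpu_bo_in_cpu_visible_vram(): true as soon as
 * any segment starts below the CPU-visible limit ("partly" visible).
 */
static bool partly_visible(const struct vram_segment *segs, size_t n,
                           uint64_t visible_size)
{
    assert(segs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (segs[i].start < visible_size)
            return true;
    }
    return false;
}

/*
 * Models the VRAM branch of the removed amdgpu_res_cpu_visible(): false
 * as soon as the end of any segment reaches (>=) the CPU-visible limit.
 */
static bool fully_visible_strict(const struct vram_segment *segs, size_t n,
                                 uint64_t visible_size)
{
    assert(segs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (segs[i].start + segs[i].size >= visible_size)
            return false;
    }
    return true;
}
```

With a visible limit of 256 bytes, a single segment covering bytes 192 to 255 is reported as visible by `partly_visible` but as not visible by `fully_visible_strict`, because its end offset (256) compares `>=` to the limit.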

Tk-Glitch commented 4 months ago

What is the output of `sudo dmesg | grep BAR`?

ThisNekoGuy commented 4 months ago

This is the output:

[    2.840027] pci 0000:01:00.0: BAR 0 [mem 0xfcf00000-0xfcf03fff 64bit]
[    2.840312] pci 0000:02:00.0: BAR 0 [mem 0xfcea0000-0xfcea7fff 64bit]
[    2.840610] pci 0000:02:00.1: BAR 5 [mem 0xfce80000-0xfce9ffff]
[    2.843070] pci 0000:09:00.0: BAR 0 [mem 0xfcd00000-0xfcd03fff 64bit]
[    2.843542] pci 0000:0a:00.0: BAR 0 [io  0xf000-0xf0ff]
[    2.843578] pci 0000:0a:00.0: BAR 2 [mem 0xfcc04000-0xfcc04fff 64bit]
[    2.843601] pci 0000:0a:00.0: BAR 4 [mem 0xfcc00000-0xfcc03fff 64bit]
[    2.844126] pci 0000:0b:00.0: BAR 0 [mem 0xfcb00000-0xfcb03fff]
[    2.845289] pci 0000:0d:00.0: BAR 0 [mem 0x7800000000-0x7bffffffff 64bit pref]
[    2.845300] pci 0000:0d:00.0: BAR 2 [mem 0x7c00000000-0x7c0fffffff 64bit pref]
[    2.845307] pci 0000:0d:00.0: BAR 4 [io  0xe000-0xe0ff]
[    2.845315] pci 0000:0d:00.0: BAR 5 [mem 0xfc900000-0xfc9fffff]
[    2.845345] pci 0000:0d:00.0: BAR 0: assigned to efifb
[    2.845581] pci 0000:0d:00.1: BAR 0 [mem 0xfca20000-0xfca23fff]
[    2.846209] pci 0000:0f:00.1: BAR 2 [mem 0xfc700000-0xfc7fffff]
[    2.846218] pci 0000:0f:00.1: BAR 5 [mem 0xfc808000-0xfc809fff]
[    2.846341] pci 0000:0f:00.3: BAR 0 [mem 0xfc600000-0xfc6fffff 64bit]
[    2.846513] pci 0000:0f:00.4: BAR 0 [mem 0xfc800000-0xfc807fff]
[    8.792115] [drm] Detected VRAM RAM=16368M, BAR=16384M

Also, I made sure to disable CSM in my BIOS, because someone on that bug report mentioned that having CSM enabled supposedly implicitly disables Resizable BAR: https://gitlab.freedesktop.org/drm/amd/-/issues/3343#note_2401066

Tk-Glitch commented 4 months ago

It does indeed. That looks correct for rebar.
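
If you want to check that line without reading it by eye, something like the sketch below would do. It assumes the message keeps the exact `Detected VRAM RAM=<n>M, BAR=<n>M` shape shown in the output above, and it uses "BAR at least as large as VRAM" as a rough heuristic for an active Resizable BAR; both the helper names and that heuristic are assumptions, not something the driver documents.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extracts the VRAM and BAR sizes (in MiB) from a
 * dmesg line of the form "... Detected VRAM RAM=16368M, BAR=16384M".
 * Returns false if the line does not contain that message.
 */
static bool parse_vram_bar(const char *line, unsigned long *ram_mib,
                           unsigned long *bar_mib)
{
    const char *msg = strstr(line, "Detected VRAM RAM=");

    if (msg == NULL)
        return false;

    /* Both numbers must be converted for the line to count as a match. */
    return sscanf(msg, "Detected VRAM RAM=%luM, BAR=%luM",
                  ram_mib, bar_mib) == 2;
}

/* Heuristic (an assumption): the BAR spans all of VRAM when BAR >= RAM. */
static bool bar_covers_vram(const char *line)
{
    unsigned long ram_mib = 0;
    unsigned long bar_mib = 0;

    return parse_vram_bar(line, &ram_mib, &bar_mib) && bar_mib >= ram_mib;
}
```

For the line posted above (RAM=16368M, BAR=16384M) `bar_covers_vram` returns true; a line reporting BAR=256M, or any of the plain `pci ... BAR n` lines, would return false.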

ThisNekoGuy commented 4 months ago

Since this seems to be the problem, should we have a version of this patch (optionally?) down to 6.6(?) since it supposedly affects several 6.x revisions? (Correct me if that's a bad idea or something; I don't know how much effort that would be exactly and whether it wouldn't be worth it - I don't have much experience with making patches)

ptr1337 commented 4 months ago

> Since this seems to be the problem, should we have a version of this patch (optionally?) down to 6.6(?) since it supposedly affects several 6.x revisions? (Correct me if that's a bad idea or something; I don't know how much effort that would be exactly and whether it wouldn't be worth it - I don't have much experience with making patches)

There seems to be a commit in 6.7rc7. If you want, could you please test whether this is fixed without the above patch? See here: https://github.com/CachyOS/linux/commit/705d0480e6ae5a73ca3a9c04316d0678e19a46ed