beliaev-maksim opened 1 year ago
This thread was migrated from launchpad.net
Thresholds are set here: https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/memory_compare.py#n85
According to the git history they were established in 2014 and have served their purpose since then. Let me throw the question back to you, what do you think are reasonable thresholds?
So thinking back on this, here are a few comments:
1: This test has existed for a long, long time. It was (and is) intended to check to see that the amount of memory the kernel sees is reasonably close to what is physically installed on the system (per lshw). Unfortunately, "reasonably close" is difficult to define, and difficult to check for.
2: 10% variance was, at least then, reasonable to account for physical memory reallocated for things like embedded graphics that the kernel never sees. Perhaps newer embedded GPUs are using more shared memory on occasion.
3: Using a percentage was the best way at the time to accomplish this because the amount of shared RAM varies from system to system and GPU to GPU. A hard limit like 256MB, for example, may be perfectly valid for 50% of systems, but the other 50% may use 384MB or 512MB (those are arbitrary numbers just for example; they do not reflect actual amounts of shared RAM).
I sometimes think about this test and wonder if there is a better way to do this, because the problem with percentages (and this also bugs me with the ethernet testing too) is, as you've observed, the larger the number the bigger that percentage becomes (10% of 1GB is a lot smaller than 10% of 10GB).
As a thought, at least for this, is there a way to probe how much RAM is being consumed outside the OS by the graphics or other system overhead? That could be a good improvement if you can probe that and then subtract the amount of system shared RAM from what lshw says is installed before comparing it to what the kernel has addressed.
Anyway, just some thoughts. This is more an issue on client systems than servers as my stuff generally has very little shared ram so this test never fails.
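A minimal sketch of the subtract-before-compare idea above (the function name and the `reserved_bytes` input are illustrative; actually probing how much RAM is reserved outside the OS is the open question):

```python
def memory_within_tolerance(lshw_bytes, meminfo_bytes,
                            reserved_bytes=0, tolerance=0.10):
    """Check that kernel-visible memory is close enough to installed
    memory, after subtracting memory known to be reserved outside the
    OS (e.g. iGPU carve-out). 10% tolerance matches the current test."""
    expected = lshw_bytes - reserved_bytes
    if expected <= 0:
        return False
    return (expected - meminfo_bytes) / expected <= tolerance
```

With the 8GiB system from this thread (1493368832 bytes missing, 17.39%), the plain comparison fails, but subtracting a known 1GiB carve-out would bring it inside the 10% band.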
Hi @jeff,
I observed that on some HP laptops with an AMD CPU & GPU, the BIOS can set the video memory size. The default setting is Auto, which uses 512 MB. If 256 MB is selected manually, memory/info passes.
@Andy,
In this case,
please refer to something like:
$ lspci -nnv -d ::0x0302
01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
Subsystem: Lenovo ThinkPad T480 [17aa:225e]
Flags: bus master, fast devsel, latency 0, IRQ 169, IOMMU group 13
Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
Memory at 80000000 (64-bit, prefetchable) [size=256M] # <----- here
Memory at 90000000 (64-bit, prefetchable) [size=32M]
I/O ports at d000 [size=128]
Capabilities:
could you please help confirm whether the memory size here matches what you saw in the BIOS?
When implementing the solution, please consider multi-GPU cases.
To filter out the iGPU, please refer to the ACPI spec and the following:
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:00\:02.0/firmware_node/adr
0x00020000
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:01\:00.0/firmware_node/adr
0x00000000
For the GPU class, please consider all display classes (e.g. 0x0300, 0x0302, etc.).
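The enumeration suggested above could be sketched roughly like this (paths follow the standard sysfs PCI layout; this only lists display-class devices and exposes their ACPI `_ADR` so a caller can apply the iGPU heuristic from the comment — it is not checkbox's actual implementation):

```python
import glob

DISPLAY_BASE_CLASS = 0x03  # base class 0x03 covers 0x0300 VGA, 0x0302 3D, etc.

def is_display_class(class_code):
    """class_code as read from /sys/bus/pci/devices/<addr>/class, e.g. 0x030000."""
    return (class_code >> 16) == DISPLAY_BASE_CLASS

def list_display_devices():
    """Yield (pci_address, acpi_adr) for every display-class PCI device.

    acpi_adr is None when the device has no firmware_node/adr attribute.
    """
    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        try:
            with open(dev + "/class") as f:
                class_code = int(f.read(), 16)
        except (OSError, ValueError):
            continue
        if not is_display_class(class_code):
            continue
        adr = None
        try:
            with open(dev + "/firmware_node/adr") as f:
                adr = int(f.read(), 16)
        except (OSError, ValueError):
            pass
        yield dev.rsplit("/", 1)[-1], adr
```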
@jeremy,
$ lspci -nnv -d ::0x0300
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:1638] (rev d3) (prog-if 00 [VGA controller])
DeviceName: Onboard IGD
Subsystem: Hewlett-Packard Company Device [103c:8895]
Flags: bus master, fast devsel, latency 0, IRQ 51
Memory at 260000000 (64-bit, prefetchable) [size=256M]
Memory at 270000000 (64-bit, prefetchable) [size=2M]
I/O ports at 1000 [size=256]
Memory at fb300000 (32-bit, non-prefetchable) [size=512K]
Capabilities:
lspci shows 256M, while the kernel log reports 512 MB:
[kernel log]
[ 0.870069] [drm] amdgpu: 512M of VRAM memory ready
[ 0.870072] [drm] amdgpu: 3072M of GTT memory ready.
It seems this memory region does not account for the BIOS-reserved memory, and the kernel values are reported by amdgpu (probably obtained from firmware).
I think we need to list all FW-reserved memory first.
/proc/meminfo reports: 6.61GiB lshw reports: 8GiB
FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.
To keep it simple, could we just relax the threshold so that 6.61GiB passes against 8GiB? Thanks!
Before you just change the criteria to make the test pass, can you answer the question of WHY so much memory is being shunted elsewhere?
In this case, you have a machine with 8GB of RAM, and nearly 20% of that RAM is unavailable to the OS because it's being consumed somewhere else. I'm not saying that lowering the limit just to get the test to pass is the wrong answer here, only that by lowering the threshold to fail, you're likely to hide other cases where this shouldn't be happening.
In general, when I review certs, I expect that some things will fail in some cases; in those cases I will ask questions, and either accept or reject the result based on the answers. IMO, for a test that has existed for 8 years and has done its job all that time, lowering the threshold because one machine isn't working as the test expects seems a bit premature?
If the system reserves so much memory this should be well documented and justified. But this IMHO should not warrant changing the thresholds for all systems. If there is justification for that special system, create a custom job for that system, or make the threshold customizable via configs with the default being what has been used for years.
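As a sketch of the configurable-threshold idea (the `MEMORY_INFO_TOLERANCE` variable name is hypothetical, not an existing checkbox option; the point is only that the default stays at the long-standing 10%):

```python
import os

DEFAULT_TOLERANCE_PCT = 10.0  # the value used for years stays the default

def tolerance_pct():
    """Return the allowed variance in percent, overridable per-platform
    via an environment variable (name is illustrative only)."""
    try:
        return float(os.environ.get("MEMORY_INFO_TOLERANCE",
                                    DEFAULT_TOLERANCE_PCT))
    except ValueError:
        return DEFAULT_TOLERANCE_PCT
```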
I reviewed all the related bugs in the sutton project; all the affected configs are AMD platforms, and the test fails when RAM is larger than 8G. Changing the threshold to 20% for 8G systems could not fix all the issues.
On M75n I found the difference is 28.34%, because it uses 2G of shared memory for VRAM, and I could not change the value from the BIOS.
Results: /proc/meminfo reports: 5.73GiB lshw reports: 8GiB
FAIL: Meminfo reports 2434531328 less than lshw, a difference of 28.34%. Only a variance of 10% in reported memory is allowed.
[ 0.746168] amdgpu 0000:04:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[ 0.746181] [drm] Detected VRAM RAM=2048M, BAR=2048M
[ 0.746246] [drm] amdgpu: 2048M of VRAM memory ready
On drift3-amd, the memory is 32G. In the BIOS there is a "UMA frame buffer size" setting, which defaults to 'Auto' and also offers 1GB, 2GB, 4GB and 8GB options. When I set 1G or 2G this test case passes; according to dmesg, 'Auto' means 4G. This issue does not look related to the Prefetchable value in lspci, which stays at 256M whatever the VRAM value is.
In this case 4G is used as the default for shared memory, which sounds reasonable; how can we avoid the failure of the memory/info test case?
Results: /proc/meminfo reports: 27.25GiB lshw reports: 32GiB
FAIL: Meminfo reports 5104971776 less than lshw, a difference of 14.86%. Only a variance of 10% in reported memory is allowed.
Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of VRAM memory ready
Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of GTT memory ready.
$ sudo lspci -nv | grep Prefetchable
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: 0000000830000000-00000008301fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: 0000000860000000-00000008701fffff [size=258M]
On golem-amd, the memory is 8G. In the BIOS there is also a "UMA frame buffer size" setting, which defaults to 'Auto' and offers 1GB, 2GB and 4GB options. According to dmesg, 'Auto' means 1G here.
Is it possible to compare lshw against the sum of VRAM and /proc/meminfo?
[ 1.057123] [drm] Detected VRAM RAM=1024M, BAR=1024M
[ 1.057123] [drm] RAM width 64bits DDR4
[ 1.057152] [drm] amdgpu: 1024M of VRAM memory ready
[ 1.057153] [drm] amdgpu: 3072M of GTT memory ready.
Results: /proc/meminfo reports: 6.61GiB lshw reports: 8GiB
FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.
From 'glxinfo -B' we can get the 'Video memory' value; if we add this value to /proc/meminfo, all the platforms on my side would pass this test. Thanks!
Extended renderer info (GLX_MESA_query_renderer):
Vendor: AMD (0x1002)
Device: AMD RENOIR (DRM 3.42.0, 5.14.0-1027-oem, LLVM 12.0.0) (0x15e7)
Version: 21.2.6
Accelerated: yes
Video memory: 1024MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 4.6
Max compat profile version: 4.6
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
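If the glxinfo route were taken, extracting the value might look like this (a sketch only; the regex assumes the `Video memory: NNNNMB` line format shown above, and as noted later in the thread this value is not reliable on every GPU):

```python
import re

def parse_glxinfo_video_memory(glxinfo_output):
    """Extract 'Video memory: NNNNMB' from `glxinfo -B` output.

    Returns the size in bytes, or None if the line is absent.
    """
    m = re.search(r"Video memory:\s*(\d+)\s*MB", glxinfo_output)
    return int(m.group(1)) * 1024 * 1024 if m else None
```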
The value from comment#15 is GLX_RENDERER_UNIFIED_MEMORY_ARCHITECTURE_MESA, which is not exactly correct on my I+N system.
If possible, it would be better to get the reserved memory from kernel space. So I'm wondering: why doesn't amdgpu show the reserved memory in lspci?
@kaihengfeng,
Here is the full lspci. Thanks!
For amdgpu, there's a sysfs attribute 'mem_info_vram_total' that shows the carved-out RAM size, so please consider that in the checkbox logic.
Using BAR size as VRAM size is only accurate for discrete AMD GFX. AMD APU has its own way to decide VRAM size.
I have no way to know whether an AMDGPU device is an APU unless amdgpu_device->flag exports it.
"mem_info_vram_total" seems to work for both AMD iGPUs and dGPUs. Let's consider counting them by GPU vendor.
So it's better to just use "mem_info_vram_total" - it will work regardless of integrated or discrete.
Yep, I proposed it as https://code.launchpad.net/~os369510/plainbox-provider-checkbox/+git/plainbox-provider-checkbox/+merge/416966
This issue was migrated from https://bugs.launchpad.net/plainbox-provider-checkbox/+bug/1941854
Summary
Description
[I/O log] Results: /proc/meminfo reports: 7.16GiB lshw reports: 8GiB
FAIL: Meminfo reports 905527296 less than lshw, a difference of 10.54%. Only a variance of 10% in reported memory is allowed.
[Reproduce Steps]
Why is the tolerance 10%? I have another machine with 16G of RAM but only 14.6G in meminfo; that test case passes due to the bigger denominator.
Attachments
submission_2021-05-07T07.52.07.084481.tar lspci_nvv.txt
Tags: ['cbox-52', 'oem-priority', 'originate-from-1927709', 'originate-from-1938006', 'originate-from-1953698', 'originate-from-1954987', 'originate-from-1958337', 'originate-from-1958473', 'originate-from-1958516', 'originate-from-1962148', 'originate-from-1964451', 'originate-from-1974175', 'originate-from-1976476', 'originate-from-1990217', 'somerville', 'stella', 'sutton']