canonical / checkbox

Checkbox is a testing framework used to validate device compatibility with Ubuntu Linux. It’s the testing tool developed for the purposes of the Ubuntu Certification program.
https://checkbox.readthedocs.io
GNU General Public License v3.0

LP1941854: memory/info failed on system which ram is lower than 8G #191

Open beliaev-maksim opened 1 year ago

beliaev-maksim commented 1 year ago

This issue was migrated from https://bugs.launchpad.net/plainbox-provider-checkbox/+bug/1941854

Summary

Status: Incomplete
Created on: 2021-08-27 08:17:32
Heat: 10
Importance: Undecided
Security related: False

Description

[I/O log]
Results:
    /proc/meminfo reports: 7.16GiB
    lshw reports: 8GiB

FAIL: Meminfo reports 905527296 less than lshw, a difference of 10.54%. Only a variance of 10% in reported memory is allowed.

[Reproduce Steps]

  1. sudo checkbox-cli run com.canonical.certification::memory/info

Why is the tolerance 10%? I have another machine with 16G of RAM that shows only 14.6G in meminfo, yet this test case passes there because of the bigger denominator.
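The failure message and the point about the denominator can be checked with a little arithmetic. This is a minimal sketch, not the code from memory_compare.py: the function name `variance_percent` is made up for illustration, and it assumes the difference is expressed as a fraction of the lshw figure, which is consistent with the numbers in the log above.

```python
GIB = 1024 ** 3


def variance_percent(lshw_bytes, meminfo_bytes):
    """Return how much smaller meminfo is than lshw, as a percentage of lshw."""
    return (lshw_bytes - meminfo_bytes) / lshw_bytes * 100


# Numbers from the report: lshw says 8GiB, meminfo is 905527296 bytes lower.
small = variance_percent(8 * GIB, 8 * GIB - 905527296)
# The 16G machine from the question, with 14.6G visible in meminfo.
large = variance_percent(16 * GIB, 14.6 * GIB)

print(f"8G machine:  {small:.2f}%")
print(f"16G machine: {large:.2f}%")
```

The 8G machine comes out at 10.54% and fails, while the 16G machine comes out at 8.75% and passes, even though it is missing more memory in absolute terms (about 1.4G against about 0.84G).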

Attachments

submission_2021-05-07T07.52.07.084481.tar lspci_nvv.txt

Tags: ['cbox-52', 'oem-priority', 'originate-from-1927709', 'originate-from-1938006', 'originate-from-1953698', 'originate-from-1954987', 'originate-from-1958337', 'originate-from-1958473', 'originate-from-1958516', 'originate-from-1962148', 'originate-from-1964451', 'originate-from-1974175', 'originate-from-1976476', 'originate-from-1990217', 'somerville', 'stella', 'sutton']

beliaev-maksim commented 1 year ago

This thread was migrated from launchpad.net

https://launchpad.net/~jocave wrote on 2021-08-27 09:04:42:

Thresholds are set here: https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/memory_compare.py#n85

According to the git history, they were established in 2014 and have served their purpose since then. Let me throw the question back to you: what do you think are reasonable thresholds?

https://launchpad.net/~bladernr wrote on 2021-09-03 18:42:45:

So thinking back on this, here are a few comments:

1: This test has existed for a long, long time. It was (and is) intended to check to see that the amount of memory the kernel sees is reasonably close to what is physically installed on the system (per lshw). Unfortunately, "reasonably close" is difficult to define, and difficult to check for.

2: 10% variance was, at least then, reasonable to account for physical memory reallocated for things like embedded graphics that the kernel never sees. Perhaps newer embedded GPUs are using more shared memory on occasion.

3: Using a percentage was the best way at the time to accomplish this because the amount of shared RAM varies from system to system, GPU to GPU. A hard limit like 256MB for example may be perfectly valid for 50% of systems, but then the other 50% may use 384MB or 512MB (those are arbitrary numbers just for example; they do not reflect actual amounts of shared RAM).

I sometimes think about this test and wonder if there is a better way to do this, because the problem with percentages (and this bugs me with the ethernet testing too) is, as you've observed, that the larger the number, the bigger that percentage becomes (10% of 1GB is a lot smaller than 10% of 10GB).

As a thought, at least for this, is there a way to probe how much RAM is being consumed outside the OS by the graphics or other system overhead? That could be a good improvement if you can probe that and then subtract the amount of system shared RAM from what lshw says is installed before comparing it to what the kernel has addressed.

Anyway, just some thoughts. This is more an issue on client systems than servers as my stuff generally has very little shared ram so this test never fails.

https://launchpad.net/~andch wrote on 2021-09-06 03:22:55:

Hi @jeff, I observed that on some HP laptops with an AMD CPU and GPU, the BIOS lets you set the video memory size. The default setting is auto, which uses 512 MB. If I select 256 MB manually, memory/info passes.

https://launchpad.net/~os369510 wrote on 2021-09-06 03:43:08:

@Andy,

In this case, please refer to something like:

$ lspci -nnv -d ::0x0302
01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
    Subsystem: Lenovo ThinkPad T480 [17aa:225e]
    Flags: bus master, fast devsel, latency 0, IRQ 169, IOMMU group 13
    Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 80000000 (64-bit, prefetchable) [size=256M] # <----- here
    Memory at 90000000 (64-bit, prefetchable) [size=32M]
    I/O ports at d000 [size=128]
    Capabilities:
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia

Could you please confirm that the memory size here is the same as what you saw in the BIOS?

When implementing the solution, please consider the multi-GPU cases.

To filter out the iGPU, please refer to the ACPI spec and the following:

jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:00\:02.0/firmware_node/adr 
0x00020000
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:01\:00.0/firmware_node/adr 
0x00000000

For the GPU class, please consider all display classes (e.g. 0x0300, 0x0302, etc.).
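One way to cover every display class without listing them one by one is to match on the PCI base class (0x03) instead of the full class code. The sketch below reads the `class` attribute that sysfs exposes for each PCI device. The function name `list_display_devices` and its `pci_root` parameter are hypothetical, introduced only so the logic can be tried against a fake directory tree; this is not code from checkbox.

```python
from pathlib import Path


def list_display_devices(pci_root="/sys/bus/pci/devices"):
    """Return (address, class_code) pairs for PCI devices whose base class is 0x03."""
    found = []
    root = Path(pci_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            class_code = int((dev / "class").read_text().strip(), 16)
        except (OSError, ValueError):
            # No readable class attribute; skip this entry.
            continue
        # The sysfs value is 24 bits: base class, subclass, programming interface.
        if class_code >> 16 == 0x03:
            found.append((dev.name, f"0x{class_code >> 8:04x}"))
    return found


if __name__ == "__main__":
    for address, class_code in list_display_devices():
        print(address, class_code)
```

On a machine like the ThinkPad T480 above, this would be expected to report both the VGA controller (0x0300) and the 3D controller (0x0302), which addresses the multi-GPU case. Telling the iGPU apart from a dGPU is a separate step that this sketch does not attempt.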

https://launchpad.net/~andch wrote on 2021-09-07 08:32:17:

@jeremy,

$ lspci -nnv -d ::0x0300
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:1638] (rev d3) (prog-if 00 [VGA controller])
    DeviceName: Onboard IGD
    Subsystem: Hewlett-Packard Company Device [103c:8895]
    Flags: bus master, fast devsel, latency 0, IRQ 51
    Memory at 260000000 (64-bit, prefetchable) [size=256M]
    Memory at 270000000 (64-bit, prefetchable) [size=2M]
    I/O ports at 1000 [size=256]
    Memory at fb300000 (32-bit, non-prefetchable) [size=512K]
    Capabilities:
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

lspci shows 256M, whereas the kernel log shows 512M.

[kernel log]
[ 0.870069] [drm] amdgpu: 512M of VRAM memory ready
[ 0.870072] [drm] amdgpu: 3072M of GTT memory ready.
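When comparing the lspci BAR size against what the driver reports, the VRAM figure can be pulled out of a saved kernel log with a small regular expression. This is only an investigation aid under stated assumptions: the helper name `vram_mib_from_log` is hypothetical, the pattern is taken from the log lines quoted in this thread, and the "M" suffix is treated as MiB. The message wording could differ between kernel versions.

```python
import re

# Pattern copied from the amdgpu lines quoted in this thread.
VRAM_READY = re.compile(r"amdgpu: (\d+)M of VRAM memory ready")


def vram_mib_from_log(log_text):
    """Return the VRAM sizes (in MiB) announced by amdgpu in a kernel log."""
    return [int(size) for size in VRAM_READY.findall(log_text)]


sample = """\
[ 0.870069] [drm] amdgpu: 512M of VRAM memory ready
[ 0.870072] [drm] amdgpu: 3072M of GTT memory ready.
"""
print(vram_mib_from_log(sample))
```

For the sample above, only the VRAM line matches; the GTT line is ignored.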

https://launchpad.net/~os369510 wrote on 2021-09-07 11:45:48:

It seems this memory region does not correspond to the BIOS-reserved memory, and the values in the kernel log are reported by amdgpu (probably obtained from the FW).

I think we need to list all FW reserved memory first.

https://launchpad.net/~binli wrote on 2022-03-15 06:05:01:

/proc/meminfo reports: 6.61GiB
lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.

https://launchpad.net/~binli wrote on 2022-03-15 06:10:18:

To keep it simple, could we just update the 6G to 8G? Thanks!

https://launchpad.net/~bladernr wrote on 2022-03-15 14:04:43:

Before you just change the criteria to make the test pass, can you answer the question of WHY so much memory is being shunted elsewhere?

In this case, you have a machine with 8GB of RAM, and nearly 20% of that RAM is unavailable to the OS because it's being consumed somewhere else. I'm not saying that lowering the limit just to get the test to pass is the wrong answer here, only that by lowering the threshold to fail, you're likely to hide other cases where this shouldn't be happening.

In general, when I review certs, I expect that some things will fail in some cases, and in those cases I will ask questions, and either accept that or reject it based on the answers to those questions. IMO, in the case of a test that has existed for 8 years and has done its job all that time, lowering it because one machine isn't working as the test expects seems a bit premature?

https://launchpad.net/~kissiel wrote on 2022-03-15 17:49:48:

If the system reserves so much memory this should be well documented and justified. But this IMHO should not warrant changing the thresholds for all systems. If there is justification for that special system, create a custom job for that system, or make the threshold customizable via configs with the default being what has been used for years.
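A configurable threshold with the long-standing value as the default could look roughly like the sketch below. Everything here is hypothetical: the variable name `MEMORY_VARIANCE_THRESHOLD`, the function `allowed_variance`, and the choice of an environment variable as the configuration channel are assumptions for illustration, not an existing checkbox option. The 10% default is the figure quoted in the failure messages in this thread.

```python
import os

DEFAULT_THRESHOLD = 10.0  # percent, the value quoted in the failure message


def allowed_variance(environ=os.environ, name="MEMORY_VARIANCE_THRESHOLD"):
    """Return the allowed variance in percent, falling back to the default."""
    raw = environ.get(name, "").strip()
    if not raw:
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD
    # Ignore nonsensical overrides rather than silently disabling the check.
    return value if 0 < value < 100 else DEFAULT_THRESHOLD


print(allowed_variance({}))
print(allowed_variance({"MEMORY_VARIANCE_THRESHOLD": "20"}))
```

Keeping the default untouched means an override has to be set deliberately for a specific system, so it can be documented and justified during review.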

https://launchpad.net/~binli wrote on 2022-03-16 08:40:33:

I reviewed all the related bugs in the sutton project; all the affected configs are AMD platforms. I also found that it fails when RAM is bigger than 8G, and changing the value to 20 for 8G would not fix all the issues.

On M75n I found the difference is 28.34%, because it uses 2G of shared memory for VRAM, and I could not change the value from the BIOS.

Results:
    /proc/meminfo reports: 5.73GiB
    lshw reports: 8GiB

FAIL: Meminfo reports 2434531328 less than lshw, a difference of 28.34%. Only a variance of 10% in reported memory is allowed.

[ 0.746168] amdgpu 0000:04:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[ 0.746181] [drm] Detected VRAM RAM=2048M, BAR=2048M
[ 0.746246] [drm] amdgpu: 2048M of VRAM memory ready

https://launchpad.net/~binli wrote on 2022-03-16 09:07:27:

On drift3-amd, the memory is 32G. In the BIOS there is a "UMA frame buffer size" setting. By default it is set to 'Auto', and there are also 1GB, 2GB, 4GB and 8GB options. When I set it to 1G or 2G, this test case passes; according to dmesg, 'Auto' means 4G. This issue does not appear to be related to the Prefetchable value in lspci, which stays at 256M whatever the VRAM size is.

In this case 4G is used as the default for shared memory, which sounds reasonable. How can we avoid the failure of the memory/info test case?

Results:
    /proc/meminfo reports: 27.25GiB
    lshw reports: 32GiB

FAIL: Meminfo reports 5104971776 less than lshw, a difference of 14.86%. Only a variance of 10% in reported memory is allowed.

Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of VRAM memory ready
Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of GTT memory ready.

$ sudo lspci -nv | grep Prefetchable
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: 0000000830000000-00000008301fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Prefetchable memory behind bridge: 0000000860000000-00000008701fffff [size=258M]

https://launchpad.net/~binli wrote on 2022-03-16 09:29:05:

On golem-amd, the memory is 8G. In the BIOS there is a "UMA frame buffer size" setting. By default it is set to 'Auto', and there are also 1GB, 2GB and 4GB options. According to dmesg, 'Auto' means 1G.

Is it possible that we compare lshw with the sum of VRAM and /proc/meminfo?

[ 1.057123] [drm] Detected VRAM RAM=1024M, BAR=1024M
[ 1.057123] [drm] RAM width 64bits DDR4
[ 1.057152] [drm] amdgpu: 1024M of VRAM memory ready
[ 1.057153] [drm] amdgpu: 3072M of GTT memory ready.

Results:
    /proc/meminfo reports: 6.61GiB
    lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.
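The idea of comparing lshw against the sum of VRAM and /proc/meminfo can be tried on the three failures reported so far in this thread. This is a sketch under assumptions, not the eventual fix: `adjusted_variance` is a made-up name, the VRAM sizes are read off the kernel log lines quoted above with "M" treated as MiB, and it assumes the whole VRAM carve-out comes from the RAM that lshw counts.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def adjusted_variance(lshw_bytes, meminfo_bytes, vram_bytes):
    """Variance in percent after crediting carved-out VRAM to the kernel's total."""
    visible = meminfo_bytes + vram_bytes
    return (lshw_bytes - visible) / lshw_bytes * 100


# (name, lshw total, shortfall reported by the test, VRAM from the kernel log)
cases = [
    ("M75n", 8 * GIB, 2434531328, 2048 * MIB),
    ("drift3-amd", 32 * GIB, 5104971776, 4096 * MIB),
    ("golem-amd", 8 * GIB, 1493368832, 1024 * MIB),
]

for name, lshw, shortfall, vram in cases:
    meminfo = lshw - shortfall
    raw = adjusted_variance(lshw, meminfo, 0)
    adjusted = adjusted_variance(lshw, meminfo, vram)
    print(f"{name}: raw {raw:.2f}%, with VRAM counted {adjusted:.2f}%")
```

With the VRAM counted, all three systems fall below the 10% limit, and the threshold itself stays unchanged for every other system.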

https://launchpad.net/~binli wrote on 2022-03-17 06:40:49:

From 'glxinfo -B' we can get the 'Video memory' value. If we could count this value together with /proc/meminfo, the issue would be fixed on all the platforms on my side. Thanks!

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD RENOIR (DRM 3.42.0, 5.14.0-1027-oem, LLVM 12.0.0) (0x15e7)
    Version: 21.2.6
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2

https://launchpad.net/~os369510 wrote on 2022-03-17 14:26:28:

The value from comment#15 is GLX_RENDERER_UNIFIED_MEMORY_ARCHITECTURE_MESA which is not exactly correct in my I+N system.

If possible, we had better get the reserved memory from kernel space. So I am wondering why AMDGPU doesn't show the reserved memory in lspci?

https://launchpad.net/~binli wrote on 2022-03-18 05:48:08:

@kaihengfeng,

Here is the full lspci. Thanks!

https://launchpad.net/~kaihengfeng wrote on 2022-03-18 13:16:49:

For amdgpu, there's a sysfs attribute, 'mem_info_vram_total', that shows the carved-out RAM size. So please consider that in the checkbox logic.

Using BAR size as VRAM size is only accurate for discrete AMD GFX. AMD APU has its own way to decide VRAM size.

https://launchpad.net/~os369510 wrote on 2022-03-22 15:49:43:

I have no way of knowing whether an AMDGPU device is an APU unless amdgpu_device->flag exports it.

The "mem_info_vram_total" attribute seems to work on both AMD iGPUs and dGPUs. Let's consider counting them by GPU vendor.

https://launchpad.net/~kaihengfeng wrote on 2022-03-23 03:23:34:

So it's better to just use "mem_info_vram_total" - it will work regardless of integrated or discrete.
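Reading the attribute is straightforward. The sketch below sums `mem_info_vram_total` over every DRM card that exposes it. It is not the code from the merge proposal: the function name `total_amdgpu_vram` and the `drm_root` parameter are hypothetical, the `/sys/class/drm/cardN/device/` location is where amdgpu is expected to publish the attribute, and the value is assumed to be in bytes.

```python
import re
from pathlib import Path


def total_amdgpu_vram(drm_root="/sys/class/drm"):
    """Sum mem_info_vram_total (assumed to be bytes) over all DRM cards that have it."""
    root = Path(drm_root)
    if not root.is_dir():
        return 0
    total = 0
    for card in sorted(root.iterdir()):
        # Skip connector entries such as card0-eDP-1 and render nodes.
        if not re.fullmatch(r"card\d+", card.name):
            continue
        attr = card / "device" / "mem_info_vram_total"
        try:
            total += int(attr.read_text().strip())
        except (OSError, ValueError):
            # Not an amdgpu device, or the attribute is unreadable.
            continue
    return total


if __name__ == "__main__":
    print(f"amdgpu VRAM: {total_amdgpu_vram() / 1024 ** 2:.0f} MiB")
```

Because cards without the attribute are skipped, the same loop returns 0 on systems with no amdgpu device, so the existing comparison would behave as before there. Note that on a discrete card this figure is dedicated VRAM rather than a carve-out of system RAM, so a real check would need to decide whether to count it.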

https://launchpad.net/~os369510 wrote on 2022-03-23 05:26:06:

yeap, I proposed it as https://code.launchpad.net/~os369510/plainbox-provider-checkbox/+git/plainbox-provider-checkbox/+merge/416966