armbian / build

Armbian Linux build framework generates custom Debian or Ubuntu image for x86, aarch64, riscv64 & armhf
https://www.armbian.com
GNU General Public License v2.0
3.99k stars 2.25k forks source link

free() invalid pointer #4761

Closed blmhemu closed 1 year ago

blmhemu commented 1 year ago

What happened?

Getting free() invalid pointer issue when installing / using python3. Could be dpkg issue as well ! Board: Helios64 (I know, I know CSC) Chipset: RK3399

Screenshot 2023-01-28 at 9 44 08 AM

  ansible_facts: {}
  failed_modules:
    ansible.legacy.setup:
      ansible_facts:
        discovered_interpreter_python: /usr/bin/python3
      failed: true
      module_stderr: |-
        free(): invalid pointer
        Aborted
      module_stdout: ''
      msg: |-
        MODULE FAILURE
        See stdout/stderr for the exact error
      rc: 134
  msg: |-
    The following modules failed to execute: ansible.legacy.setup

When I did sudo apt update && sudo apt upgrade, it happened and hence I tried to reinstall. This issue occurs when trying to manage it with ansible as well. I think something might be wrong with the latest python.

Branch

master (main development branch)

On which host OS are you observing this problem?

Jammy

Relevant log output

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 2 reinstalled, 0 to remove and 0 not upgraded.
Need to get 0 B/472 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 44574 files and directories currently installed.)
Preparing to unpack .../python3-pkg-resources_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
double free or corruption (out)
Aborted
dpkg: warning: old python3-pkg-resources package pre-removal script subprocess returned error exit status 134
dpkg: trying script from the new package instead ...
dpkg: ... it looks like that went OK
Unpacking python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Preparing to unpack .../python3-setuptools_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
Unpacking python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
free(): invalid pointer
Aborted
dpkg: error processing package python3-setuptools (--configure):
 installed python3-setuptools package post-installation script subprocess returned error exit status 134
Errors were encountered while processing:
 python3-setuptools
E: Sub-process /usr/bin/dpkg returned an error code (1)

Code of Conduct

EvilOlaf commented 1 year ago

Unmaintained hardware Not a building issue. Please use forums for issues when running Armbian.

rpardini commented 1 year ago

Helios64 suffers (at least) from an overwritten DTS during rockchip64 patching, and is in dire need of a maintainer to clean it up. My efforts to find one have failed: people either have an old kernel running in a "production" helios64, or not willing to put in the effort.

blmhemu commented 1 year ago

Have some questions. Would be great if you can help me understand the following

Cross-linking https://forum.armbian.com/topic/26295-free-invalid-pointer-when-installing-python3-setuptools/ for those looking.

igorpecovnik commented 1 year ago

General instructions: https://docs.armbian.com/Board_Maintainers_Procedures_and_Guidelines/

Do you have some pointers / code references / prs on fixing this DTS overwrite issues ?

Check upstream code and what our patches does. Perhaps they are not needed anymore (maintainer - you - have to know that) at all or we have some features enabled that are not upstream or vice versa. Resolving that patch diffs.

Should we constantly flash the device and lose "production" persistent settings / config ?

Only prior to upcoming release or by request (rarely). But checking functionality by booting from SD card is already a lot better then no checking at all. No need to mess up your existing setup.

blmhemu commented 1 year ago

Check upstream code and what our patches does. Perhaps they are not needed anymore (maintainer - you - have to know that) at all or we have some features enabled that are not upstream or vice versa. Resolving that patch diffs.

I just checked upstream code dts - it was unchanged from jan 2022. Not sure if upstream change is the reason. But will try to look more into it tomorrow.

Only prior to upcoming release or by request (rarely). But checking functionality by booting from SD card is already a lot better then no checking at all. No need to mess up your existing setup.

That sounds promising.

Tonymac32 commented 1 year ago

For the DTS overwrite, basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline. Then it became available in mainline and the Armbian patch never got revised/removed. All the helios64 specific patches need rebased on the mainline device tree.

blmhemu commented 1 year ago

For the DTS overwrite, basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline. Then it became available in mainline and the Armbian patch never got revised/removed. All the helios64 specific patches need rebased on the mainline device tree.

To build it, i am assuming i need to make changes to patches in 5.15 folder and compile armbian again. Let me know if it is the way.

Will try to do it and report if it worked or not.

rpardini commented 1 year ago

basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline

Yes. The patch added a new file back then. Now, that file already exists in mainline, but is completely removed by the bash patching done in the master branch, thus becoming this monstrosity: https://github.com/rpardini/linux/commit/dc718a4d6ef073662c1d3f036136383a8428ff04

The fact it even boots is... surprising.

blmhemu commented 1 year ago

So, I tried removing the patch add-board-kobol-helios64... in 5.15 and compiled armbian. It did compile fine and even booted. Things I observed:

Also, a few more logs from dpkg

D000001: ensure_diversions: new, (re)loading
D000001: process queue pkg python3-pkg-resources:all queue.len 0 progress 1, try 1
D000040: checking dependencies of python3-pkg-resources:all (- <none>)
D000400:   checking group ...
D000400:     checking possibility  -> python3
D000400:       checking non-provided pkg python3:arm64
D000400:       is installed, ok and found
D000400:     found 3
D000400:   found 3 matched 0 possfixbytrig -
D000040: ok 2 msgs >><<
D000040:     checking Breaks
Setting up python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) ...
D000002: fork/exec /var/lib/dpkg/info/python3-pkg-resources.postinst ( configure 59.6.0-1.2ubuntu0.22.04.1 )
free(): invalid pointer
Aborted
dpkg: error processing package python3-pkg-resources (--configure):
 installed python3-pkg-resources package post-installation script subprocess returned error exit status 134
D000001: ensure_diversions: same, skipping
Errors were encountered while processing:
 python3-pkg-resources
❯ cat /var/lib/dpkg/info/python3-pkg-resources.postinst
#!/bin/sh
set -e

# Automatically added by dh_python3
if command -v py3compile >/dev/null 2>&1; then
    py3compile -p python3-pkg-resources
fi
if command -v pypy3compile >/dev/null 2>&1; then
    pypy3compile -p python3-pkg-resources  || true
fi

# End automatically added section

May have found the root cause

Screenshot 2023-01-29 at 4 45 08 PM

Digging even further Screenshot 2023-01-29 at 5 48 55 PM

Related searches

rpardini commented 1 year ago
  • The device booted ! Yay !
  • The fans are running at full speed unlike before.

Nice, probably some picking from the previous patch can result in a good-enough new patch.

  • The memory double free issue persists. Still getting the invalid pointer or double free issue. It honestly feels like something wrong with userspace / ubuntu / debian.

Did you try switching userspace? RELEASE=jammy etc?

blmhemu commented 1 year ago

Nice, probably some picking from the previous patch can result in a good-enough new patch.

Likely.

Did you try switching userspace? RELEASE=jammy etc?

I tried with jammy, impish, kinetic, debian. All suffer from this. This makes me think if this is a kernel / syscall issue 🤔

prahal commented 1 year ago

I also have this issue running Ansible to the helios64 python3. I was able to reproduce your py3compile crash. Both py3compile crash and Ansible rule that crash rae random. I sometimes get a kernel oops. Unlikely DTS. Likely a kernel code patch. Might even affect other rockchip64 device because most apps runs fine on the helios64. I always ends up with a kernel memory error (also random) and need to reboot.

blmhemu commented 1 year ago

I also have this issue running Ansible to the helios64 python3. I was able to reproduce your py3compile crash. Both py3compile crash and Ansible rule that crash rae random. I sometimes get a kernel oops. Unlikely DTS. Likely a kernel code patch. Might even affect other rockchip64 device because most apps runs fine on the helios64. I always ends up with a kernel memory error (also random) and need to reboot.

I tried searching for free() function calls, but could not find in patches 🤔

prahal commented 1 year ago

@blmhemu free is a userspace call. There is no such call as free in the kernel. All in all, I do not see how a python3 call can end up in kernel memory corruption. But I am unable to reproduce the python3 free invalid pointer issue on another armbian box also running bullseye with linux-image current 22.11.4 kernel 6.1.7-meson64. So userspace python3 is identical on both. It could still be that the issue is only harder to reproduce on armbian meson, hard to ascertain.

I was able to reproduce on helios64 with linux image edge 22.11.4 kernel 6.1.7-rockchip64 but I was able to do two runs of ansible before the bug "free(): invalid pointer" triggered. I also got "munmap_chunk(): invalid pointer" from running sudo py3compile -p python3-pkg-resources in a loop.

About the lack of maintainership, I would like to help with that, but until I find out why my board is unstable I focus on debugging the stability issue. Be it the kernel crash due to memory corruption (and half of the time the board hangs without even rebooting on panic, and most of the time at reboot after panic it hangs early on) or the rk3399 hs400es breakage that I nailed down to a commit in https://forum.armbian.com/topic/18855-upgrading-to-bullseye-troubleshooting-armbian-21081/page/3/#comment-128793.

igorpecovnik commented 1 year ago

About the lack of maintainership, I would like to help with that

https://docs.armbian.com/Board_Maintainers_Procedures_and_Guidelines/ tl;dr; = scan the text, send contact details (link is in the text) and beeing around to run some tests and fix (or at least log) problems prior to major releases / when kernel version is changed or similar bigger changes like upcoming build system upgrade.

rpardini commented 1 year ago

Next suggestion: remove kernel patches, maybe all of them, and try a build with mostly mainline only stuff. Does problem persist? If not (my bet...) bisect the evil patch out of rockchip64... another idea: update u-boot and/or blobs.

prahal commented 1 year ago

I change vin-supply in pwm-supply in helios64 board dts vdd-log section. vdd-log is known for stability issues if not powered properly and without this fix it was assigned the dummy regulator.

Hard stretched but I am still not set if the helios64 kernel memory errors I have are due to a driver which corrupts the memory or a wrong setting that makes the CPU unstable.

prahal commented 1 year ago

@blmhemu do you run OMV above helios64?

OMV has a few optimizations (sysctl), tools that stress the memory (folder2ram), and probably others that might make the bug from the unknown source more visible, but it could be that the issue is not helios64 specific.

Sorry unlikely an OMV-related issue as from above I understand you ran vanilla jammy, impish, kinetic, debian and reproduced.

Tonymac32 commented 1 year ago

can anyone reproduce this issue with another RK3399 board?

blmhemu commented 1 year ago

@blmhemu do you run OMV above helios64?

No. Plain ubuntu.

prahal commented 1 year ago

I do not have another RK3399 board to test. If anyone could, I made a simpler test case:

$ for i in $(seq 1 100);do python3 -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
double free or corruption (out)
Abandon (core dumped)

I also get "free(): invalid pointer", "double free or corruption (out)" or else.

I tried with the python3 debugger:

$ for i in $(seq 1 100);do python3-dbg -X tracemalloc -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
Debug memory block at address p=0x800ffffb7b29ab0: API ''
    18302063728016752640 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0x00 *** OUCH
        at p-6: 0x00 *** OUCH
        at p-5: 0x00 *** OUCH
        at p-4: 0x00 *** OUCH
        at p-3: 0x00 *** OUCH
        at p-2: 0x00 *** OUCH
        at p-1: 0x00 *** OUCH
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0x5fefdfdb4b29ab0 are Erreur de segmentation (core dumped)

this is with the python3 debugger with the tracemalloc flag. Note that I have a hard time getting an error with the tracemalloc flag, most of the time the hundredth tests run proceed fine with this flag. As it slows down the process quite a lot it could point to a timing-dependent issue.

With the python3 debugger without flags:

for i in $(seq 1 100);do python3-dbg -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
Debug memory block at address p=0x4000ffffbb8f58d0: API '�'
    3508782105221857280 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x70b2b2bbbb8e58d0 are Erreur de segmentation (core dumped)

Note that if I do the loop in python code instead of calling the python runtime in a loop the crash does not occur.

import pkg_resources

for x in range(0, 10000):
    pkg_resources.parse_version("1")

works.

I also wonder why only python3 is affected on my system. Maybe running other python3 setups in docker containers on the helios64 could confirm if this is due to the userspace setup (or it could also be that this particular python3.9 setup stress test a specific issue with the kernel or hardware and another setup will just hide the issue).

prahal commented 1 year ago

@blmhemu could you paste the output of /proc/buddyinfo when python3 starts to output invalid free? It seems that python3 does not cope well when an allocation fails and tries to free it even if it was not allocated. That may explain our issue. I had page allocation of order 7 errors in my kernel log (but not always).

What is not clear is why we get these even though all was fine before. The fact is it may be another issue. In the process of debugging this invalid free I turned off my zswap so maybe I produced a page allocation failure with my debug attempts. Also, it is not that I do not have memory left but that memory is fragmented and there are no higher allocation pages left.

I tested in a docker container on the helios64 bullseye (still with latest master edge kernel) with Debian bookworm python3.11 in the container and was able to reproduce the invalid free.

prahal commented 1 year ago

@blmhemu sorry to bother you again. Do your tests with different releases (jammy, kinetic, etc) all run with a different "current" armbian kernel, or did the build run with the latest kernel (6.1 ?)? The issue is python3.9 on my board was not upgraded for months and it was working (a long time ago but I cannot tell when it broke. Still, I doubt it was broken a year ago. So this leaves the libraries below python3.9 or the kernel. Though if you already tested a lot of stable kernels this makes the kernel an unlikely target and can save time in debugging. I tried the latest armbian kernel 6.1.12 without armbian patches (but still with helios64 armbian dts, aufs and wifi patches) and I still get the python3 free invalid.

prahal commented 1 year ago

I made a mistake and thus u-boot booted on my old eMMC install (which I left untouched since at least July 2022). The issue python3 invalid free is reproducible there ... 5.15.48-rockchip64 #22.05.3. I do not know how I did notice back then this ansible python3 issue (I already had ansible setup for the helios64 but it is true that I had issues with my roles and thus tended to run the playbooks on a targeted system so maybe I did not try helios64 for a long time or at all).

Then I tried on the SD card install (up to date bullseye) with current armbian kernel Linux helios64 5.15.89-rockchip64 #22.11.4 and the issue is also reproducible.

blmhemu commented 1 year ago

Hey @prahal ! I am currently a bit busy with life and will likely start testing from next week. In the meanwhile, I did install the latest bullseye from this armbian mirror and it works smoothly with ansible and could not see the free issue.

prahal commented 1 year ago

@blmhemu I confirm that just installing linux-image-current-rockchip64 package and its ad-hoc linux-dtb current-rockchip64 package at version 21.08.2, which is 5.10.63-rockchip64, fixes this issue. I was able to run:

for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done

five times without an issue.

I can even run the test case fine with latest kernels if I disable cpufreq with kernel boot parameter cpufreq.off=1

prahal commented 1 year ago

I have been able to reproduce the python3 invalid free with linux-image-legacy-rk3399 that is 4.4.213. I believe I did not encounter the issue before because before my ansible setup was using the python2 installed on the helios64, not the python3.

@blmhemu, I need to retry but I believe with cpufreq disabled (cpufreq.off=1 on linux kernel command line) I was not able to reproduce the python3 invalid free with latest kernel (and probably current too). If you could confirm that would help. You can add cpufreq.off=1 to /boot/armbianEnv.txt extraargs=.

Also I tried latest 6.1.12 with cpufreq enabled and all armbian patches removed except the add helios64 board add-board-helios64.patch, board-helios64-remove-pcie-ep-gpios.patch, my emmc hs400 es patch to read emmc hs400 and rk3399-enable-dwc3-xhci-usb-trb-quirk.patch and I can still reproduce the issue (I also tried with EXTRAWIFI=no AUFS=no).

thomas-maurice commented 1 year ago

Hello !

@prahal I can confirm that the cpufreq.off=1 worked on the current bulleseye build, I was able to successfully install OMV on my helios64 wihtout any issues

root@helios64:~# uname -a
Linux helios64 5.15.104-rockchip64 #3 SMP PREEMPT Wed Mar 22 12:31:37 UTC 2023 aarch64 GNU/Linux

I built the image with the ./compile.sh script of the toolchain without any customisation.

blmhemu commented 1 year ago

Also I tried latest 6.1.12 with cpufreq enabled and all armbian patches removed except the add helios64 board add-board-helios64.patch, board-helios64-remove-pcie-ep-gpios.patch, my emmc hs400 es patch to read emmc hs400 and rk3399-enable-dwc3-xhci-usb-trb-quirk.patch and I can still reproduce the issue (I also tried with EXTRAWIFI=no AUFS=no).

Does this mean, it could be a bug in upstream kernel ?

prahal commented 1 year ago

@blmhemu It is a bug that could affect the upstream kernel. But I believe to bug report the upstream kernel, one would have to test with the upstream kernel helios64 dts and I did not try with it. Could be that an HW feature we enable in the armbian helios64 dts is not implemented correctly or has a hardware issue.

I had issue with swap code (I cannot tell if it is the swap code that has issue though unlikely or if another bug makes the swap code crash). Do you have a swap partition, zram or zswap (the two latters should be on on Armbian by default though)? I have not yet tested with no swap at all.

Call for help: I wonder how one could test if the HW is behaving properly. I tried testing voltage on the board while it is running (but the trace are so small I am afraid to short them). Voltages looked fine (12V, 5V). Also, the helios64 chips were delivered during COVID so I am wondering if people having no crash at all are just not stressing their HW enough or not triggering specific software code paths or if only a handful of boards have faulty chips. Seems borg, urbackup and jellyfin stress the board enough to get it to require hacks like cpufreq.off (but I still get crash from time to time so the issue might not be cpufreq at all, only that cpufreq renders it more frequently... or there are twxo separate issue. I investigate the PCI bug with RK3399 but it seems it is not the same issue I encountered, ie the error log do not match - but if it is upgrading the HDD firmware might help. I have a power domain issue at automatic reboot after a board hang. There was an issue with the U-Boot ATF not resetting the power domains but it should be fixed long ago).

jmue commented 1 year ago

Hi, I'm also a owner of a helios64 device.

I downloaded https://redirect.armbian.com/helios64/Bookworm_current_minimal and also encountered the same Python crashes.

I took another sd-card and installed https://imola.armbian.com/archive/helios64/archive/Armbian_21.08.2_Helios64_bullseye_current_5.10.63.img.xz on it. There are no crashes, everything is stable. So I installed all available updates for user space, kernel and bsp, but still no python errors. After that I upgraded the image to Bookworm user space as described in https://www.cyberciti.biz/faq/update-upgrade-debian-11-to-debian-12-bookworm/, no Python crashes occur either. Then I upgraded to bookworm/jammy bsp & kernel and the system is running stable on every python test. I have also cleared all package differences.

So now I have two different SD cards with the same package base running on Armbian 23.08 trunk with kernel 6.1.38, one behaves stable, the other one crashes every time on the provided Python test.

Here I'm lost, I cant find the missing piece. I

There are minor variations but I can't find the decisive one.

Still the upgraded system behaves stable whereas the newly installed one crashes at the python test.

Maybe someone can confirm my findings.

prahal commented 1 year ago

@jmue u-boot is different on both cards? Could you share the U-Boot prompt from each of the SD cards? I would like to try with the u-boot from your good SD card.

Mind I had segfaults with other programs than python3 even though rarely (atop recently).

Just now I saw that I get memory allocation failures (due to fragmentation) when my python3 test case errors out:

phn@helios64:~$ for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
munmap_chunk(): invalid pointer
Abandon (core dumped)

juil. 18 03:13:44 helios64 kernel: warn_alloc: 1 callbacks suppressed
juil. 18 03:13:44 helios64 kernel: python3: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=user.slice,mems_allowed=0
juil. 18 03:13:44 helios64 kernel: CPU: 4 PID: 1028782 Comm: python3 Tainted: G        WC         6.3.12-rockchip64 #12
juil. 18 03:13:44 helios64 kernel: Hardware name: Helios64 (DT)
juil. 18 03:13:44 helios64 kernel: Call trace:
juil. 18 03:13:44 helios64 kernel:  dump_backtrace+0x9c/0x120
juil. 18 03:13:44 helios64 kernel:  show_stack+0x18/0x24
juil. 18 03:13:44 helios64 kernel:  dump_stack_lvl+0x48/0x60
juil. 18 03:13:44 helios64 kernel:  dump_stack+0x1c/0x28
juil. 18 03:13:44 helios64 kernel:  warn_alloc+0x1b0/0x290
juil. 18 03:13:44 helios64 kernel:  __alloc_pages+0x127c/0x1378
juil. 18 03:13:44 helios64 kernel:  __kmalloc_large_node+0x98/0x184
juil. 18 03:13:44 helios64 kernel:  __kmalloc+0xbc/0x110
juil. 18 03:13:44 helios64 kernel:  __regset_get+0xd8/0x12c
juil. 18 03:13:44 helios64 kernel:  regset_get_alloc+0x44/0x58
juil. 18 03:13:44 helios64 kernel:  elf_core_dump+0x784/0x1444
juil. 18 03:13:44 helios64 kernel:  do_coredump+0x152c/0x199c
juil. 18 03:13:44 helios64 kernel:  get_signal+0xd34/0xe20
juil. 18 03:13:44 helios64 kernel:  do_notify_resume+0x254/0x1b5c
juil. 18 03:13:44 helios64 kernel:  el0_svc+0x54/0x64
juil. 18 03:13:44 helios64 kernel:  el0t_64_sync_handler+0xbc/0x138
juil. 18 03:13:44 helios64 kernel:  el0t_64_sync+0x190/0x194
juil. 18 03:13:44 helios64 kernel: Mem-Info:
juil. 18 03:13:44 helios64 kernel: active_anon:54749 inactive_anon:60134 isolated_anon:0
                                    active_file:70081 inactive_file:136152 isolated_file:0
                                    unevictable:7559 dirty:241 writeback:0
                                    slab_reclaimable:106631 slab_unreclaimable:87937
                                    mapped:23823 shmem:6545 pagetables:5207
                                    sec_pagetables:0 bounce:0
                                    kernel_misc_reclaimable:0
                                    free:23558 free_pcp:441 free_cma:0
juil. 18 03:13:44 helios64 kernel: Node 0 active_anon:218996kB inactive_anon:240536kB active_file:280324kB inactive_file:544608kB unevictable:30236kB isolated(anon):0kB isolated(file):0kB mapped:95292kB dirty:9>
juil. 18 03:13:44 helios64 kernel: Node 0 DMA free:93980kB boost:0kB min:22528kB low:28160kB high:33792kB reserved_highatomic:2048KB active_anon:218996kB inactive_anon:240536kB active_file:280324kB inactive_fil>
juil. 18 03:13:44 helios64 kernel: lowmem_reserve[]: 0 0 0 0
juil. 18 03:13:44 helios64 kernel: Node 0 DMA: 3541*4kB (UEH) 2167*8kB (UMEH) 1567*16kB (UMEH) 1119*32kB (UMEH) 27*64kB (UMEH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 94108kB
juil. 18 03:13:44 helios64 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
juil. 18 03:13:44 helios64 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
juil. 18 03:13:44 helios64 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
juil. 18 03:13:44 helios64 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
juil. 18 03:13:44 helios64 kernel: 217886 total pagecache pages
juil. 18 03:13:44 helios64 kernel: 3477 pages in swap cache
juil. 18 03:13:44 helios64 kernel: Free swap  = 21179212kB
juil. 18 03:13:44 helios64 kernel: Total swap = 24042916kB
juil. 18 03:13:44 helios64 kernel: 1015296 pages RAM
juil. 18 03:13:44 helios64 kernel: 0 pages HighMem/MovableOnly
juil. 18 03:13:44 helios64 kernel: 162005 pages reserved
juil. 18 03:13:44 helios64 kernel: 32768 pages cma reserved

After the python3 error, I have:

phn@helios64:~$ LC_ALL=C free -mw
               total        used        free      shared     buffers       cache   available
Mem:            3333        2115          82          28          95        1162        1217
Swap:          23479        2779       20700

I have zram enabled and zswap disabled.

phn@helios64:~$ sudo zramctl
NAME       ALGORITHM DISKSIZE  DATA  COMPR  TOTAL STREAMS MOUNTPOINT
/dev/zram0 zstd          2,9G  2,7G 880,5M 916,6M       6 [SWAP]

and a swap partition on one of the HDD:

phn@helios64:~$ sudo swapon -s
Filename                Type        Size        Used        Priority
/dev/sdh2                               partition   20971072    0       -2
/dev/zram0                              partition   3071844     2845016     100

This might make more sense than the u-boot idea. Could you check journalctl -b -k if you manage to trigger the python3 invalid pointer issue with:

for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
jmue commented 1 year ago

So here is the ouput of the two systems of kernel cmdline, zramctl, swapon & the test case.

armbian upgraded from 21.08:

root@helios64.fritz.box's password: 
 _   _      _ _            __   _  _   
| | | | ___| (_) ___  ___ / /_ | || |  
| |_| |/ _ \ | |/ _ \/ __| '_ \| || |_ 
|  _  |  __/ | | (_) \__ \ (_) |__   _|
|_| |_|\___|_|_|\___/|___/\___/   |_|  

Welcome to Armbian 23.08.0-trunk Jammy with Linux 6.1.38-rockchip64

No end-user support: untested automated build

System load:   36%              Up time:       1 min    
Memory usage:  4% of 3.77G      IP:        192.168.20.69
CPU temp:      43°C             Usage of /:    17% of 7.2G      

Last login: Mon Jul 17 20:46:30 2023 from 192.168.20.64
root@helios64:~# cat /proc/cmdline 
root=UUID=939c3717-edfc-4ff6-b0b5-4b0058bda414 rootwait rootfstype=ext4 console=ttyS2,1500000 console=tty1 consoleblank=0 loglevel=1 ubootpart=8b029f90-01 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u   cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1
root@helios64:~# zramctl 
NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       1.9G   4K   74B   12K       6 [SWAP]
root@helios64:~# swapon -s
Filename                Type        Size        Used        Priority
/dev/zram0                              partition   1978500     0       5
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
root@helios64:~# free -mw
               total        used        free      shared     buffers       cache   available
Mem:            3864         219        3542           8          13         171        3645
Swap:           1932           0        1932
root@helios64:~# journalctl -b -k
Journal file /var/log/journal/4cd2c68aaa514c8282d4d1834ecf3199/system@33fa8ccbfc814963af7c8862c925ac54-0000000000003f31-000600ac98b20fd7.journal is truncated, ignoring file.
Jul 18 17:58:46 helios64 systemd-journald[767]: Received client request to relinquish /var/log/journal/4cd2c68aaa514c8282d4d1834ecf3199 acc>
Jul 18 17:58:46 helios64 systemd-journald[767]: Received client request to relinquish /var/log/journal/4cd2c68aaa514c8282d4d1834ecf3199 acc>
root@helios64:~# 

armbian new install

root@helios64.fritz.box's password: 
 _   _      _ _            __   _  _   
| | | | ___| (_) ___  ___ / /_ | || |  
| |_| |/ _ \ | |/ _ \/ __| '_ \| || |_ 
|  _  |  __/ | | (_) \__ \ (_) |__   _|
|_| |_|\___|_|_|\___/|___/\___/   |_|  

Welcome to Armbian 23.08.0-trunk Bookworm with Linux 6.1.38-rockchip64

No end-user support: community creations

System load:   18%              Up time:       1 min    
Memory usage:  4% of 3.77G      IP:        192.168.20.70
CPU temp:      38°C             Usage of /:    68% of 1.8G      

Last login: Mon Jul 17 20:28:20 2023 from 192.168.20.64
root@helios64:~# cat /proc/cmdline
root=UUID=9cc1f6a2-d1b6-46c1-ade3-4d66bbd1d54d rootwait rootfstype=ext4 splash=verbose console=ttyS2,1500000 console=tty1 consoleblank=0 loglevel=1 ubootpart=0bcbf642-01 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u   cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory swapaccount=1
root@helios64:~# zramctl 
NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       1.9G   4K   74B   12K       6 [SWAP]
root@helios64:~# swapon -s
Filename                Type        Size        Used        Priority
/dev/zram0                              partition   1978500     0       5
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
free(): invalid pointer
Aborted
root@helios64:~# free -mw
               total        used        free      shared     buffers       cache   available
Mem:            3864         224        3555           9           9         157        3640
Swap:           1932           0        1932
root@helios64:~# journalctl -b -k
Journal file /var/log/journal/cbf35eccb86d41699897e19be784fffa/system@53ab717da4a3495dbb21e03036011538-00000000000037af-000600b2ecfa6d29.journal is truncated, ignoring file.
Jul 18 17:51:01 helios64 systemd-journald[768]: Received client request to relinquish /var/log/journal/cbf35eccb86d41699897e19be784fffa acc>
Jul 18 17:51:01 helios64 systemd-journald[768]: Received client request to relinquish /var/log/journal/cbf35eccb86d41699897e19be784fffa acc>
root@helios64:~# 
jmue commented 1 year ago

I would also note that stress does not cause instability in either system:

root@helios64:~# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 60s
stress: info: [1818] dispatching hogs: 8 cpu, 4 io, 2 vm, 0 hdd
stress: info: [1818] successful run completed in 60s
jmue commented 1 year ago

Here are the diffs taken with user rights for privacy reasons:

diff-boot.txt diff-initrd.txt diff-etc.txt

blmhemu commented 1 year ago

@jmue how did you upgrade the kernel ? I followed the steps in that article and while the debian version and packages did upgrade, the kernel did not.

jmue commented 1 year ago

@blmhemu Basically you change all the references of bullseye to jammy in /etc/apt/sources.list.d/armbian.list, also change the server url from http://apt.armbian.com to http://beta.armbian.com. Why jammy and not bookworm? Well armbian-bsp-cli-helios64-current depends on base-files version 23.08.0-trunk, which is not available in armbian bookworm. Then you run apt update && apt upgrade.

jmue commented 1 year ago

A hint could be the different detected version of psci. However, I do not understand how this happens, since I use the same kernel and the same dtb in both images.

As psci is the interface for power management, this could also point to the cpufreq direction. More context: An overview of PSCI

armbian upgraded:

root@helios64:~# journalctl -b | grep -e 'psci'
Jul 20 21:43:22 helios64 kernel: psci: probing for conduit method from DT.
Jul 20 21:43:22 helios64 kernel: psci: PSCIv1.0 detected in firmware.
Jul 20 21:43:22 helios64 kernel: psci: Using standard PSCI v0.2 function IDs
Jul 20 21:43:22 helios64 kernel: psci: MIGRATE_INFO_TYPE not supported.
Jul 20 21:43:22 helios64 kernel: psci: SMC Calling Convention v1.0
root@helios64:~# apt list *current-rockchip64
Listing... Done
linux-dtb-current-rockchip64/jammy,now 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64 [installed]
linux-headers-current-rockchip64/jammy 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64
linux-image-current-rockchip64/jammy,now 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64 [installed]
root@helios64:~# sha256sum /boot/dtb-6.1.39-rockchip64/rockchip/rk3399-kobol-helios64.dtb 
aa677e9d64998e535ea29ebeb4073954bf7dac43de3ec3bccf2e7affd3db85f7  /boot/dtb-6.1.39-rockchip64/rockchip/rk3399-kobol-helios64.dtb

armbian newly installed:

root@helios64:~# journalctl -b | grep -e 'psci'
Jul 20 21:06:14 helios64 kernel: psci: probing for conduit method from DT.
Jul 20 21:06:14 helios64 kernel: psci: PSCIv1.1 detected in firmware.
Jul 20 21:06:14 helios64 kernel: psci: Using standard PSCI v0.2 function IDs
Jul 20 21:06:14 helios64 kernel: psci: MIGRATE_INFO_TYPE not supported.
Jul 20 21:06:14 helios64 kernel: psci: SMC Calling Convention v1.2
root@helios64:~# apt list *current-rockchip64
Listing... Done
linux-dtb-current-rockchip64/jammy,now 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64 [installed]
linux-headers-current-rockchip64/jammy 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64
linux-image-current-rockchip64/jammy,now 23.08.0-trunk--6.1.39-Sa456-D710f-P5d5f-C2dc9Hfe66-HK01ba-Vc222-B3b41 arm64 [installed]
root@helios64:~# sha256sum /boot/dtb-6.1.39-rockchip64/rockchip/rk3399-kobol-helios64.dtb 
aa677e9d64998e535ea29ebeb4073954bf7dac43de3ec3bccf2e7affd3db85f7  /boot/dtb-6.1.39-rockchip64/rockchip/rk3399-kobol-helios64.dtb
prahal commented 1 year ago

@jmue do you have a serial console to grab U-Boot output? Else you could try to get the U-Boot strings from the SD, but the serial console output would tell more.

I doubt the cpufreq hw hangs have anything to do with this python3 issue. Don't you have allocation failures in the kernel logs when python3 crash? (journalctl -k or dmesg). COuld you paste your kernel allocation failure error if any?

Though it might still be hardware related. Something like the rockchip hwrng being instable. I have not yet tried with it disabled in the dtb. But unlikely, as the kernel errors with allocation failures each time I get this python invalid pointer error. Could you give your cat /proc/buddyinfo output?

I am investigating the hw instability while cpufreq is switching and I had improvements (my board is hw crashing since I plugged it the first time). My tests are around plugging the helios64 power adapter into the wall plug against my UPS surge protector socket or the battery-backed one (before a few days ago, I always had the helios64 plugged in a UPS, first an APC one then an Eaton). Seems it was way more unstable when plugged into the surge-protected sockets. I have a power meter plug between the helios64 power adapter and the socket I plug it into. The RMS voltage looks fine (between 235V and 240V while the nominal value is 230V). The ground looks fine and alike both on the wall plug and on the surge protect socket of the UPS. It might be the other devices I plug into the UPS. Or it was merely a coincidence that plugging the helios64 into the wall plug fixed the cpufreq hw hangs.

Also, I came to suspect the issue might be related to the power adaptor (and maybe only the EU one...). For one I had static discharge when I touched the helios64 enclosure until I move the helios64 power adapter out of the box where I have my UPS and all the power adapters. I still need to reproduce with the adapter back in the box to confirm. But could be either the adapter or the power plug on the board is missing a component to avoid the current from going back (this is only a hypothesis as of now, but my guess is that somehow a current gets back to the helios64). Sadly I don't have an ATX power supply to test if supplying power via the ATX socket on the board also gets rid of the discharge from the enclosure.

Note my armbian upgraded to bookworm and linux 6.3.12-rockchip64 (armbian -edge) has the invalid free issue but show the same psci than your newly installed one:

[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.1 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.2

the firmware that the kernel connect to by the psci interface seems to be ATF which is shipped with U-boot https://opensource.rock-chips.com/wiki_Boot_option#u-boot.itb (shipped by linux-u-boot-helios64-edge, but not that imply installing the package does not install the new u-boot, you have to apply it manually with nand-sata-install, or call nand-sata-install via armbian-config > System > Install > Install/Update the bootloader on SD/eMMC).

I believe the next step is to try the current u-boot with the old ATF to see if the issue is with the ATF firmware (that seems to run at the same time as the kernel) or u-boot (that only initializes the board).

By the way, it would be great if anybody with a rk3399 based board tried the current ATF/U-boot and ran the python 3.11 test case to see if really helios64 is the only affected board: for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done because if the issue is with ATF it makes no sense for this issue to be helios64 specific. If with u-boot that would narrow the issue to helios64-specific initialization bits.

Note, you can check your installed ATF with U-Boot output:

NOTICE:  BL31: v2.6(release):a1f02f4f-dirty
NOTICE:  BL31: Built : 11:05:57, Feb 15 2023

then ATF is version v2.6.

It might be that you have to update the emmc ATF/U-Boot while the nand-sata-install script will install them to the SD card (as it rely on kernel root partition). But helios64 by default load the bootloader from emmc (except if you set a physical jumper). You can simply extract the commands to write the bootloader to the emmc from the u-boot package (for armbian edge linux-u-boot-helios64-edge the /usr/lib/u-boot/platform_install.sh script): that is if emmc is mmcblk1: change directory to the directory on the first line of /usr/lib/u-boot/platform_install.sh, here the first line is DIR=/usr/lib/linux-u-boot-edge-helios64_23.08.0-trunk_arm64

sudo dd if=./idbloader.img of=/dev/mmcblk1 seek=64 conv=notrunc status=none
sudo dd if=./u-boot.itb of=/dev/mmcblk1 seek=16384 conv=notrunc status=none

Setting BOOT_SCENARIO=tpl-blob-atf-mainline in config/board/helios64.csc and:

./compile.sh uboot BOARD=helios64 BRANCH=edge RELEASE=bookworm KERNEL_CONFIGURE=no BUILD_KSRC=no

then installing as told above. Reboot, I cannot get my test case to fail. for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done ten times and no issue.

So the only required change for this issue is the rockchip ddrbin blob which well set up the lpddr4. There are not many rk3399 boards with lpddr4 (Nanopi M4 v2 is one). Could be the settings in u-boot mainline are wrong. But they are copyrighted by rockchip themselves, as the ddrbin blob... One difference I see is the ddrbin blob setup the memory frequencies to 416MHz then 856MHz (and tell support 416 856 328 666 928 MHz, current 856MHz) at the end of its init, while the u-boot TPL code set the frequencies to 400MHz then 800MHz.

jmue commented 1 year ago

You are absolutely right, my ATF on the "good" card is:

NOTICE:  BL31: v1.3(debug):42583b6
NOTICE:  BL31: Built : 07:55:13, Oct 15 2019
NOTICE:  BL31: Rockchip release version: v1.1

The frequencies are as you described in your comment above. Where do we go from here? Should we use the mainline ATF or the older rockchip ATF? I just watched U-Boot as PSCI provider on ARM64, which has some additional background on this topic.

Thank you for your advice on where to look at.

As for the Python crashes, I don't see any kernel logs when it crashed.

root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
free(): invalid size
Aborted
root@helios64:~# journalctl -k
Journal file /var/log/journal/cbf35eccb86d41699897e19be784fffa/system@2186cf7b4d244bb091d40cbe1bb232c4-0000000000007d0a-000600efd7a8b2b7.journal is truncated, ignoring file.
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
munmap_chunk(): invalid pointer
Aborted
root@helios64:~# journalctl -k
Journal file /var/log/journal/cbf35eccb86d41699897e19be784fffa/system@2186cf7b4d244bb091d40cbe1bb232c4-0000000000007d0a-000600efd7a8b2b7.journal is truncated, ignoring file.
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>
root@helios64:~# for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
Segmentation fault
root@helios64:~# journalctl -k
Journal file /var/log/journal/cbf35eccb86d41699897e19be784fffa/system@2186cf7b4d244bb091d40cbe1bb232c4-0000000000007d0a-000600efd7a8b2b7.journal is truncated, ignoring file.
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>
Jul 21 21:28:20 helios64 systemd-journald[765]: Received client request to reli>

and my buddyinfo

root@helios64:~# cat /proc/buddyinfo 
Node 0, zone      DMA    571    480    186     87     42     10      3      3      3      4    880 
prahal commented 1 year ago

@jmue ATF is fine at least I was unable to reproduce the issue at stake with current ATF. But I was not able to repdouce the bug using rockchip ddrbin instead of u-boot version of the ram initialization code. The only way out that I know of as of yet (ie reverting to the ddrbin rockchipp closed source initialization code) is: add BOOT_SCENARIO=tpl-blob-atf-mainline in config/board/helios64.csc and:

./compile.sh uboot BOARD=helios64 BRANCH=edge RELEASE=bookworm KERNEL_CONFIGURE=no BUILD_KSRC=no

then install the package and apply the firmware to the emmc drive as I told you above.

I am currently trying to upgrade to u-boot v2022.10 as there were a few rockchip-related ram commits around September 2022.

Note that I have been able to reproduce this python3 error invalid-free without getting a kernel allocation failure. But your setup helps me discard the out-of-memory fragmentation issue on my side, at least as a cause for this bug.

prahal commented 1 year ago

@blmhemu could you give your serial console U-Boot output so we can check you do not have the supposedly working ddrbin binary from rockchip loaded by U-Boot? Then you could compile and install U-Boot as I explained on this page (writing it to the emmc else it will not apply) to check it "workaround" the issue (in that well it works but it would be better to fix the U-Boot ram initialization code than reintroduce a closed binary blob).

blmhemu commented 1 year ago

@prahal here is the boot serial console - https://pastebin.pl/view/bd5f5d71 I have followed the upgrade path from bullseye (stable 21.08.2) to bookworm.

prahal commented 1 year ago

@blmhemu intersting. You already use the ddrbin DDR initialization in the bootloader.

Could you try with the latest u-boot, atf, and ddrbin ? instructions at https://github.com/armbian/build/issues/4761#issuecomment-1646211095

If older ddrbin (the latest with the above u-boot build instruction is "DDR Version 1.25 20210517" while yours is "Version 1.24 20191016") was not working it might make sense out of the fact I had various crashes even with ddrbin. Or it could be the issue was fixed by a mix of ATF, U-Boot fixes above the latest rockchip ddrbin.

You can test if the bootloader upgrade helped by running this command a few times (six times should be enough to be confident the issue is gone):

for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
blmhemu commented 1 year ago

@prahal I am sorry (I could not follow how to do the install) I did the build and got a linux-uboot-(*).deb. Can you help me with the detailed next steps ? (Should I copy this over to the helios and do dpkg install ?)

FWIW, I currently have the working (userspace upgraded) bookworm on my nvme / emmc combo (with the edge 6.1.11 kernel).

EDIT: Copied over the the u-boot and tried sudo dpkg -i linux-u-boot-helios64-edge_23.08.0-trunk--2022.07-Se092-P60ec-He8c0-V1c60-Bb358_arm64.deb it failed with the following error

dpkg: regarding linux-u-boot-helios64-edge_23.08.0-trunk--2022.07-Se092-P60ec-He8c0-V1c60-Bb358_arm64.deb containing linux-u-boot-helios64-edge:
 linux-u-boot-helios64-current conflicts with armbian-u-boot
  linux-u-boot-helios64-edge provides armbian-u-boot and is to be installed.

dpkg: error processing archive linux-u-boot-helios64-edge_23.08.0-trunk--2022.07-Se092-P60ec-He8c0-V1c60-Bb358_arm64.deb (--install):
 conflicting packages - not installing linux-u-boot-helios64-edge
Errors were encountered while processing:
 linux-u-boot-helios64-edge_23.08.0-trunk--2022.07-Se092-P60ec-He8c0-V1c60-Bb358_arm64.deb
igorpecovnik commented 1 year ago

uninstall installed u-boot package ...

blmhemu commented 1 year ago

Tried sudo apt remove --purge armbian-u-boot- says something like Cannot uninstall virtual packages

igorpecovnik commented 1 year ago

correct, uninstall real package linux-u-boot-helios64-current

blmhemu commented 1 year ago

Uninstalled the above package and installed using dpkg -i Here is the new serial console output - https://pastebin.com/MmtpS7F9 It seems like ddrbin did not update ? Ran for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done -> could not repro issue.

igorpecovnik commented 1 year ago

installing packages does't flash u-boot. this is done via armbian-install utility.