armbian / build

Armbian Linux build framework generates custom Debian or Ubuntu image for x86, aarch64, riscv64 & armhf
https://www.armbian.com
GNU General Public License v2.0

free() invalid pointer #4761

Closed blmhemu closed 1 year ago

blmhemu commented 1 year ago

What happened?

I am getting a free(): invalid pointer error when installing or using python3. It could be a dpkg issue as well! Board: Helios64 (I know, I know, CSC). Chipset: RK3399.


  ansible_facts: {}
  failed_modules:
    ansible.legacy.setup:
      ansible_facts:
        discovered_interpreter_python: /usr/bin/python3
      failed: true
      module_stderr: |-
        free(): invalid pointer
        Aborted
      module_stdout: ''
      msg: |-
        MODULE FAILURE
        See stdout/stderr for the exact error
      rc: 134
  msg: |-
    The following modules failed to execute: ansible.legacy.setup

When I did sudo apt update && sudo apt upgrade, it happened and hence I tried to reinstall. This issue occurs when trying to manage it with ansible as well. I think something might be wrong with the latest python.

Branch

master (main development branch)

On which host OS are you observing this problem?

Jammy

Relevant log output

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 2 reinstalled, 0 to remove and 0 not upgraded.
Need to get 0 B/472 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 44574 files and directories currently installed.)
Preparing to unpack .../python3-pkg-resources_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
double free or corruption (out)
Aborted
dpkg: warning: old python3-pkg-resources package pre-removal script subprocess returned error exit status 134
dpkg: trying script from the new package instead ...
dpkg: ... it looks like that went OK
Unpacking python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Preparing to unpack .../python3-setuptools_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
Unpacking python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
free(): invalid pointer
Aborted
dpkg: error processing package python3-setuptools (--configure):
 installed python3-setuptools package post-installation script subprocess returned error exit status 134
Errors were encountered while processing:
 python3-setuptools
E: Sub-process /usr/bin/dpkg returned an error code (1)
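The exit status 134 in the logs above follows the usual shell convention that a process killed by a signal reports 128 plus the signal number, and SIGABRT is signal 6 on Linux. A minimal sketch of that decoding (the helper name decode_status is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: decode a shell exit status into either a plain
# exit code or the number of the signal that killed the process.
decode_status() {
    if [ "$1" -gt 128 ]; then
        echo "signal $(($1 - 128))"
    else
        echo "exit $1"
    fi
}

# Usage: run the failing command, then decode its status, e.g.
#   python3 -c "import pkg_resources"; decode_status $?
```

With this reading, 134 corresponds to signal 6 (SIGABRT), which is what glibc raises when it detects heap corruption such as an invalid free().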


prahal commented 1 year ago

Uninstalled the above package and installed using dpkg -i. Here is the new serial console output - https://pastebin.mozilla.org/YjfnuShC It seems like ddrbin did not update?

Indeed, as Igor hinted, the package only puts the u-boot files in a folder. You can do the u-boot upgrade via armbian-install, then choose 5 'Install/Update the bootloader on SD/eMMC'.

But please keep in mind that if you do not run your OS from the eMMC and want to install the bootloader to the eMMC, this will not do the job. armbian-install writes the bootloader to the device reported by: lsblk -ndo pkname $root_partition "$(sudo blkid | tr -d '":' | grep "$(sed -e 's/^.*root=//' -e 's/ .*$//' < /proc/cmdline)" | awk '{print $1}')" which points to the SD card if you booted from the SD card, not the eMMC (and the Helios64 looks for u-boot on the eMMC first, unless you put a jumper on the board to disable the eMMC). You can run this command to find which device it will install to, or read the "root=" value in /proc/cmdline and compare it to the blkid output.
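As a rough sketch of the manual check described above, the root= value can be pulled out of the kernel command line and resolved to its parent disk. The function names are made up for illustration, and this is not the code armbian-install itself runs; it assumes findfs and lsblk from util-linux are available.

```shell
#!/bin/sh
# Hypothetical helper: print the value of root= from a kernel command line.
# $1: command line string (defaults to the running kernel's /proc/cmdline)
root_from_cmdline() {
    cmdline=${1:-$(cat /proc/cmdline)}
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^root=//p' | head -n 1
}

# Hypothetical helper: resolve that value to the disk holding the root
# partition, i.e. the device a bootloader update would most likely target.
target_disk() {
    root=$(root_from_cmdline "$@")
    case $root in
        /dev/*) part=$root ;;
        *) part=$(findfs "$root") || return 1 ;;  # handles UUID=, LABEL=, PARTUUID=
    esac
    lsblk -ndo pkname "$part"
}

# Usage: target_disk    # prints e.g. mmcblk0 or mmcblk1
```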

To install manually to the eMMC when you boot from the SD card: the linux-u-boot-helios64-edge package installs a script that tells armbian-install how to install u-boot on the Helios64, /usr/lib/u-boot/platform_install.sh, and the u-boot files to write to the board, in /usr/lib/linux-u-boot-edge-helios64_23.08.0-trunk_arm64 .

When this folder contains u-boot.itb, the platform_install.sh script executes:

dd if=$1/idbloader.img of=$2 seek=64 conv=notrunc status=none;
dd if=$1/u-boot.itb of=$2 seek=16384 conv=notrunc status=none;

where $1 is the folder containing the bootloader binaries, here /usr/lib/linux-u-boot-edge-helios64_23.08.0-trunk_arm64 and $2 is the device file for the storage you want to write u-boot to, here on linux 6.3 emmc is /dev/mmcblk1 (you can check with lsblk -f).
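Because writing to the wrong device can leave the board unbootable, it may help to print the commands first and review them before running anything. The sketch below only echoes the two dd lines quoted above, preceded by the backup command suggested later in this thread; the function name is made up, and nothing is written to any device.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the backup and write
# commands for a given bootloader folder and target device.
# $1: folder containing idbloader.img and u-boot.itb
# $2: target device, e.g. /dev/mmcblk1 (check with lsblk -f first)
print_uboot_write_cmds() {
    dir=$1
    dev=$2
    echo "dd if=$dev of=bootloader-backup.img bs=512 count=65535"
    echo "dd if=$dir/idbloader.img of=$dev seek=64 conv=notrunc status=none"
    echo "dd if=$dir/u-boot.itb of=$dev seek=16384 conv=notrunc status=none"
}

# Usage:
#   print_uboot_write_cmds /usr/lib/linux-u-boot-edge-helios64_23.08.0-trunk_arm64 /dev/mmcblk1
```

Once the printed device and paths look right, the lines can be run one by one as root.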

If your u-boot upgrade leaves you with a broken u-boot on the eMMC, add the jumper on the board, halt u-boot, remove the jumper, rescan the MMC device with mmc rescan, then enter the boot command in u-boot, and finally, from the Linux install on the SD card, manually write u-boot to the eMMC.

Ran for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done -> could not repro issue.

Could you try running this command at least six times before upgrading the u-boot? This is to sort out whether your old rockchip64 DDR blob is really also broken with regard to setting up the DDR parameters. If you cannot reproduce the issue, then there might be other issues (though unlikely). Now that we know which version of this blob you had, one could reproduce it by tweaking the Armbian build to install with this blob. Mind that I will be away for a few weeks very soon, so I won't be able to test such an install for quite a while.
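Running the loop several times by hand is easy to miscount, so a small wrapper can repeat it and report where it first fails. This is only a sketch; the function name repeat_check is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command ITERATIONS times per round, for ROUNDS
# rounds, and stop at the first failure while reporting its exit status.
# Usage: repeat_check ROUNDS ITERATIONS command [args...]
repeat_check() {
    rounds=$1
    iterations=$2
    shift 2
    r=1
    while [ "$r" -le "$rounds" ]; do
        i=1
        while [ "$i" -le "$iterations" ]; do
            "$@" || {
                rc=$?
                echo "round $r, iteration $i failed with status $rc"
                return "$rc"
            }
            i=$((i + 1))
        done
        r=$((r + 1))
    done
    echo "all $rounds rounds passed"
}

# Usage: repeat_check 6 100 python3 -c "import pkg_resources"
```

A failure status of 134 would match the abort seen in the original report.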

blmhemu commented 1 year ago

Could you try running this command at least six times before upgrading the u-boot?

Done - still stable

Updated the uboot (ddrbin now shows 1.25) - see https://pastebin.mozilla.org/M1XXJnLn Could not repro the error ! (Ran 6 times + multiple ansible runs)

prahal commented 1 year ago

@blmhemu then I believe when you did the latest bullseye install somehow you modified the installed bootloader. I believe you do not have a log of the u-boot output from when you had the invalid free issue, else could you give it?

Probably the 21st of February 2023, when you told us you got the issue fixed (sorry, I forgot you already had a fixed setup; I thought you were still suffering from this issue):

In the meanwhile, I did install the latest bullseye from this armbian mirror and it works smoothly with ansible and could not see the free issue.

So likely the rockchip DDR blob 1.24 is fine too.

blmhemu commented 1 year ago

I believe you do not have a log of the u-boot output from when you had the invalid free issue, else could you give it?

Unfortunately, I do not have those logs :(

blmhemu commented 1 year ago

@prahal Update: I was able to flash the recently compiled build (unstable with free issue) - Here are the logs you asked for https://pastebin.com/zGjvrvux

I diffed both the logs and here are my findings

The unstable build

lpddr4_set_rate: change freq to 400000000 mhz 0, 1  
lpddr4_set_rate: change freq to 800000000 mhz 1, 0  
Trying to boot from BOOTROM 
Returning to boot ROM...

vs

The stable build

ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz

Link to diff https://www.diffchecker.com/3D0UDOHx/ Left is unstable. Right is stable.
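To make this comparison repeatable on other saved serial logs, the two message styles quoted above can be used as a crude fingerprint. The sketch below assumes, based only on the two excerpts in this thread, that "ddr_set_rate to ...MHZ" lines come from the Rockchip DDR blob and "lpddr4_set_rate: change freq" lines come from the u-boot TPL; the function name and labels are made up.

```shell
#!/bin/sh
# Hypothetical helper: guess which DDR init path produced a saved serial log,
# using only the message formats seen in this thread.
ddr_init_kind() {
    if grep -q 'ddr_set_rate to [0-9][0-9]*MHZ' "$1"; then
        echo "rockchip-blob"
    elif grep -q 'lpddr4_set_rate: change freq' "$1"; then
        echo "u-boot-tpl"
    else
        echo "unknown"
    fi
}

# Usage: ddr_init_kind serial-console.log
```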

blmhemu commented 1 year ago

UPDATE (Again):

Setting BOOT_SCENARIO=tpl-blob-atf-mainline in config/board/helios64.csc and:

I have compiled armbian with the above option and flashed it. I could NOT reproduce the free issue now. 🥳 🥳 🥳 🥳 I also see the ddrbin logs in the serial console.

Maybe we found the root cause? (u-boot TPL) Link to serial console logs - https://pastebin.com/zvhsyF2R


blmhemu commented 1 year ago

UPDATE 3: I was able to boot fedora !!! Using the above u-boot and idbloader and following the steps at https://fedoraproject.org/wiki/Architectures/ARM/Installation

Ran the python loop for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done and no free error.


prahal commented 1 year ago

@blmhemu about the DDR frequencies: I added the ddrbin frequencies to the blobless u-boot (keeping all other DDR parameters the same, which is probably not fine) and forced them. Still the same issue (I should post the hack so that others can reproduce this, but I am away for a few weeks). Changing the frequencies is not enough. At the least, one has to tweak the DDR parameters in the u-boot lpddr4 inc files, but those are pretty cryptic to me. And it seems we do not have these parameters from Rockchip (I believe they are in the DDR binary blob). Or maybe it is just that the DDR blob does two trainings, one before setting the frequency to 416MHz and one after, with an added advanced training afterwards.

Mind that v2023.04 has a fix to do the training at 400MHz instead of 50MHz, but this did not help with our issue.

About the SATA/NVMe: maybe look at the Kobol wiki, probably in the comments; I am confident this was answered. I also made a u-boot 2023.04 build that seems to have pretty good support for SATA, but it requires migrating to the new APIs (bootflow, bootdev, bootmeth). I wanted to spend time sharing this hack of a build, but it turned out it did not help with the DDR stability issue, so it became lower priority. I believe instructions to achieve SATA boot are already available in the Kobol wiki, though. If not, tell me and I will try to share my u-boot v2023.04 build for the Helios64. Mind that this build has an issue where it can boot loop in u-boot (I managed to stop the loop but have not investigated the cause yet), so it is pretty experimental. And I believe u-boot v2022.10 was the version that was not buildable, as it had partially migrated to the binman binary build while still being half Makefile based, so I was not able to build both the idbloader.img and u-boot.itb binaries. All in all, I attempted those to try the new DDR-related fixes in these versions, which ended up not being related to this invalid free bug. Either way, you have to keep u-boot on the eMMC and set it up to boot the kernel from the SATA drive (mind that the M.2 slot on the Helios64 is SATA, not PCIe).

About the eth0 error "Net: dw_dm_mdio_init": I always thought it had always been so. I will take a look to see if I can get this working, but not soon, I believe (unless it turns out to be an easy catch). Do you know with which u-boot it was working?

blmhemu commented 1 year ago

Do you know with which u-boot was it working?

2020.10 - https://pastebin.com/MmtpS7F9

blmhemu commented 1 year ago
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
dw_dm_mdio_init: mdio node is missing, registering legacy mdio bus
No ethernet found.

I have upgraded the system (apt update && apt upgrade) and could not boot now.

prahal commented 1 year ago

I have upgraded the system (apt update && apt upgrade) and could not boot now.

Do you mean u-boot load the kernel then nothing or an error on the serial console?

By the way, this looks like another issue, and to avoid this thread becoming unreadable I guess it requires a thread of its own on the Armbian forum. Feel free to tag me in your forum thread so I get a notice by email.

blmhemu commented 1 year ago

I have upgraded the system (apt update && apt upgrade) and could not boot now.

If anyone encounters this - it is due to the Armbian-provided linux-libc-dev (use the Debian one instead by giving a lower priority to the Armbian repo).
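One way to lower the priority of the Armbian repository for this single package is an APT preferences file. The sketch below writes such a file; the origin host apt.armbian.com is an assumption (use whatever host appears in your sources list), the function name and file name are made up, and the priority value 100 is just one choice below APT's default of 500.

```shell
#!/bin/sh
# Hypothetical helper: write an APT pin that makes linux-libc-dev from the
# given origin less preferred than the same package from other repositories.
# $1: output file, e.g. /etc/apt/preferences.d/linux-libc-dev
# $2: repository host to de-prioritise (assumed default: apt.armbian.com)
write_libc_dev_pin() {
    file=$1
    origin=${2:-apt.armbian.com}
    cat > "$file" <<EOF
Package: linux-libc-dev
Pin: origin $origin
Pin-Priority: 100
EOF
}

# Usage (as root), then inspect the result:
#   write_libc_dev_pin /etc/apt/preferences.d/linux-libc-dev
#   apt-cache policy linux-libc-dev
```

If the Armbian version is already installed and newer, APT will not downgrade it on its own with this priority, so installing the Debian version explicitly may still be needed.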

prahal commented 1 year ago

I have upgraded the system (apt update && apt upgrade) and could not boot now.

If anyone encounters this - it is due to the Armbian-provided linux-libc-dev (use the Debian one instead by giving a lower priority to the Armbian repo).

Thanks for the follow-up and workaround. Feel free to open another bug report to track this issue! It might even be an armbian rockchip64 family issue recently introduced.

I am currently trying a few ideas, as I was able to reproduce the raid10 resync crash of the kernel on the Helios64, which I have had randomly since I received the unit. I will try to sort out which of the ideas are useless against this issue. I even had hints that it could be related to the HDD firmware above the SATA/PCIe bridge (the rk3399 PCIe is known to have bugs, but I suspect that, at the very least, it is not the known issue which affects PCIe devices that are slow to enumerate). Or it could be another DDR memory corruption that the mdadm raid10 resync stresses, the resync being the only test case that reliably reproduces the crashes. At least my old issue predates our current one, which requires the Rockchip DDR blob to avoid memory corruption from python3, as the initial u-boot from Kobol already had this Rockchip DDR blob. I believe I could work around this crasher, but I would really like to find the cause of this issue (even if hardware related). In the meantime, I cannot boot my helios64.

snakekick commented 1 year ago

Hi there, I have the same (free(): invalid pointer) problem. I notice this after upgrading my helios64 from debian 11 to 12. I also have kernel problems when I run snapraid sync.

This problem is solved when I add cpufreq.off=1, but then the CPU is really slow. Is it possible to share the new, working Armbian u-boot package? Thanks. My current u-boot log: https://pastebin.pl/view/80f4c9e7

jmue commented 1 year ago

@prahal : Is there anything wrong with opening a pull request until a better solution is found?

snakekick commented 1 year ago

@prahal Thank you! Your fix solved my helios64 problem. Best of all, I can now run my Helios at full speed, 400>1800MHz on demand. This was not possible before, and it looks very stable (as far as I can say 12h later). After installing your fix, I am able to run for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done 6 or more times and start a snapraid sync that crashed before. Thank you very much. //edit :

Rejoiced too soon!


kernel:[47341.023705] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP

Message from syslogd@helios64 at Aug 22 10:59:53 ... kernel:[47341.023705] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP

Message from syslogd@helios64 at Aug 22 10:59:53 ... kernel:[47341.045273] Code: aa1c03e0 93407c62 2a0803e1 9400819e (a9408261)

prahal commented 1 year ago

@snakekick indeed, the kernel crashes are not fixed. I do not know if I can sort out this instability issue on my side. I have not worked much on it for a month due to life, and will probably not for a while.

It seems to me this is a memory corruption. It affects random kernel code.

I have left the helios64 down for a while, as I have a way to reproduce the kernel corruption fast: boot when the raid 10 has had a bad crash and is healing at boot.

It might still be memory related. I was even able to reproduce the crash with cpufreq turned off, but way less often. It seems cpufreq raises the risk of the bug triggering but is not the cause.

We should discuss this matter in the forum, though, as I believe it is not the same as the current issue, for which we have a workaround. I believe we should push this fix to the Armbian repo, but I cannot do so soon.

d3473r commented 1 year ago

Hi @prahal and @snakekick, I'm also running a Helios64 which freezes randomly every few days now :/ Can you please explain how to compile and flash the fixed uboot?

bcecchinato commented 1 year ago

@d3473r I've downgraded the bootloader as well, so far no more freezes. here is the way to do it :

cd /tmp
wget --content-disposition https://imola.armbian.com/apt/pool/main/l/linux-u-boot-helios64-edge/linux-u-boot-edge-helios64_22.02.1_arm64.deb
dpkg -x linux-u-boot-edge-helios64_22.02.1_arm64.deb linux-u-boot-edge-helios64_22.02.1_arm64/
vi /usr/lib/u-boot/platform_install.sh

While in the /usr/lib/u-boot/platform_install.sh, copy/paste the first line and change it to the new directory :

#DIR=/usr/lib/linux-u-boot-current-helios64
DIR=/tmp/linux-u-boot-edge-helios64_22.02.1_arm64/usr/lib/linux-u-boot-edge-helios64_22.02.1_arm64

Then launch armbian-install to update the bootloader.

I strongly suggest dumping your current bootloader first, just in case: dd if=/dev/mmcblk0 of=bootloader-backup.img bs=512 count=65535. Should the update fail, you might brick your device.

d3473r commented 1 year ago

Hi @bcecchinato, I installed the bootloader with your instructions. /dev/mmcblk0 didn't exist on my machine, so I dumped /dev/mmcblk1.

It ran for a while after a reboot but eventually froze again after a few hours :(

bcecchinato commented 1 year ago

@d3473r yep it crashed on my side this morning as well :( Depending on which storage you are (emmc/sd card), the /dev/mmcblk will change indeed.
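Since the device number differs between setups, it can help to list the MMC block devices together with their card type before dumping or writing anything. The sketch below reads the type attribute that the Linux MMC subsystem exposes in sysfs ("MMC" for eMMC, "SD" for SD cards); the function name is made up, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Hypothetical helper: list whole-disk MMC block devices with their card type.
# $1: sysfs block directory (defaults to /sys/block)
list_mmc() {
    sys=${1:-/sys/block}
    for d in "$sys"/mmcblk*; do
        case ${d##*/} in
            *boot*|*rpmb*) continue ;;   # skip eMMC boot and RPMB areas
        esac
        [ -r "$d/device/type" ] || continue
        printf '/dev/%s %s\n' "${d##*/}" "$(cat "$d/device/type")"
    done
}

# Usage: list_mmc
```

The line marked MMC is the eMMC; the one marked SD is the SD card.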

I'm trying with another bootloader here : wget --content-disposition https://imola.armbian.com/apt/pool/main/l/linux-u-boot-helios64-current/linux-u-boot-current-helios64_21.08.9_arm64.deb.

Since I don't really know what this changes, maybe this attempt is completely useless :D Unfortunately, I'm not an expert with armbian, bootloaders, etc. I can run some tests if other users in this topic want, however.

d3473r commented 1 year ago

Are you running Debian 11 or 12? I'm on Debian 11 with OMV 6

bcecchinato commented 1 year ago

I'm on Debian 12, the free issue started with bookworm, no issues with Bullseye and the latest bootloader (but I can't say if it was uptodate or not).

prahal commented 1 year ago

@d3473r the free issue is not the same as the freeze one. What I mean is that you can fix the free issue but still have the freezes as I do.

Still, I am chasing the freeze issue too. I have had the helios64 down for weeks since it is in a state where I can reproduce the freeze, that is, raid10 resyncing at boot.

I would like to have a bug report to centralize the freeze issue reports. As of now, they are scattered in various threads on the Armbian forum. Maybe you could open a new one there and give the link here?

Note that I have freezes since I got the helios64. I have not changed my setup much since then (raid10 with WD Red drives). Could you elaborate on when this started for you?

At one point I blamed the rk3399 PCIe ... but I am unsure now. Mind that my raid10 stresses the PCIe in the SoC and the SATA controller. Or it could be memory timings. Still not diagnosed, but I am learning in the process (like the role of the ATF firmware).

So having other setup details could help this. Especially what the setups that were or are working are like.

bcecchinato commented 1 year ago

@prahal I don't know if the free and freezes are related, but since my downgrade of the bootloader to 21.08.9 I haven't encountered either the free error or freezes. I'm running on a uSD card, with the bootloader installed on the uSD card as well (the eMMC is completely blank; I've dd'ed zeroes to it to be sure not to boot from it).

My case is a bit different, I had a uSD on debian bullseye, and made a fresh install on a second uSD card with bookworm. The troubles started from this point. I haven't deleted the old card, I can make some diffs between each in case this might help. Both cards have the same boot loader version (the 23.08.1 version), but I can't say if the bootloader written on the old uSD is 23.08.1 or 21.08.9.

I wish I could help more, but like you, I've no skills on bootloaders :(

The only sure thing is : bookworm with 21.08.9 bootloader is working fine and has no free issues at all.

d3473r commented 1 year ago

@prahal I have a pretty good understanding of when the freezes started, but not why. I have to investigate the system log.

My helios64 is used as a Timemachine backup target, and the backups started failing since the beginning of September. These Backups ran for over a year (since August 2022) without any freeze.

I'm certain about this as the root filesystem is encrypted and any freeze or reboot would have forced me to unlock the root fs via ssh to boot the helios64 up again.

So my guess is: I updated something at the beginning of September (presumably kernel updates; I have not done a dist upgrade) and since then the freezes have been occurring.

prahal commented 1 year ago

@d3473r you have a history of the upgrades in /var/log/apt/history.log<.n.gz>.
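To pull the upgrade entries out of both the current and the rotated, compressed history files in one pass, something like the following sketch could be used. The function name is made up, and it assumes the usual APT history layout with Start-Date: and Upgrade: lines.

```shell
#!/bin/sh
# Hypothetical helper: print the date and upgraded packages of each APT
# transaction from plain and gzip-compressed history files.
apt_upgrades() {
    for f in "$@"; do
        case $f in
            *.gz) gzip -dc "$f" ;;
            *) cat "$f" ;;
        esac
    done | grep -E '^(Start-Date|Upgrade):'
}

# Usage: apt_upgrades /var/log/apt/history.log*
```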

Note that knowing the previous working versions is even more interesting than the new broken one.

Also, it could be the new version is only more efficient and thus stresses the hardware more (or even enables a new hardware component).

When you say they ran over a year without a freeze, you mean there were also freezes beforehand. Were they rare before that time?

I bet you never upgraded the bootloader before you did recently. Do you know from which image you installed the EMMC or SD card initially? One might be able to guess the older bootloader from that. If you have a log of the previous u-boot output on boot that would tell but it is unlikely you have one stored.

Also, it could be that the load on the hardware changed over time, and even without any upgrade you would have ended up with this freeze. Could you tell me your storage layout (FS, LUKS, raid or not, which raid, brand and model of the hard drives, and maybe the smartctl output for them, i.e. the firmware version - probably smartctl -a /dev/sd<x> for each drive)?

Also, do you have small static discharges when touching the helios64 enclosure? I am pretty sure this is unrelated nowadays, but who knows (I have them when my helios64 power adapter is close to my UPS and a set of other chargers; I have not sorted out which one yet).

d3473r commented 9 months ago

Hi @prahal, I did a completely new installation with kernel: Linux helios64 6.1.63-current-rockchip64 after this fix: https://github.com/armbian/build/pull/6066 The helios64 NAS is now stable for one straight week, so this issue seems fixed :) Thank you so much