balena-os / balena-radxa

https://www.balena.io/os/
Apache License 2.0

Release rockpi-4b-rk3399 to production with only automated tests #129

Open jellyfish-bot opened 2 years ago

jellyfish-bot commented 2 years ago

[klutchell] undefined

jellyfish-bot commented 2 years ago

@acostach sorry for all the pings, but this week I'm trying to go through each of the devices and see what's blocking them -

I've just tried to run tests on the rockpi testbot, and the DUT supposedly isn't powering on:

Sep 13 08:20:13 2a42142 f2f7283b67c2[1604]: Booting DUT with the balenaOS flasher image
Sep 13 08:20:13 2a42142 f2f7283b67c2[1604]: waiting for DUT to be on
Sep 13 08:20:13 2a42142 f2f7283b67c2[1604]: 
Sep 13 08:20:13 2a42142 f2f7283b67c2[1604]: DUT is currently Off
Sep 13 08:20:18 2a42142 f2f7283b67c2[1604]: waiting for DUT to be on
Sep 13 08:20:18 2a42142 f2f7283b67c2[1604]: 
Sep 13 08:20:18 2a42142 f2f7283b67c2[1604]: DUT is currently Off
Sep 13 08:20:23 2a42142 f2f7283b67c2[1604]: waiting for DUT to be on
Sep 13 08:20:23 2a42142 f2f7283b67c2[1604]: 
Sep 13 08:20:23 2a42142 f2f7283b67c2[1604]: DUT is currently Off
Sep 13 08:20:28 2a42142 f2f7283b67c2[1604]: waiting for DUT to be on
Sep 13 08:20:28 2a42142 f2f7283b67c2[1604]: 
Sep 13 08:20:28 2a42142 f2f7283b67c2[1604]: DUT is currently Off
Sep 13 08:20:33 2a42142 f2f7283b67c2[1604]: waiting for DUT to be on

This is either because:

jellyfish-bot commented 2 years ago

[acostach] @rcooke-warwick checking now, the device has the green light on, which means it's powered, but the ethernet LEDs are off, so it's not booting. Could be due to the mux/sd-card. Can I remove it from the testbot and try to boot it with the already flashed sd-card in the mux?

jellyfish-bot commented 2 years ago

@rcooke-warwick looks like it might be because of the voltage, I will make a PR to increase it from 5V to 12V. The Radxa website says that it works with 5V but that this may cause stability issues once the load rises, and that is what I see locally now: it powers off during flashing with 5V but not with 12V.

jellyfish-bot commented 2 years ago

nice find @acostach sounds like a good idea to increase it

jellyfish-bot commented 2 years ago

[acostach] done, I will merge https://github.com/balena-io-hardware/testbotsdk/pull/48 once checks pass and then update leviathan worker, after that we can test again.

jellyfish-bot commented 2 years ago

@rcooke-warwick I updated the testbotsdk to increase the voltage and also leviathan-worker, where I merged https://github.com/balena-os/leviathan-worker/pull/26

jellyfish-bot commented 2 years ago

[acostach] 1) Is answered, the rig app updated already

jellyfish-bot commented 2 years ago

@acostach nice one, the rig did update a while back and I retried the rockpi job, it's flashing at the moment, will report back here with the result

jellyfish-bot commented 2 years ago

@acostach update, unfortunately it hasn't worked - 12V is coming out of the testbot but we're still getting:

Sep 13 11:03:40 2a42142 eda46eb94756[1604]: DUT is currently Off
Sep 13 11:03:45 2a42142 eda46eb94756[1604]: waiting for DUT to be on

jellyfish-bot commented 2 years ago

[rcooke-warwick] now I'm wondering if there's something going wrong with the detection of whether the DUT is on or off ...
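
One way to separate a detection problem from a real power problem is to poll with a hard deadline instead of waiting indefinitely. The following is only a minimal Python sketch under stated assumptions: `read_state` is a hypothetical callable standing in for whatever the testbot actually uses to sense DUT power (that mechanism is not shown in this thread), and the 5-second interval simply mirrors the spacing of the log lines above.

```python
import time
from typing import Callable


def wait_for_dut_on(
    read_state: Callable[[], bool],
    interval_s: float = 5.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll read_state until it reports on, or give up after timeout_s.

    read_state is a hypothetical stand-in for the real power-detection
    call. Returns True if the DUT was seen on, False on timeout.
    """
    waited = 0.0
    while True:
        if read_state():
            return True
        print("DUT is currently Off")
        if waited >= timeout_s:
            # Stop here so the job fails fast instead of looping forever.
            return False
        print("waiting for DUT to be on")
        sleep(interval_s)
        waited += interval_s
```

If this returns False while a multimeter shows 12V at the DUT, the detection path is the more likely suspect than the power supply.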

jellyfish-bot commented 2 years ago

[rcooke-warwick] does the rockpi have ethernet?

jellyfish-bot commented 2 years ago

[rcooke-warwick] (I remember the rockpi flashing has worked before, but maybe I'm remembering wrong)

jellyfish-bot commented 2 years ago

@rcooke-warwick I plugged and unplugged the cable, it should flash the device once again

jellyfish-bot commented 2 years ago

[acostach] I recall we did run tests on this DT before and they were running, provisioning worked

jellyfish-bot commented 2 years ago

[acostach] it's not powering off, is the test running normally?

jellyfish-bot commented 2 years ago

[rcooke-warwick] which cable did you unplug/replug?

jellyfish-bot commented 2 years ago

[acostach] ethernet and usbc

jellyfish-bot commented 2 years ago

usb-c is the power cable

jellyfish-bot commented 2 years ago

[rcooke-warwick] hmm it's just staying "on" now

jellyfish-bot commented 2 years ago

@acostach yep, the device names cross wires - I did realize and eventually created and linked a new ticket. Sorry for the noise.

jellyfish-bot commented 2 years ago

want me to plug and unplug the power cable @rcooke-warwick ? That would trigger the re-flashing IF the sd-card is switched to DUT

jellyfish-bot commented 2 years ago

I already turned the DUT off then on again to try to achieve that @acostach

jellyfish-bot commented 2 years ago

[rcooke-warwick] the DUT stayed on forever - so for some reason it isn't internally flashing the DUT

jellyfish-bot commented 2 years ago

[rcooke-warwick] the device is currently in this state, the test job is still running

jellyfish-bot commented 2 years ago

[rcooke-warwick] retrying with fresh slate

jellyfish-bot commented 2 years ago

[acostach] ok. if it still doesn't work let me know and I'll hook up the serial cable from my PC to the device and kick the suite, see where it hangs

jellyfish-bot commented 2 years ago

@acostach rockpi seems to be flashing now: https://jenkins.product-os.io/job/leviathan-v2-template/4673/console - I've flashed it 3 times in a row from a local test job, and now this jenkins one is running - maybe there was just something loose that got fixed when you unplugged and replugged

jellyfish-bot commented 2 years ago

very possible @rcooke-warwick , good thing it's working now, thanks for letting me know as I was just going to connect the serial and restart it

jellyfish-bot commented 2 years ago

[rcooke-warwick] @acostach it has been consistently flashing every time last night and this morning. Now we move on to the problem of tests failing. First roadblock is the test here: https://github.com/balena-os/meta-balena/blob/master/tests/suites/os/tests/chrony/index.js#L157

Which has failed both times I've tried it. I'm not that familiar with this test, but here's what I get from it:

just running it again now to get the journal logs to see why that might be happening.

jellyfish-bot commented 2 years ago

[rcooke-warwick] frustratingly, that test has now passed...

I think it is linked to this issue: https://github.com/balena-os/meta-balena/issues/2758

From what I've seen, in the case of failure, chronyc is started with some sort of wrong permissions:

Sep 14 08:08:44 b0105db healthdog[5665]: 2022-09-14T08:08:44Z Wrong permissions on /run/chrony
Sep 14 08:08:44 b0105db healthdog[5665]: 2022-09-14T08:08:44Z Disabled command socket /run/chrony/chronyd.sock
Sep 14 08:08:44 b0105db healthdog[5665]: 2022-09-14T08:08:44Z Running with root privileges
Sep 14 08:08:44 b0105db healthdog[5665]: 2022-09-14T08:08:44Z Frequency 0.000 +/- 1000000.000 ppm read from /var/lib/chrony/drift
Sep 14 08:08:44 b0105db healthdog[5668]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Sep 14 08:08:44 b0105db healthdog[5668]: [chrony-healthcheck][ERROR] Failed to trigger NTP sync

In the case of the test passing, I never see that message about a disabled command socket
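
To tell the two cases apart across many runs, the saved journal output can be scanned for the messages that only show up in the failing excerpt above. This is a rough Python sketch, not part of the test suite: the function name is made up for illustration, and the marker strings are copied from the log lines quoted in this comment.

```python
from typing import List

# Messages taken from the failing journal excerpt above.
FAILURE_MARKERS = (
    "Wrong permissions on /run/chrony",
    "Disabled command socket",
)


def find_chrony_socket_failures(journal_text: str) -> List[str]:
    """Return the journal lines that contain one of the failure markers.

    An empty list means the run does not show the signature seen in the
    failing case; it does not prove that chrony is healthy.
    """
    return [
        line
        for line in journal_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
```

Running it over the journal of each test attempt would show whether every chrony test failure coincides with the disabled command socket, or whether the two are unrelated.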

cc @alexgg @jakogut

jellyfish-bot commented 2 years ago

[rcooke-warwick] on a side note, does anyone know if the rockpi led flashing works? I saw this issue: https://github.com/balena-os/balena-radxa/issues/10 and also I checked supervisor.conf on the rockpi and it has LED_FILE=/dev/null#

acostach commented 2 years ago

@rcooke-warwick the LED is not implemented for the radxa-zero nor rockpi4b so this test can be skipped

jellyfish-bot commented 2 years ago

^ @acostach @floion added this finding to this issue ^ https://github.com/balena-os/balena-radxa/issues/10 -- this is currently causing the rockpi4b to fail the OS test suite. Does this device have an LED? The contract says it does

jellyfish-bot commented 2 years ago

@rcooke-warwick do we have a mechanism to mark a test as not mandatory on a per DT basis?

jellyfish-bot commented 2 years ago

@acostach the test runs because in the contract for the rockpi , it is set to LED: true - should I make the PR for the contract to set this to false?
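
For context, the gating described here can be pictured roughly like this. It is a minimal Python sketch, not the actual suite code: the `{"data": {"led": ...}}` layout and the function name are assumptions for illustration, and the real contract schema may differ.

```python
from typing import Any, Mapping


def should_run_led_test(contract: Mapping[str, Any]) -> bool:
    """Run the LED test only when the contract explicitly declares an LED.

    The {"data": {"led": ...}} layout is an assumption for illustration.
    A missing or false flag means the test is skipped.
    """
    return contract.get("data", {}).get("led") is True
```

With a check of this shape, flipping the flag to false in the contract is enough to skip the LED test for that device type without touching the test itself.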

jellyfish-bot commented 2 years ago

@rcooke-warwick yes, let's set it to false. Related thread https://jel.ly.fish/issue-release-rockpi-4b-rk3399-production-automated-tests-6594ff9

jellyfish-bot commented 2 years ago

I pushed the PR and it should merge soon @rcooke-warwick https://github.com/balena-io/contracts/pull/326

jellyfish-bot commented 2 years ago

done, it's merged in the contracts @rcooke-warwick

jellyfish-bot commented 2 years ago

thanks @acostach I'll now bump contracts in leviathan, which will autobump in meta-balena, and eventually it will reach the rockpi repo ;P

jellyfish-bot commented 2 years ago

@acostach @alexgg looks like rockpi4b can now pass the entire test suite: https://jenkins.product-os.io/job/leviathan-v2-template/5140/

Although it looks like the tests won't run on balena-radxa PRs. I'll fix that and then technically we can use Alex's workflow to autodeploy for rockpi if tests pass

jellyfish-bot commented 2 years ago

[rcooke-warwick] I can see you've added the rockpro64 into the rig - does this device successfully flash with the testbot?

jellyfish-bot commented 2 years ago

[rcooke-warwick] was balena-radxa called balena-rockpi until recently?

acostach commented 2 years ago

@rcooke-warwick yes, looks like it was renamed from balena-rockpi to balena-radxa. Regarding the rockpro64, it was added to the rig but it didn't get to the flashing step yet, the leviathan job stops during initialization https://jenkins.product-os.io/job/leviathan-v2-template/5148/console

Some more ethernet switches and cables have been ordered and are on their way here, currently the RockPro64 in the rig is not connected via ethernet.