PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

aarch64 distribution #8655

Open nvisser opened 4 years ago

nvisser commented 4 years ago

Short description

Currently there appears to be no binary distribution of dnsdist for the aarch64 platform.

Use case

Running dnsdist on modern ARM aarch64 based hardware.

Description

Modern ARM based platforms are based on aarch64 so having the ability to use dnsdist (or any other powerdns program, really) on these platforms without having to spend a long time compiling or fiddling with cross-compiling would be ideal.

Habbie commented 4 years ago

#4663 requests this for Debian; closing that one, as this ticket asks for it more widely.

Are there specific distributions for which you'd like this?

dkowis commented 4 years ago

I'd like to see it for Armbian. Armbian is super useful when installing to Pine64 boards. It'd also be handy for building an aarch64 Docker container for similar purposes.

kpfleming commented 4 years ago

I'd use this too, I've switched my Raspberry Pis over to plain Debian and the aarch64 kernel (arm64 Debian package flavor). If there's anything I can do to help with this let me know.

Habbie commented 4 years ago

I understand that https://drone.io/ has arm64 CI runners. Perhaps it makes sense to delegate arm64 builds for a few distributions to them, since they have real ARM64 hardware and we do not; our current ARM builds (like the Raspbian builds we do today) run under QEMU, which is very slow.

I'd welcome a PR for doing this with Drone as an experiment :)

kpfleming commented 4 years ago

Travis-CI has aarch64 support available in beta form: https://docs.travis-ci.com/user/multi-cpu-architectures/

Using that would be far easier than trying to use any other CI platform, given how much work has gone into build-travis.sh.

arthurpetitpierre commented 4 years ago

Travis-CI can now build on Graviton2/arm64 instances: https://blog.travis-ci.com/2020-09-11-arm-on-aws There's a program to get credits for open-source projects on AWS: https://pages.awscloud.com/AWS-Credits-for-Open-Source-Projects , supporting such a build pipeline would be a perfect use-case.

If a PR would help to get that bootstrapped, please let me know.

kpfleming commented 4 years ago

I took a look at the pipeline for building packages on Travis-CI, and it's definitely non-trivial. Most importantly, it assumes a single architecture, so adding a second one will require teaching the build matrix which target distributions are supported on each architecture. At a minimum, CentOS 6 and Ubuntu 16.04 are unlikely to be usable on aarch64 (or even desired).

pieterlexis commented 4 years ago

I took a look at the pipeline for building packages on Travis-CI, and it's definitely non-trivial. Most importantly, it assumes a single architecture, so adding a second one will require teaching the build matrix which target distributions are supported on each architecture.

We are planning to look at GH Actions (no promises though); perhaps that might make things easier.

kpfleming commented 4 years ago

I don't believe GitHub Actions offers native aarch64 builds at this time; it can be done via QEMU, but Travis-CI offers native builds.

kpfleming commented 4 years ago

I've taken a look at the build process, and it seems relatively understandable (although it's fairly complex!). The one thing I can't seem to find is the script which actually iterates over all the targets in builder-support/Dockerfiles and executes the builds (and then extracts the resulting packages from the images). This script will need to understand the list of targets which are available on each architecture.

Habbie commented 4 years ago

If you init & update git submodules, builder/build.sh appears. It is called like builder/build.sh centos-7 or builder/build.sh -m authoritative debian-buster etc.

It probably makes sense to hardcode the list of aarch64 targets (probably way shorter than our full list) instead of trying to iterate over a directory listing.

kpfleming commented 4 years ago

Yep, I found that part; I was just wondering what actually invokes build.sh for each of the available targets :-) It doesn't seem to be in the .travis.yml or .circleci configurations. The tool that does the iterating is the one that will have to be taught which distro/arch combinations are valid.

Habbie commented 4 years ago

We don't build packages on Travis or CircleCI currently. The only place -we- call build.sh is in the configs for https://builder.powerdns.com/, so it makes sense that you could not find that :)

Habbie commented 4 years ago

In other words, you'll have to write that five-line shell script.

kpfleming commented 4 years ago

Got it... then I can propose a new script which is aware of the host architecture and chooses the targets to build for it, and you can decide how to integrate that into the real build process.

Habbie commented 4 years ago

Yes - I see two open questions there: (1) where do we put the packages after Travis has built them, and (2) do we trust Travis enough to make these packages 'official' and sign them with a pdns key?

Habbie commented 4 years ago

By the way, we have no need (or use) for a script around build.sh for amd64, because we list those targets in our buildbot config already. So feel free to underengineer the script.
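
Such an under-engineered wrapper could be as small as this hypothetical sketch, which invokes build.sh in the way shown earlier in the thread. The target names here are illustrative, not the final aarch64 list, and the script is shown as a dry run (remove the echo to actually build):

```shell
#!/bin/sh
# Hypothetical wrapper around builder/build.sh: one invocation per
# aarch64-capable target. The target list is illustrative only.
set -e
for target in centos-8 debian-buster ubuntu-focal; do
    echo builder/build.sh -m authoritative "$target"
done
```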

kpfleming commented 4 years ago

Another option is to leverage the "AWS for Open Source" link above and get AWS aarch64 compute resources that builder.powerdns.com can use.

In either case I can get to the point where I can prove that the existing build processes run properly on an aarch64 machine and produce usable packages for most of the distros that are in the list today.

Habbie commented 4 years ago

Oh! I missed that comment! Indeed that would also make sense, but we wouldn't get to it soon. Getting packages out of Travis would still be a great start.

kpfleming commented 4 years ago

Well, at least some simple testing produced good results:

The build ran to completion with no errors visible. One small issue: the default build uses only one CPU, which is somewhat annoying when you are running builds manually :-) Adding an appropriate DEB_BUILD_OPTIONS in build-debs.sh solves that problem, although based on the debhelper documentation that should not be necessary... This could be an issue if the jobs are run in Travis-CI, because they'll use more wall-clock time than necessary and may hit the maximum time limit.

Unsurprisingly, builder/build.sh -m recursor raspbian-buster also works just fine, and of course no QEMU is required.

Habbie commented 4 years ago

Unsurprisingly, builder/build.sh -m recursor raspbian-buster also works just fine, and of course no QEMU is required.

That is not entirely unsurprising! When we tested on aarch64 last year, we had a box with -no- arm 32 bit support. So this is excellent news!

Habbie commented 4 years ago

One small issue: the default build uses only one CPU, which is somewhat annoying when you are running builds manually

I noticed the same last year, but I did not dig in to find the -right- solution.

kpfleming commented 4 years ago

One small issue: the default build uses only one CPU, which is somewhat annoying when you are running builds manually

I noticed the same last year, but I did not dig in to find the -right- solution.

My hack of a fix was to set DEB_BUILD_OPTIONS='parallel=4' in the script line which calls fakeroot. Clearly this is suboptimal: it should be configurable, or at least default to the number of CPU cores on the build machine. Based on the debhelper documentation it shouldn't even be necessary: if debian/compat is set to 10 or higher and the debhelper version requirement in debian/control is 10 or higher (both are true for recursor and dnsdist, but not yet for authoritative), then parallel building is supposed to be the default.

I've got some debhelper-knowledgeable colleagues at $dayjob so I'll ask them for guidance on that front.

kpfleming commented 4 years ago

First build failure: building the authoritative packages from the 4.3.0 tag produced some test failures.

testrunner: ../ext/luawrapper/include/LuaContext.hpp:107: LuaContext::LuaContext(bool)::<lambda(lua_State*)>: Assertion `false && "lua_atpanic triggered"' failed.
unknown location(0): fatal error: in "lua_auth4_cc/test_prequery": signal: SIGABRT (application abort requested)
test-lua_auth4_cc.cc(20): last checkpoint: "test_prequery" test entry
testrunner: ../ext/luawrapper/include/LuaContext.hpp:107: LuaContext::LuaContext(bool)::<lambda(lua_State*)>: Assertion `false && "lua_atpanic triggered"' failed.
unknown location(0): fatal error: in "lua_auth4_cc/test_updatePolicy": signal: SIGABRT (application abort requested)
test-lua_auth4_cc.cc(47): last checkpoint: "test_updatePolicy" test entry

Habbie commented 4 years ago

ah yes, luajit is broken on aarch64. We have workarounds in https://github.com/PowerDNS/pdns/pull/6512 but they are not acceptable for general consumption (i.e. they might create slowdowns for other architectures).

Cleanest would probably be to build against lua 5.3 instead.

kpfleming commented 4 years ago

Confirmed; switching to liblua5.3 allows the build to complete and the tests to pass. This means we'll end up having different Debian configuration files (at least) for amd64 and aarch64 I suppose.

I've also apparently succeeded in getting parallel builds to work using the documented mechanism (at least for versions of Debian which support debhelper 10.x and higher), but I'll not yet claim success there until I've tested it with dnsdist and recursor too :-)

Habbie commented 4 years ago

This means we'll end up having different Debian configuration files (at least) for amd64 and aarch64 I suppose.

I'm sure we can do something more clever than that :)

kpfleming commented 4 years ago

This means we'll end up having different Debian configuration files (at least) for amd64 and aarch64 I suppose.

I'm sure we can do something more clever than that :)

Indeed, I've got this working now, where luajit is used for amd64, and lua5.3 is used for non-amd64. This could be changed to use luajit on non-arm64, and lua5.3 on arm64, quite easily.
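
One conventional way to express that split in Debian packaging is an architecture-qualified Build-Depends in debian/control. This fragment is a hypothetical sketch of the idea, not the actual pdns packaging:

```
Build-Depends: debhelper (>= 10),
               libluajit-5.1-dev [amd64],
               liblua5.3-dev [!amd64]
```

The bracketed architecture qualifiers make dpkg pick luajit only on amd64 and plain Lua 5.3 everywhere else, so a single control file covers both cases.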

kpfleming commented 4 years ago

Current status:

Two test machines -

With a small set of changes in the builder-support tree, these are the results of builds for various distributions.

I did not test any older distros because they are either past their EoL or do not have arm64 support.

At this point the only distro where arm64 fails but amd64 succeeds is CentOS 7, so I'll try to figure out the cause of that. After that I'll send a PR with the various changes to the builder-support tree.

rgacogne commented 4 years ago

fails on arm64 with an error about finding Boost context library

When boost::context is not available or usable we are supposed to fall back to ucontext, so we likely have a detection issue here.

Habbie commented 4 years ago

Amazon Linux 2 - fails on both

You can ignore this one.

zeha commented 4 years ago

Debian + Raspbian stretch can probably also go away from the list, as they are more or less EoL.

zeha commented 4 years ago

fails on arm64 with an error about finding Boost context library

When boost::context is not available or usable we are supposed to fall back to ucontext, so we likely have a detection issue here.

As @Habbie pointed out to me, dnsdist doesn't actually use context. I don't see where its configure would check for boost::context either.

kpfleming commented 4 years ago

This build failure is for recursor, not dnsdist. I'm building all three in these tests, not just dnsdist.

kpfleming commented 4 years ago

OK... here's the issue. With the version of Boost in CentOS 7, boost::context is installed and is a version sufficiently high for the configure.ac script to want to use it, but building a program on aarch64 results in an error that the 'platform is not supported'. This happens too late for the configure script to fall back to ucontexts.

kpfleming commented 4 years ago

We could identify the version of Boost where aarch64 is supported in boost::context and increase the minimum version required in the configure script.

kpfleming commented 4 years ago

It looks like aarch64 support in boost::context was added in Boost 1.61, which was released more than four years ago. I'll test with the requirement changed to 1.61 to force a fallback to ucontext unless Boost is at least that version.
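
Boost encodes its version numerically as BOOST_VERSION = major * 100000 + minor * 100 + patch, so 1.61 is 106100. A minimal sketch of that comparison follows; the variable name is illustrative, and the real check would live in configure.ac:

```shell
#!/bin/sh
# Sketch: fall back to ucontext unless Boost is at least 1.61 (106100),
# the first release with aarch64 support in boost::context.
boost_version=105300   # e.g. the Boost 1.53 shipped by CentOS 7
if [ "$boost_version" -ge 106100 ]; then
    echo "using boost::context"
else
    echo "falling back to ucontext"
fi
```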

zeha commented 4 years ago

1.61 appears right; Debian carried a patch for it in 1.58 to 1.60.

pieterlexis commented 4 years ago

EPEL7 ships Boost 1.69; we could do the same build trick we did for EL6 (using EPEL Boost) for aarch64 on EL7 as well....

kpfleming commented 4 years ago

Thankfully the RPM build process doesn't need 'convincing' to use all four cores like the DPKG build process did :-)

kpfleming commented 4 years ago

It appears that only recursordist suffers from this problem; pdns also uses boost::context, but uses different logic for selecting when to use it or not, and apparently chooses not to use it on CentOS 7 aarch64.

With the minimum version set to 1.61, I now have a successful build of all three on CentOS 7 aarch64; if someone wants to point me to the 'EPEL trick' I can try to apply it here so that all CentOS 7 packages use boost::context. It would be weird if the amd64 builds used boost::context and the aarch64 builds used ucontexts...

pieterlexis commented 4 years ago

It appears that only recursordist suffers from this problem; pdns also uses boost::context, but uses different logic for selecting when to use it or not, and apparently chooses not to use it on CentOS 7 aarch64.

pdns uses boost, but not boost::context. The files are a bit mixed in the repo (hence the symlink party in recursordist and dnsdistdist).

With the minimum version set to 1.61, I now have a successful build of all three on CentOS 7 aarch64; if someone wants to point me to the 'EPEL trick' I can try to apply it here so that all CentOS 7 packages use boost::context. It would be weird if the amd64 builds used boost::context and the aarch64 builds used ucontexts...

trick is shown here and here

arthurpetitpierre commented 4 years ago

Amazing job, kpfleming. For information, CentOS 7 / aarch64 isn't supported on Neoverse N1-based aarch64 processors (including Graviton2), and there's no plan to backport support for it. It runs, but there are several issues that aren't fixed and won't be. Starting with RHEL 8.2 / CentOS 8.2, everything is fine.

Habbie commented 4 years ago

I'm very much okay with opening pdns up to aarch64 only on the most current distributions.

kpfleming commented 4 years ago

If I can do this without too much effort I'll continue with the CentOS 7 support, as there are weird people running such systems on non-Neoverse processors :)

arthurpetitpierre commented 4 years ago

Yes, of course there are! And that is a good thing. I just want to be sure that nobody reads this, decides to run pdns on CentOS 7 / Neoverse N1 as a consequence, and hits those issues.

kpfleming commented 4 years ago

Ahh, well... hmm. The EPEL page says that EPEL-7 is no longer available for aarch64. That means we either drop CentOS 7 from the aarch64 package list, or we allow the fallback to ucontexts.

At this point, I think we could just go with this distro list for aarch64:

If the AWS crew want to work on support for Amazon Linux 2 they are certainly welcome to do so!

The aarch64 builders can also be used for the Raspbian Stretch and Buster packages, but those produce armhf packages, so aarch64 is not a concern there.

If this plan works for the maintainers, I'll start cleaning up my branch of changes for the builder-support tree and get a PR opened.

Habbie commented 4 years ago

If this plan works for the maintainers, I'll start cleaning up my branch of changes for the builder-support tree and get a PR opened.

Yes please!

andypost commented 4 years ago

Alpine Linux is building dnsdist for 7 architectures: https://pkgs.alpinelinux.org/packages?name=dnsdist&branch=edge

The latest update, to 1.5.1, shows a few failed tests:
test-suite.log:

=====================================
   dnsdist 1.5.1: ./test-suite.log
=====================================

# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: testrunner
================

Running 93 test cases...
unknown location(0): fatal error: in "dnsdistlbpolicies/test_lua": LuaContext::ExecutionErrorException: bad light userdata pointer
test-dnsdistlbpolicies_cc.cc(455): last checkpoint: "test_lua" test entry
unknown location(0): fatal error: in "dnsdistlbpolicies/test_lua_ffi_rr": LuaContext::ExecutionErrorException: bad light userdata pointer
test-dnsdistlbpolicies_cc.cc(511): last checkpoint: "test_lua_ffi_rr" test entry
unknown location(0): fatal error: in "dnsdistlbpolicies/test_lua_ffi_hashed": LuaContext::ExecutionErrorException: bad light userdata pointer
test-dnsdistlbpolicies_cc.cc(569): last checkpoint: "test_lua_ffi_hashed" test entry
unknown location(0): fatal error: in "dnsdistlbpolicies/test_lua_ffi_whashed": LuaContext::ExecutionErrorException: bad light userdata pointer
test-dnsdistlbpolicies_cc.cc(626): last checkpoint: "test_lua_ffi_whashed" test entry
unknown location(0): fatal error: in "dnsdistlbpolicies/test_lua_ffi_chashed": LuaContext::ExecutionErrorException: bad light userdata pointer
test-dnsdistlbpolicies_cc.cc(681): last checkpoint: "test_lua_ffi_chashed" test entry

*** 5 failures are detected in the test module "unit"
FAIL testrunner (exit status: 201)

kpfleming commented 4 years ago

That looks like the aarch64/luajit problem.