Closed Crushable1278 closed 4 months ago
Yep, official buster support ends at the end of this month.
Currently the build works anyway; it waits briefly for a timeout and then continues as usual.
I guess the freexian repository would be of similar magnitude to the Debian one? So something like several hundred GB? Not easy to host.
The least invasive workaround is to use the hosts file to rewrite the internal domain to a local nginx that would serve a local repository or proxy a remote one, perhaps even caching the results as a workaround for repeated pulls?
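For illustration, such a hosts override is a single line (the nginx address here is a made-up example; use whatever address your local nginx listens on):

```shell
# /etc/hosts - point the hardcoded internal domain at a local nginx
# that serves or proxies the repository (address is an example)
172.17.17.17  local.deb.vyos.io
```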
Changing the value isn't really a solution, since someone may want to use a different mirror rather than whatever we choose. Ideally this would be configurable: we could add an extra option to the configure script for adding additional repositories, plus a filter that automatically excludes everything from the local.deb.vyos.io domain.
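As a sketch of that filter (the file name and repository entries below are illustrative, not the real build files), dropping anything on the local.deb.vyos.io domain from a list of repository definitions could be as simple as:

```shell
# Sample additional_repositories entries (illustrative values)
cat > /tmp/additional_repositories <<'EOF'
deb http://local.deb.vyos.io/extended-lts buster main
deb http://deb.freexian.com/extended-lts buster main contrib non-free
EOF

# Filter: drop every definition that references the internal domain
grep -v 'local\.deb\.vyos\.io' /tmp/additional_repositories
```

The real implementation would apply the same exclusion to the parsed build configuration rather than a flat file.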
The freexian buster repository isn't as huge (~20GB), but it's still quite wasteful to copy it and then use only a fraction. So I think a caching proxy is a much better option for both us and the upstream repository. This way only the packages actually used will be fetched, and repeated fetches will be served from the local cache - this speeds up repeated builds and avoids wasting the freexian mirror's bandwidth.
apt-cacher-ng seems like a nice and appropriate tool for the job. It requires no configuration and handles the caching automatically.
apt install apt-cacher-ng
Now we can specify the repository in the form http://<host>:3142/<repository URL without protocol>, resulting in a definition like this:
deb http://172.17.17.17:3142/deb.freexian.com/extended-lts buster main contrib non-free
I like it, it can't be easier than this!
On the VyOS side we need a patch for the ./configure (build-config) script that lets us specify which ELTS mirror to use, something like --debian-elts-mirror http://172.17.17.17:3142/deb.freexian.com/extended-lts, plus a filter that removes any definitions with the local.deb.vyos.io domain from additional_repositories if --debian-elts-mirror is specified. This is already implemented via https://github.com/dd010101/vyos-build/commit/6e45e5102aa8ade167d9a445912a33936a0a7961.
I also updated the readme to include information about, and usage of, the ELTS repository for equuleus.
So is this issue resolved? What do you think? @Crushable1278
I like the idea, I just have one question about apt-cacher-ng.
If version 1 of package X is cached, and version 2 is released, will the client download version 2, or just see that version 1 is there, and use that?
Or in other words, does the cacher look upstream for newer versions?
Implementing an apt cache is easy once you know one thing: there are specific files you must not cache naively, namely the index files (Packages/Release/InRelease and such) that describe what packages exist and thus whether a package was updated. apt-cacher-ng understands these kinds of files (it calls them volatile) and has special handling for them. The simple solution would be to never cache index files, but apt-cacher-ng is smarter: it caches them but always checks whether a newer version exists - this is done so you can use the cache even if the upstream repository is currently unavailable.
The index files tell you what the repository has and thus whether something was updated, including which version (file name and thus URL) a package currently has. If a package is updated, the index file changes, the file name changes, and the URL changes, so the apt client sees the new package and fetches a file that doesn't yet exist in the cache - which adds it to the cache.
The Packages file contains a list of all packages, including their location, e.g. Filename: pool/main/p/python-defaults/python_2.7.16-1_amd64.deb. So the proxy doesn't care what exists, doesn't care whether something changed, and doesn't need to check for new packages - it just avoids caching those special index files. That lets the apt client know when something changed and where the new file is, and the apt client itself tells the cache when something new needs to be cached.
This way the cache automatically holds the old version and any new ones, so the last piece is to expire old versions so they don't pile up indefinitely.
apt-cacher-ng is a proxy dedicated to apt, so it solves all of this for you automatically and ships with sensible defaults. You don't need to worry about any of it unless you want to implement your own proxy or use generic proxy software.
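To make the mechanics concrete, here is a minimal sketch (with a made-up Packages stanza) of how apt derives a download URL from the index, which is why the proxy never needs its own update checks:

```shell
# A sample Packages stanza (hypothetical package version)
cat > /tmp/packages-sample <<'EOF'
Package: python
Version: 2.7.16-1
Filename: pool/main/p/python-defaults/python_2.7.16-1_amd64.deb
EOF

# apt joins the repository base URL with the Filename field; behind
# apt-cacher-ng the base URL is simply the proxied one
base="http://172.17.17.17:3142/deb.freexian.com/extended-lts"
file=$(awk '/^Filename:/ {print $2}' /tmp/packages-sample)
echo "$base/$file"
```

A new package version gets a new Filename, hence a new URL, hence a guaranteed cache miss on first fetch.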
Alternatively, if you want to mirror only the ELTS packages yourself (which I wouldn't recommend, use @dd010101's method instead):
debmirror -a amd64 -d buster-lts -s main,contrib,non-free --nosource --ignore-release-gpg \
-h deb.freexian.com -r extended-lts --method=http \
--progress /var/www/html/deb.freexian.com/extended-lts \
--exclude='(-dbg_|-dbg-)' --i18n
Notice that they warn about expiring GPG keys in the archive.debian.org buster repository as time progresses. In that case you should mirror the entire thing, i.e. replace -d buster-lts with -d buster (all packages, Debian + ELTS), as they'll re-sign them (I guess ...).
The first leads to roughly a 7GB download, the latter to 70GB.
EDIT: For anyone else reading this, note that the changes made to the data/defaults.json file, where the ELTS repo is added, are integrated at build time into the vyos-build docker image. So recreate that first, before you build anything else.
It seems they've frozen the code for 1.3.8. Has anyone built it yet? I'm getting ready to do that now ...
Alternatively, if you want to mirror only the ELTS packages yourself (which I wouldn't recommend, use @dd010101's method instead):
That's definitely not recommended: if everyone pulls and keeps the whole ELTS mirror updated, that's a major waste of upstream bandwidth, and it only grows as time progresses. The VyOS project uses only part of what Debian has to offer, so the delta between ELTS and what VyOS uses is significant. That's why we want to download only what VyOS uses, not everything from ELTS.
Therefore you should then mirror the entire thing, i.e. replace -d buster-lts by -d buster (all packages, debian + ELTS)
The VyOS team already uses their whole repository, not just the LTS section. How does that work? If you have two repositories with the same priority and the same package version, does apt pick a repository at random, do round-robin, or what? 😄
If they use the whole freexian repository, then shouldn't they drop the archive?
EDIT: For anyone else reading this, note that the changes made to the data/defaults.json file, where the ELTS repo is added, are integrated at build time into the vyos-build docker image. So recreate that first, before you build anything else.
Use the proper workaround instead - https://github.com/dd010101/vyos-build/commit/6e45e5102aa8ade167d9a445912a33936a0a7961. They should implement this themselves instead of using a hardcoded internal domain, but oh well..
apt-cacher-ng seems like a fantastic solution for this use case. Excellent improvement to the build script, dd.
EDIT: For anyone else reading this, note that the changes made to the data/defaults.json file, where the ELTS repo is added, are integrated at build time into the vyos-build docker image. So recreate that first, before you build anything else.
What do you mean by this? You're saying that changing data/defaults.json prior to building the image is not enough?
It seems they've frozen the code for 1.3.8. Has anyone built it yet?
I have built and am running 1.3.8. I passed make test, make testc, and make testraid, but find that my make testd fails in a particular place because, I believe, the test case needs to be fixed and is therefore a false failure.
I can't access http://local.deb.vyos.io/extended-lts, can you?
docker-vyos/Dockerfile:COPY [ "data/defaults.json", "data/live-build-config/archives/*", "docker-vyos/vyos_install_common.sh", "docker-vyos/vyos_install_stage_01.sh", "/tmp/" ]
The Dockerfile copies the file with the ELTS repository url to the /tmp directory of the build environment. Then
docker-vyos/vyos_install_common.sh: local APT_ADDITIONAL_REPOS=`jq --raw-output .additional_repositories[] /tmp/defaults.json`
and finally, in that same file, the ELTS repo string is written to the vyos-build docker image's sources.list file:
echo -e "deb ${APT_VYOS_MIRROR} ${APT_VYOS_BRANCH} main\n${APT_ADDITIONAL_REPOS}" > /etc/apt/sources.list.d/vyos.list
That makes it integrated at build time from my point of view.
If you can't access their ELTS mirror, then if you are using their vyos-build image, it won't be able to either. Consequently you have to modify the build environment first, assuming you are using the vyos-build docker image.
The path to the ELTS repo has to be accessible from within said docker image, i.e. if you use a 127.0.0.0/8 address that is unreachable from within the image, it won't be able to access the ELTS repo either. Correct?
The VyOS team already uses their whole repository, not just the LTS section. How does that work? If you have two repositories with the same priority and the same package version, does apt pick a repository at random, do round-robin, or what?
I'm replacing their URL with my own. They are using the full freexian repo, so unless they did something wrong, it shouldn't conflict with archive.debian.org. However, I haven't looked any further into it yet.
At least that is my logic of how this works. Please correct me if I'm wrong.
That makes it integrated at build time from my point of view.
Yes, it's integrated into the docker container itself. So over time, without maintaining a custom Equuleus build image, the docker build image packages would become out of date, but that has no impact on retrieving packages for the .iso build. Whether or not this impacts compilation of packages, I can't say.
I'm not well versed in the distribution of docker images, but wouldn't this also relate more to running apt-get install or upgrade inside the build container as the base system layers, including updates from Freexian at time of build, would be captured and published to docker hub?
I currently see 4 packages available for update inside my Equuleus container across all my machines, presumably updated in the last ~25 days, since the last Equuleus publish to docker hub.
Whether or not this impacts compilation of packages, I can't say.
I don't know, at least not with any certainty. I only know that VyOS added it to their vyos-build Dockerfile, so they are certainly using it when building both packages and the ISO. Whether they'll update it on Docker Hub after 1.3.8 is out, I don't know. My guess is that since they only tagged 1.3.8 in the last 2-3 days or so, they'll wait till it is stable before pushing. But what do I know.
In such cases I err on the side of caution and just rebuild it all, avoiding potential problems. Whether it is really necessary, I can't say.
I'm not well versed in the distribution of docker images, but wouldn't this also relate more to running apt-get install or upgrade inside the build container as the base system layers, including updates from Freexian at time of build, would be captured and published to docker hub?
From what I understand, if I want to build the ISO from within the vyos-build docker image, which is what I do, then I need to make sure that the build environment is able to access the ELTS repo, so that the ISO can be built with these packages available to the apt-get install commands that are part of the ISO build process.
Perhaps you aren't using the vyos-build docker image in that way, then please ignore me ...
Don't know if it is related, but I can't build the wide-dhcpv6 package (for equuleus) anymore (vyos-build/packages/wide-dhcpv6). Its definition hasn't been touched in two years, but it built fine for me a month ago, so something must have changed between then and now. Luckily I still have the previous package, so it isn't a problem for me at the moment.
Check out https://github.com/dd010101/vyos-jenkins/issues/23 and see if that helps.
Thank you!
Off-topic! The ELTS is solved, right? If yes, please close.
@pittagurneyi
The Dockerfile copies the file with the ELTS repository url to the /tmp directory of the build environment
You are looking at docker-vyos - that's VyOS in a docker container, not the docker build image. I think we are talking about the docker build image, right?
@Crushable1278
Yes, it's integrated into the docker container itself...
Neither ELTS nor defaults.json is part of the docker build image - defaults.json comes from whatever repository you mount when you run docker run. Thus modifying defaults.json before the build will work as expected. docker run mounts your repository, then you run ./configure (build-config); ./configure loads defaults.json, combines it with your ./configure arguments, and this produces build/build-config.json, which then eventually flows into sources.list.d - it's all runtime.
...over time, without maintaining a custom Equuleus build image, the docker build image packages would become out of date, but that has no impact on retrieving packages for the .iso build. Whether or not this impacts compilation of packages, I can't say.
Not because of the ELTS, but yes. The docker build image is pretty much part of the build process: it has elements that directly change the resulting ISO, so an out-of-date docker build image can successfully produce a broken ISO. Although the most likely failure mode is that the outdated docker build image will simply fail the build.
That's why I added an automated docker build image build - so people don't need to think about it. Jenkins periodically checks whether the definition changed, rebuilds the image, and pushes it into the local registry, so the docker build image is in theory always up to date.
There are many ways to do this. Previously I had a simple script that compared the git hash of vyos-build against the previous run and rebuilt the docker build image before the ISO build if the hash changed - not the most efficient way, but simple. Either way, you should think of the docker build image as part of the build, so reusing an older image is bad. The Docker Hub version broke the build so many times for me that I ignore its existence.
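A sketch of that hash-compare approach (names and paths are illustrative; the echo stands in for the actual docker build invocation):

```shell
# Rebuild the docker build image only when the vyos-build git hash changed
rebuild_if_changed() {
    new_hash="$1"
    state_file="$2"
    old_hash=$(cat "$state_file" 2>/dev/null || true)
    if [ "$new_hash" != "$old_hash" ]; then
        echo "rebuild"   # the real script would run: docker build ... vyos-build/docker
        printf '%s' "$new_hash" > "$state_file"
    else
        echo "skip"
    fi
}

# Usage (hypothetical paths):
# rebuild_if_changed "$(git -C vyos-build rev-parse HEAD)" /var/tmp/vyos-build.hash
```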
EDIT: As @dd010101 pointed out, I was wrong here: I accidentally assumed that they'd already integrated the ELTS repos into the build process of the vyos-build docker image, but for whatever reason they didn't. vyos-build/docker-vyos/Dockerfile is not the Dockerfile for the vyos-build docker image; vyos-build/docker/Dockerfile is. The latter makes no use of the additional_repositories defined in data/defaults.json, it seems ...
I've looked into it a bit more. The vyos-build image (created by vyos-build/docker-vyos/Dockerfile, henceforth called vyos-build docker image or vyos-build image) only uses the additional repositories during its own build, but removes the sources.list file that contains them afterwards. So the final vyos-build docker image may have been built using said repos, but they aren't part of the build environment afterwards, which is a strange choice if you ask me.
vyos-build/docker-vyos/vyos_install_common.sh is used as a library, and its functions relevant to the repos are:
/etc/apt/sources.list.d/vyos.list
/etc/apt/sources.list.d/vyos.list
The following scripts that are part of the vyos-build docker image build process each invoke both:
The only other script that accesses the additional repos seems to be:
scripts/build-config:args['custom_apt_entry'] = args['custom_apt_entry'] + build_defaults['additional_repositories']
It is only called by ./configure in the vyos-build git repo. args['custom_apt_entry'] is used in scripts/live-build-config, which in turn is invoked by the Makefile.
@dd010101: So for building the ISO your patch works perfectly fine!
I just hadn't expected them to remove the ELTS repository from the vyos-build docker image, because the normal LTS ones are in there:
deb http://deb.debian.org/debian-security buster/updates main
deb http://deb.debian.org/debian buster-updates main
From my point of view it shouldn't have been removed again during the vyos-build docker image build process or added in another way.
All packages that one builds on top of vyos-build won't have access to the ELTS packages should any of them require apt-get install commands. Would it be possible to get an inconsistent state that way?
The vyos-build image (created by vyos-build/docker-vyos/Dockerfile, henceforth called vyos-build docker image or vyos-build image)
No! vyos-build/docker-vyos/Dockerfile is VyOS inside a container, as a router. The docker build image is vyos-build/docker/Dockerfile. It's confusing; you need to look carefully. There are two different VyOS docker images - one is VyOS as a router, the other is vyos-build.
All packages that one builds on top of vyos-build won't have access to the ELTS packages should any of them require apt-get install commands. Would it be possible to get an inconsistent state that way?
That's a good point - support for Debian buster in the docker build image will soon end the same way it ends for the Debian inside the VyOS ISO. The priority is surely on a different level, but the Debian in docker should have ELTS as well. It won't create inconsistency directly per se, but the build tools can have security issues the same way any other package can, so you don't want to use an unsupported OS even as a build tool.
No! vyos-build/docker-vyos/Dockerfile is VyOS inside a container, as a router.
Thank you, my bad. That means the ELTS repos aren't yet used anywhere in the build process of the vyos-build docker image or any packages built on top of it ...
I just did a grep for additional_repositories, saw the docker-vyos/* files that used it, and thought: okay, there they are integrating the ELTS repo ...
It's used inside the live-build chroot. So the Debian that will become the VyOS ISO uses packages from ELTS, yet the build OS itself doesn't.
./configure creates build/build-config.json, which contains the ELTS entry (via custom_apt_entry, merged with additional_repositories). That's then reused by live-build-config to generate sources.list.
Thank you for linking the points where it is used.
Do you see any elegant way to add the ELTS repo to the vyos-build Dockerfile? At the moment it seems we'd have to add something like this:
RUN echo "deb http://deb.freexian.com/extended-lts buster main contrib non-free" > /etc/apt/sources.list.d/freexian-lts.list
As the apt cache wouldn't be available within the vyos-build docker image, we'd have to either point it to deb.freexian.com directly or ensure the apt cache is reachable via a proper DNS entry inside the local network.
Whether we have to disable the original /etc/apt/sources.list with content
deb http://deb.debian.org/debian buster main
deb http://deb.debian.org/debian-security buster/updates main
deb http://deb.debian.org/debian buster-updates main
I don't know. That is what you referred to regarding the round-robin/priority problem. At least when it comes to their docker-vyos/Dockerfile, they aren't deactivating them as far as I can see (vyos_install_common.sh).
The docker image doesn't have any other mechanism than "just do it", as you show. I don't like the hardcoded freexian mirror, since this is a prime example of how to really waste bandwidth - it will fetch packages each time you build the image, and that will be a repeating event, the same as with the ISO build.
That's why I would suggest doing the same thing as for the ISO build: implement a configurable ELTS mirror, so you can provide your own. Then you can use the same apt-cacher-ng method and share the cache with the ISO build - very good for you and for the upstream as well.
You can use your apt-cacher-ng with ease - just provide the configuration option for the docker builder. Of course you can't use localhost, but you can use an internally addressable IP like the dummy 172.17.17.17 we use, or any other you have lying around, like the IP of the docker interface.
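For reference, the dummy-interface setup assumed throughout this thread could look roughly like this (requires root; interface name and address are the ones we use as examples, adjust to taste):

```shell
# Create a dummy interface holding an internally addressable IP that
# docker containers can reach (unlike 127.0.0.0/8 addresses)
ip link add dummy0 type dummy
ip addr add 172.17.17.17/32 dev dummy0
ip link set dummy0 up
```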
I made a patch for vyos-build/docker/Dockerfile to make this possible. You need to include the additional argument --build-arg ELTS_MIRROR=... with your ELTS mirror when building your docker build image, like so:
docker build --build-arg ELTS_MIRROR=http://172.17.17.17:3142/deb.freexian.com/extended-lts \
--no-cache -t vyos/vyos-build:equuleus .
This builds the docker image with the ELTS repository present, so the image upgrade and all future package installs can use it.
Whether we have to disable the original /etc/apt/sources.list with content
The same issue exists with the live-build chroot (future VyOS), where it has:
# grep -r "deb " /etc/apt/sources*
/etc/apt/sources.list:deb http://archive.debian.org/debian buster main contrib non-free
/etc/apt/sources.list:deb http://deb.debian.org/debian-security buster/updates main contrib non-free
/etc/apt/sources.list:deb http://archive.debian.org/debian buster-updates main contrib non-free
/etc/apt/sources.list:deb http://archive.debian.org/debian buster-backports main contrib non-free
/etc/apt/sources.list.d/vyos.list:deb http://172.17.17.17/equuleus equuleus main
/etc/apt/sources.list.d/bullseye.list:deb http://deb.debian.org/debian/ bullseye main
/etc/apt/sources.list.d/custom.list:deb [arch=amd64] https://repo.saltproject.io/py3/debian/10/amd64/3003 buster main
/etc/apt/sources.list.d/custom.list:deb [arch=amd64] http://repo.powerdns.com/debian buster-rec-48 main
/etc/apt/sources.list.d/custom.list:deb http://172.17.17.17:3142/deb.freexian.com/extended-lts buster main contrib non-free
The docker build image has (as you post):
# grep -rE "^deb" /etc/apt/sources*
/etc/apt/sources.list:deb http://deb.debian.org/debian buster main
/etc/apt/sources.list:deb http://deb.debian.org/debian-security buster/updates main
/etc/apt/sources.list:deb http://deb.debian.org/debian buster-updates main
If priority and version are the same, then apt prefers the main sources.list, so archive.debian.org wins over ELTS in sources.list.d. It goes in alphabetical order: the main sources.list first, then the lists in sources.list.d/*.list.
Thus you can: 1) increase the priority, 2) remove the main Debian mirror from sources.list, or 3) inject ELTS as the first entry in sources.list.
The first option can break things - don't mess with priority if you don't need to... The second option works, but I think the third is better, since apt will then go to ELTS first, and if a package isn't there, it tries the Debian mirrors.
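Option 3 can be sketched on a copy of the file like this (the mirror URL and stock entries are the ones quoted elsewhere in this thread; /tmp path is just for illustration):

```shell
# Start from the stock main sources.list of the build image
cat > /tmp/sources.list <<'EOF'
deb http://deb.debian.org/debian buster main
deb http://deb.debian.org/debian-security buster/updates main
deb http://deb.debian.org/debian buster-updates main
EOF

# Inject the ELTS mirror as the first entry so apt tries it first
sed -i '1i deb http://172.17.17.17:3142/deb.freexian.com/extended-lts buster main contrib non-free' /tmp/sources.list
head -n 1 /tmp/sources.list
```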
But there is option 4) make the Debian key never expire. Then apt will prefer the archive if ELTS has the same thing. Sounds like the best option? Since we won't waste ELTS bandwidth if the archive has it.
Does the Debian key in vyos-build and/or the live-build chroot have infinite expiration? Nope:
# gpg --list-packets --verbose /etc/apt/trusted.gpg.d/debian-archive-buster-automatic.gpg | grep expire
version 4, algo 1, created 1555228135, expires 0
hashed subpkt 9 len 4 (key expires after 8y0d0h0m)
version 4, algo 1, created 1555228135, expires 0
hashed subpkt 9 len 4 (key expires after 8y0d0h0m)
It will expire in 2032 - perhaps this is not pressing issue as of now... 😄
That's why I don't think we need to think about the archive vs ELTS aspect for now...
I don't like the hardcoded freexian mirror, since this is a prime example of how to really waste bandwidth - it will fetch packages each time you build the image, and that will be a repeating event, the same as with the ISO build.
I agree with you. Going forward I'll configure the cache; I only built the ISO once today, so I'm okay with it this once.
--build-arg ELTS_MIRROR=
As I use Jenkins to build the vyos-build docker image, for me it doesn't really matter whether I hardcode it myself in the Dockerfile or the Jenkinsfile ... but the idea is a good one in general.
At the moment the Dockerfile for vyos-build does not yet use http://archive.debian.org/..., correct? So a change there is necessary as well, I guess. For 1.3.9 they'll probably adjust it.
It will expire in 2032 - perhaps this is not pressing issue as of now... 😄
Well if I'd known that I wouldn't have even mentioned it ;)
It says beware of it on the freexian website in the instructions.
As I use Jenkins to build the vyos-build docker image, for me it doesn't really matter whether I hardcode it myself in the Dockerfile or the Jenkinsfile ... but the idea is a good one in general.
If you have a personal fork then yeah, but I need to think more universally, since this project is mainly for others, not for myself. I think I will solve the Jenkins case with a Jenkins build parameter that has http://172.17.17.17:3142/deb.freexian.com/extended-lts as fallback - this will let you use your own mirror via the Jenkins build parameter, or just use what we use if you have the same dummy0 setup. I made a patch so this is possible now.
At the moment the Dockerfile for vyos-build does not yet use http://archive.debian.org/..., correct?
No, it doesn't. This is based on the official buster Docker Hub image, so this is a question for them more than for the VyOS team or us. I will wait and see.
It will expire in 2032 - perhaps this is not pressing issue as of now... 😄
Well if I'd known that I wouldn't have even mentioned it ;) It says beware of it on the freexian website in the instructions.
Well... it turns out I can't read the gpg output correctly...
# gpg --list-packets --verbose /etc/apt/trusted.gpg.d/debian-archive-buster-automatic.gpg | grep expire
version 4, algo 1, created 1555228135, expires 0
hashed subpkt 9 len 4 (key expires after 8y0d0h0m)
version 4, algo 1, created 1555228135, expires 0
hashed subpkt 9 len 4 (key expires after 8y0d0h0m)
This says 8 years from 1555228135 (April 2019), thus April 2027 - still not tomorrow, but sooner than 2032!
apt-key list is an easier way to see the expiration of all loaded keys.
Since we are already talking about it, we might as well solve the issue! The expiration workaround mentioned in the freexian documentation seems like a straightforward deal. We can use a script like gpgvnoexpkeysig that translates EXPKEYSIG into GOODSIG, so an expired key is treated as a good key.
It's easy to install as well:
wget "https://salsa.debian.org/debian/mmdebstrap/-/raw/master/gpgvnoexpkeysig" -O /usr/libexec/gpgvnoexpkeysig
chmod +x /usr/libexec/gpgvnoexpkeysig
You can use it manually:
apt-get -o Apt::Key::gpgvcommand=/usr/libexec/gpgvnoexpkeysig update
You may as well use it globally:
echo 'Apt::Key::gpgvcommand "/usr/libexec/gpgvnoexpkeysig";' > /etc/apt/apt.conf.d/01gpgvnoexpkeysig
This way there will be no pesky KEYEXPIRED from apt!
I made patches for the docker build image as well as for the live-build chroot to use gpgvnoexpkeysig.
With all of the above, both the docker build image and the live-build chroot (future VyOS) use ELTS packages cached via apt-cacher-ng but prefer the Debian repositories first. This way we don't waste ELTS bandwidth or grow the cache with packages that Debian already has. The future expiration of the Debian archive keys (or any key, actually) is ignored thanks to gpgvnoexpkeysig. This is the most efficient way to keep the ELTS bandwidth and cache size at a minimum.
Wonderful work! Thank you for looking into it and solving any remaining issues.
I'm working on lots of other things at the moment and won't have time to test it for quite a while. I also remember that I promised you I'd look at the scripts, etc., but I just haven't had the time yet :/ Sorry about that.
Should you encounter any issues, would you mind leaving a quick note here, so I get notified via email and know about it for when I'll need to apply the necessary modifications to my repo?
If I remember correctly, with your approach, we could restrict the repository to buster-lts, correct? The freexian documentation said only if we had key problems would we need the full repo available, i.e. buster.
So this should be sufficient?!
# no key problems, only ELTS packages
deb http://deb.freexian.com/extended-lts buster-lts main contrib non-free
# full repo with all debian packages included
#deb http://deb.freexian.com/extended-lts buster main contrib non-free
I also remember that I promised you I'd look at the scripts, etc., but I just haven't had the time yet :/ Sorry about that.
It doesn't matter, since the scripts are already outdated/superseded anyway! The automation took a huge leap forward thanks to the work on install scripts by GurliGebis - there will soon be a script to automate the whole Jenkins/build/ISO process, so the current/old scripts aren't important anymore.
Should you encounter any issues, would you mind leaving a quick note here, so I get notified via email and know about it for when I'll need to apply the necessary modifications to my repo?
Sure, if I don't forget!
If I remember correctly, with your approach, we could restrict the repository to buster-lts, correct? The freexian documentation said only if we had key problems would we need the full repo available, i.e. buster.
Yes, it's enough. If you use the whole buster, it doesn't matter - same result either way: the Debian mirror will have preference, and ELTS will be used only if it has something newer, thanks to the order of the repositories in sources.list. I used what I used just because I copied the parameters from the VyOS definition - it doesn't really serve a purpose. It would start to matter only if you changed the order of the deb definitions, but as it is now, it doesn't.
BTW
I have built and am running 1.3.8. I passed make test, make testc, and make testraid, but find that my make testd fails in a particular place because, I believe, the test case needs to be fixed and is therefore a false failure.
make testd is a strange case...
It is supposed to test configuration with vyos-configd.service (the code) enabled (as opposed to disabled, the default). This service acts as an IPC proxy that allows reuse of a single Python interpreter which already has the state loaded, reducing the overhead of loading the state again and again for each block (script) of a given commit, thus improving commit speed - especially for many-block commits like boot.
The test isn't testing anything specific - it just enables or disables the service. The service sets a shim that the configuration scripts use: if the service runs, the shim makes the configuration scripts use the proxy; otherwise the shim is empty and the proxy isn't used.
Yet the service/shim is activated in the final VyOS image, so why would you want to test without it? I have no idea... The with/without option made sense in the past, when --configd was new and not yet used in the release, but now it doesn't make sense to test without it, if you want to match what the releases use... That's just some background on what --configd is and thus what make testd does...
How can --configd break tests then? It shouldn't - it runs the same scripts, just in a different way... The salt-minion test that fails (even for me) tests the default config, where the configuration writes socket.gethostname() as the default into the config file and the test compares it with socket.gethostname(). Thus socket.gethostname() != socket.gethostname() - how is that possible? Well, it must have some relation to the fact that --configd executes the configuration via the service/proxy while the test runs outside of it, but how can the configuration see socket.gethostname() == "debian" while the test sees socket.gethostname() == "vyos"? That's the strange part... I see that make testd shows the vyos hostname in the bash prompt, so that matches what the test sees, but the service sees debian, so...?
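One way to narrow this down (my assumption: such a mismatch means the service and the test see different UTS nodenames, e.g. one of them runs in a different namespace or read the name before it was changed) is to compare the sources directly, since within a single UTS namespace they all read the same kernel value:

```shell
# All three read the kernel's UTS nodename, so within one UTS namespace
# they must agree; if the configd service reports a different name, it
# runs under a different UTS namespace or sampled the name earlier
hostname
uname -n
python3 -c 'import socket; print(socket.gethostname())'
```

Running this both in the test shell and from within the service context (e.g. via a debug log line) should show where the "debian" value comes from.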
Interesting to note: the Jenkinsfile/automated ISO build doesn't use make testd (only make test). Although, as I see it, since vyos-configd.service is enabled by default, it doesn't make sense to test without it - and the opposite is true as well: you don't want make test, since that's not how VyOS runs in the wild. Yet that's exactly what the Jenkinsfile does!
As I see it the service is broken in some way, not the tests, and they are running the wrong automated tests - that's why they don't know about it? 😄 I don't know if this is an equuleus-specific issue; they don't run automated builds for equuleus, but they do run them for rolling, and rolling also tests without --configd. The rolling GitHub action doesn't use make testd either.
As I see it the service is broken in some way, not the tests, and they are running the wrong automated tests - that's why they don't know about it? 😄 I don't know if this is an equuleus-specific issue; they don't run automated builds for equuleus, but they do run them for rolling, and rolling also tests without --configd.
From what you wrote, I'd bet that is just something else they haven't really thought about in quite a while.
Regarding the socket.gethostname():
https://askubuntu.com/questions/1464419/python-socket-gethostname-data-source:
I'd expect Python's socket.gethostname() to be just a thin wrapper around the gethostname system call, which on Ubuntu (using glibc) is in turn a wrapper for the uname system call
https://linux.die.net/man/2/sethostname:
The GNU C library does not employ the gethostname() system call; instead, it implements gethostname() as a library function that calls uname(2) and copies up to len bytes from the returned nodename field into name.
https://unix.stackexchange.com/questions/164089/where-does-hostname-store-the-hostname-that-ive-set:
Take a look at this related U&L Q&A titled: Where does uname get its information from?. Information such as the hostname persists within a data structure within the Linux kernel, while the system is running. During a system's boot this information can be reattained through a variety of mechanisms that is typically distro specific.
So are the two started at two different points in time, and one inherits a different environment?!
I'd assume that the data structure within the Linux kernel is always the same and that a copy of it doesn't get inherited at process start time, but maybe that is what is going on?
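A quick stdlib check of the claim quoted above - assuming a Linux host with glibc, socket.gethostname() and the uname(2) nodename should always agree within one process, since both read the same live kernel value (an illustrative snippet, not from the thread):

```python
import os
import socket

# Both calls read the kernel's current nodename, so a mismatch like the
# debian-vs-vyos one discussed above should be impossible within a single
# process - only separately started processes with changed state could differ.
hostname_via_socket = socket.gethostname()
hostname_via_uname = os.uname().nodename

print(hostname_via_socket, hostname_via_uname)
assert hostname_via_socket == hostname_via_uname
```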
Note: I haven't yet looked at make testd/--configd.
I mean...
Python:
#!/usr/bin/env python3
import socket
import time
while True:
    print(socket.gethostname(), flush=True)
    time.sleep(10)
Systemd:
[Service]
Type=simple
ExecStart=/root/hostname.py
[Install]
WantedBy=multi-user.target
Result:
# systemctl status hostname.py.service
● hostname.py.service
Loaded: loaded (/etc/systemd/system/hostname.py.service; disabled; preset: enabled)
Active: active (running) since Mon 2024-06-24 13:43:26 CEST; 22s ago
Main PID: 1100 (python3)
Tasks: 1 (limit: 9434)
Memory: 4.0M
CPU: 17ms
CGroup: /system.slice/hostname.py.service
└─1100 python3 /root/hostname.py
Jun 24 13:43:26 bookworm systemd[1]: Started hostname.py.service.
Jun 24 13:43:27 bookworm hostname.py[1100]: bookworm
Jun 24 13:43:37 bookworm hostname.py[1100]: bookworm
Jun 24 13:43:47 hello hostname.py[1100]: hello
Python running inside a systemd service sees a hostname change at runtime right away.
Also the boot hostname:
cat /mnt/squashfs/etc/hostname
localhost.localdomain
It isn't debian either - as I understand it, the OS boots with localhost.localdomain and then the configuration switches to whatever config.boot has - vyos by default.
Like I said, I haven't really looked at what it does. That was just a guess ... either someone hardcoded it somewhere, or something else is going on. Otherwise I have no idea how the error message should ever occur. Especially "debian", it has to come from somewhere.
Linux kernel CONFIG_DEFAULT_HOSTNAME ?
# grep CONFIG_DEFAULT_HOSTNAME= /mnt/squashfs/boot/config-5.4.268-amd64-vyos
CONFIG_DEFAULT_HOSTNAME="(none)"
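As a cross-check (an illustrative snippet, not from the thread), the kernel's live nodename can also be read straight from procfs on Linux - it should match what gethostname() returns, regardless of what CONFIG_DEFAULT_HOSTNAME was at build time:

```python
import socket

# /proc/sys/kernel/hostname exposes the same kernel nodename that
# sethostname(2)/gethostname(2) operate on (Linux only).
with open('/proc/sys/kernel/hostname') as f:
    kernel_hostname = f.read().strip()

print(kernel_hostname)
assert kernel_hostname == socket.gethostname()
```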
Yep, it needs to come from somewhere - but from where?
strace it? Sorry, I'm knee deep in other stuff and don't have time to look into it... perhaps later.
EDIT: As it seems to be retrieved via glibc, does debian modify the library and hardcode a default in case (none) is set by the kernel?
EDIT2: https://packages.debian.org/buster/glibc-source ?! I can't find it in there and I looked at http://security.debian.org/debian-security/pool/updates/main/g/glibc/glibc_2.28-10+deb10u3.debian.tar.xz.
It's a trip, because if I boot the ISO that the smoketest uses and call what the smoketest calls via the vyos-configd proxy, then I see vyos there :).
So other tests running before salt-minion have an influence on this?
LOL
We have two options (I think):
It seems to require a more in-depth analysis of what exactly the environments look like and what kind of commands are run in the test.
Now they are raising the memory limit for the smoketest as well: https://github.com/vyos/vyos-build/pull/659/commits/7057a25a2c0297245294d7d5c71e79ac09e02e94
I randomly managed to reproduce it on first boot; the second boot didn't fail (since the hostname was already set). It seems like there is an issue with the running vyos-configd process - it sees the wrong hostname. On plain Debian bookworm this isn't an issue, but here it is for whatever reason. Thus you can restart vyos-configd after the first boot and this fixes the issue. I don't think you can do much else about it.
If there was restart instead of start, it would work I think. The service is running by default, thus the start doesn't do anything. This points to a bigger issue though - if the smoketest has this difficulty, then VyOS has the same difficulty when the hostname changes before you reboot, right? Like when you change the hostname, the configuration still uses the boot hostname until you reboot. Ideally the configd service would be restarted when you change the hostname, but there is the trap - the service that changes the hostname would need to do the restart... A service that restarts itself? I don't think that's legal, since the finishing code needs to run after the hostname code runs...
Now they are raising the memory limit for the smoketest as well: https://github.com/vyos/vyos-build/commit/7057a25a2c0297245294d7d5c71e79ac09e02e94
I know, and they didn't understand it https://github.com/vyos/vyos-build/pull/656#pullrequestreview-2117787251. I wonder how many bugs we discovered that they have yet to discover 😄. If only they hadn't excluded their community from their forum, they could have saved themselves some work, with us going through all the build/testing steps they haven't gone through for a while - in some cases for years...
EDIT: yep, the restart instead of start in the check-qemu-install fixes the smoketest
DEBUG - OK
DEBUG - Running Testcase: /usr/libexec/vyos/tests/smoke/cli/test_service_salt.py
DEBUG - test_default (__main__.TestServiceSALT) ... ok
DEBUG - test_options (__main__.TestServiceSALT) ... ok
DEBUG -
I can also confirm that changing the start to restart on Equuleus fixes the test_service_salt failure for me. Interestingly, an unmodified check-qemu-install works fine with Sagitta.
Yet again, you've saved our bacon with fixing a silly bug. Thank you.
For me using anything but LTS is no option, ever. Even though I'd be using it in my homelab, I would depend on it working 24/7. I have no time to go bug hunting, etc. when I update it. My homelab already requires far too much maintenance work, which only continues to increase every year. So VyOS "just" has to work. For me, that is what LTS with extended testing is all about. Current is worthless to me for my use case, unless I want to contribute a feature. I'd go so far as to ignore the first few point releases of a new major release, because I don't have the time for it.
Yeah, I don't get the idea that an unstable router is fine if you don't have large revenue either. If your router stops working on reboot, then even at home you are limited in your capacity and time to resolve such issues: you need to set up alternative internet access to help you, inter-VLAN traffic stops working, etc... You will likely not lose revenue because of it, but it's still a bad experience...
A router is an appliance that should just work - it's the one part of any network you really don't want to debug. Fortunately VyOS has the option to roll back via GRUB - that will save your bacon in such situations, but that's still not an experience you want to have...
I made a little change to check-qemu-install (not a confusing name at all!) to easily bootstrap into the same environment the automated smoketest uses. This allows calling scripts/check-qemu-install --configd --debug --uefi build/live-image-amd64.hybrid.iso --sandbox (the same command the Makefile uses for test, just with --sandbox appended; the test options don't do anything - this skips all tests) and it gives me console access after the preparations, so I can mess about in the same environment the smoketest fails in. Simply poweroff to exit.
I still see one issue with make testd (equuleus) where test_interfaces_bonding fails on test_dhcp_vrf (inherited from base_interfaces_test). Specifically, ip vrf pids returns a PID that doesn't correspond to any running dhclient. Do you see this as well?
DEBUG - Running Testcase: /usr/libexec/vyos/tests/smoke/cli/test_interfaces_bonding.py
DEBUG - test_add_multiple_ip_addresses (__main__.BondingInterfaceTest) ... ok
DEBUG - test_add_single_ip_address (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_hash_policy (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_lacp_rate (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_mii_monitoring_interval (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_min_links (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_multi_use_member (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_remove_member (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_source_bridge_interface (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_source_interface (__main__.BondingInterfaceTest) ... ok
DEBUG - test_bonding_uniq_member_description (__main__.BondingInterfaceTest) ... ok
DEBUG - test_dhcp_disable_interface (__main__.BondingInterfaceTest) ... ok
DEBUG - test_dhcp_vrf (__main__.BondingInterfaceTest) ... FAIL
DEBUG - test_dhcpv6_client_options (__main__.BondingInterfaceTest) ... ok
DEBUG - test_dhcpv6_vrf (__main__.BondingInterfaceTest) ... ok
DEBUG - test_dhcpv6pd_auto_sla_id (__main__.BondingInterfaceTest) ... ok
DEBUG - test_dhcpv6pd_manual_sla_id (__main__.BondingInterfaceTest) ... ok
DEBUG - test_interface_description (__main__.BondingInterfaceTest) ... ok
DEBUG - test_interface_disable (__main__.BondingInterfaceTest) ... ok
DEBUG - test_interface_ip_options (__main__.BondingInterfaceTest) ... ok
DEBUG - test_interface_ipv6_options (__main__.BondingInterfaceTest) ... ok
DEBUG - test_interface_mtu (__main__.BondingInterfaceTest) ... ok
DEBUG - test_ipv6_link_local_address (__main__.BondingInterfaceTest) ... ok
DEBUG - test_mtu_1200_no_ipv6_interface (__main__.BondingInterfaceTest) ... ok
DEBUG - test_span_mirror (__main__.BondingInterfaceTest) ... ok
DEBUG - test_vif_8021q_interfaces (__main__.BondingInterfaceTest) ... ok
DEBUG - test_vif_8021q_lower_up_down (__main__.BondingInterfaceTest) ... ok
DEBUG - test_vif_8021q_mtu_limits (__main__.BondingInterfaceTest) ... ok
DEBUG - test_vif_s_8021ad_vlan_interfaces (__main__.BondingInterfaceTest) ... ok
DEBUG -
DEBUG - ======================================================================
DEBUG - FAIL: test_dhcp_vrf (__main__.BondingInterfaceTest)
DEBUG - ----------------------------------------------------------------------
DEBUG - Traceback (most recent call last):
DEBUG - File "/usr/libexec/vyos/tests/smoke/cli/base_interfaces_test.py", line 160, in test_dhcp_vrf
DEBUG - self.assertIn(str(dhclient_pid), vrf_pids)
DEBUG - AssertionError: '18749' not found in '18969 dhclient\n18999 dhclient-script\n19400 dhclient-script\n19401 ps\n19402 awk'
My notes regarding that were:
Happens sometimes, reruns typically have no issues.
DEBUG - test_dhcp_vrf (main.BondingInterfaceTest) ... FAIL
https://vyos.dev/T6025 T6025 smoketest: bonding: test fails every once in a while
EDIT: Only saw it on equuleus, not sagitta ...
EDIT2: If I remember correctly, I spent roughly an hour investigating, but couldn't figure out why the PID is sometimes already gone when the test checks whether it is running.
EDIT3: This occurred for me with the normal smoketests, i.e. make test by Jenkinsfile.
@dd010101 is there anything inside the jenkins configs that should be changed wrt. this? (Just so I can update my installer scripts) - I hope to be able to work on it this week, but I'm quite busy 😊
@GurliGebis Not Jenkins (Jenkins just pulls the updated source from Git with the changes included) - but look for ELTS mentions in the readme. There is a new apt install dependency, the docker build has an additional argument, and the ISO configure has an additional argument too. The extra arguments are equuleus-only.
@pittagurneyi Overall the interface tests are flaky - smells like a race condition of some kind... I ran the tests many times and saw various different failures that resolve themselves magically. Specifically, I saw tearDown randomly fail because dhclient was left behind after deleting the interface - I can explain this by the fact that dhclient can exist for a brief period after you ask it to exit, so in this case there could be a race condition between when dhclient actually exits and when the smoketest checks its existence.
I did manage to reproduce the dhclient PID missing with --sandbox:
FAIL: test_dhcp_vrf (__main__.BondingInterfaceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/libexec/vyos/tests/smoke/cli/base_interfaces_test.py", line 160, in test_dhcp_vrf
self.assertIn(str(dhclient_pid), vrf_pids)
AssertionError: '3988' not found in '4021 dhclient\n 4041 dhclient-script\n 4447 dhclient-script\n 4448 ps\n 4449 dhclient-script'
This is what I see in the journal around the dhclient PID 3988 of interest - this is shared log of everything:
There is a huge mess of PID file X does not exists, killing dhclient with SIGTERM signal messages - that's a bit suspicious. The 02-vyos-stopdhclient: line 34: kill: (X) - No such process is also weird - is that all normal? 😄 I don't think so - since if the test passes I don't see any of those errors.
Another run:
AssertionError: '5763' not found in '5817 dhclient\n 5826 dhclient-script\n 6619 dhclient-script\n 6623 ps\n 6624 awk'
PID of interest 5763.
bondX: can't get flags: No such device doesn't sound healthy at all... What is going on? I don't know... The successful run doesn't have any of those, so I don't think this is normal behaviour... Does it try to start dhclient on a non-existing interface? That would explain why the dhclient doesn't want to exist, but I don't see any dhclient errors...
Adding an install dependency is really easy 😊
@dd010101 Like we both said, there is a race condition - or from what I've seen so far, probably many - and whoever wrote the test scripts didn't include the necessary commands to wait/check for process shutdown before moving on.
EDIT: That includes every action taken that results in an interface change or a general system state change. If no one verifies that the desired state is actually achieved before acting on it, one always gets an indeterminate state.
The dhclient PID is more complicated though...
dhclient_pid = process_named_running(dhclient_process_name, cmdline=interface)
self.assertTrue(dhclient_pid)
# .. inside the appropriate VRF instance
vrf_pids = cmd(f'ip vrf pids {vrf_name}')
self.assertIn(str(dhclient_pid), vrf_pids)
From here.
It successfully finds a dhclient PID with the interface name in its cmdline. Thus dhclient is running, yet not in the VRF - how could that happen? I see only two ways:
1) The dhclient is from a previous test. Yet there is a test to verify this cannot happen; if it did happen, the previous test would fail.
2) How is it possible to configure an interface under a VRF with DHCP, where DHCP isn't running under the VRF? This would need to be a bug in the configuration interface of VyOS - where some race condition activates a bug where VyOS doesn't correctly handle dhclient and silently fails to run it under the specified VRF...? Thus the test is correct...?
I don't see a way the failure of the test could make the dhclient mismatch happen. The test tells VyOS to spawn dhclient in the specified VRF and VyOS spawns it, but not under the VRF? I don't see how the test could cause this - it's a race condition, but not in the test...
I think we need to go back a step and see how they even check for running processes.
For equuleus the function process_named_running is defined here: https://github.com/vyos/vyos-1x/blob/equuleus/python/vyos/util.py#L467
For sagitta it is now here: https://github.com/vyos/vyos-1x/blob/sagitta/python/vyos/utils/process.py#L207
They changed the process name comparison from p.info['name'] == name to name in p.info['name'].
I'd prefer the former to be honest. Otherwise you could have two process names where one is a subset of the other and get false positives.
But as the equuleus one is what I'd choose, this shouldn't be our problem.
This here is also not done in a good way if you ask me:
self.assertIn(str(dhclient_pid), vrf_pids)
Instead of creating a list of PIDs and testing whether the dhclient_pid is part of said list, they are comparing strings, and if the dhclient_pid is by chance (the chance may be slim) a substring of another PID, you'd get false positives again.
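A small stdlib illustration of that substring concern (made-up PIDs, not from the actual failure):

```python
# assertIn(str(pid), vrf_pids) does a substring search when vrf_pids is the
# raw command-output string, so a short PID can "match" inside a longer one.
vrf_pids = '18969 dhclient\n18999 dhclient-script'

assert '896' in vrf_pids  # substring false positive: no process 896 exists

# Parsing the output into an actual list of PID strings avoids that:
pid_list = [line.split()[0] for line in vrf_pids.splitlines()]
assert '896' not in pid_list
assert '18969' in pid_list
```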
I'm not familiar enough with psutil to say if this could be a problem or not:
https://psutil.readthedocs.io/en/latest/#psutil.process_iter
Every Process instance is only created once, and then cached for the next time psutil.process_iter() is called (if PID is still alive). Cache can optionally be cleared via process_iter.clear_cache().
In the test they are running:
self.cli_commit()

# Validate interface state
for interface in self._interfaces:
    . . .
    dhclient_pid = process_named_running(dhclient_process_name, cmdline=interface)
    self.assertTrue(dhclient_pid)
    # .. inside the appropriate VRF instance
    vrf_pids = cmd(f'ip vrf pids {vrf_name}')
    self.assertIn(str(dhclient_pid), vrf_pids)
Isn't that already a race condition? They are setting CLI settings, including the start of dhclient, say "commit", and directly start testing whether it already succeeded. Isn't it nearly guaranteed that everything is "in flux" while they test?
Do you know if self.cli_commit() waits for all tasks to be done and checks that the desired state is achieved before it returns?
EDIT:
The test tells VyOS to spawn dhclient in the specified VRF and VyOS spawns it, but not under the VRF?
But does dhclient actually spawn in the specified VRF, or does Linux create the process and then move it into a namespace? I mean, how is this achieved internally?
EDIT2: What happens if you "just" put a sleep 60 seconds or something like it after the commit?
It's not easy!
But as the equuleus one is what I'd choose, this shouldn't be our problem.
Yep, dhclient is dhclient, and the test also successfully finds other dhclients, so the process match is not the problem - and the wildcard comparison is not better, as you say.
Instead of creating a list of pids and testing if the dhclient_pid is part of said list, they are comparing strings and if the dhclient_pid is by chance (may be slim) a subset of another pid, you'd get false positives again.
They are actually doing roughly what you're suggesting - the assertIn checks membership, thus "PID" in ["PID1", "PID2"]; they just have the PIDs as strings (that's why you see the string conversion), so in effect it's still a PID == PID comparison, it just happens their collection of PIDs has string type. This is not the problem either, since the assert prints what it compared against and the PID isn't there at all... Thus this match also works correctly.
I'm not familiar enough with psutil to say if this could be a problem or not:
It could fail, throw an exception, or not find anything - I don't think it's likely that it misses just the one thing you're looking for while finding the others - that's a very unlikely failure mode.
Isn't that already a race condition? They are setting CLI settings, including start of dhclient, say, "commit", and directly start testing if it already succeeded. Isn't it nearly guaranteed, that everything is "in flux" while they test?
The dhclient is managed via systemd:
/lib/systemd/system/dhclient@.service
[Unit]
Description=DHCP client on %i
Documentation=man:dhclient(8)
ConditionPathExists=/var/lib/dhcp/dhclient_%i.conf
ConditionPathExists=/var/lib/dhcp/dhclient_%i.options
After=vyos-router.service
[Service]
WorkingDirectory=/var/lib/dhcp
Type=exec
EnvironmentFile=-/var/lib/dhcp/dhclient_%i.options
PIDFile=/var/lib/dhcp/dhclient_%i.pid
ExecStart=/sbin/dhclient -4 $DHCLIENT_OPTS
ExecStop=/sbin/dhclient -4 $DHCLIENT_OPTS -r
Restart=always
[Install]
WantedBy=multi-user.target
Thus vyos calls systemctl start or systemctl stop on services managed by this template to start/stop dhclient. The dhclient has -d, so it runs in the foreground.
The stop race condition is real and I did see it cause errors - you're asking the process to exit, but it could be waiting for IO, and thus if you are fast enough you will see the process still running after you call stop, since the process that manages the stopping exits independently of the main process. I would say this is a bug of the test, and thus a sleep here is justified.
I'm not familiar with Type=exec, but it should act like Type=simple from the start point of view.
The exec type is similar to simple, but the service manager will consider the unit started immediately after the main service binary has been executed.
Thus I believe, and my experience says, that systemctl start will block until the executable runs - so in the moment start returns to you, you should already have the process running. Do you see a race condition here? I don't.
Also, the test does find the process - thus it was started and it's running with the correct cmdline, but not in the VRF.
Do you know if self.cli_commit() waits for all tasks to be done and checks if the desired state is achieved, before it returns?
I mean, it does what it does. It doesn't intentionally run things asynchronously. It doesn't check anything - it just applies whatever you configured. It's possible the commit exits before everything is done - for sure - but not in the sense that the configuration isn't done. Like if it starts a service, then the service may not be ready yet, but the configuration should be, right?
But does dhclient actually spawn in the specified vrf, or does linux create the process and then move it to a namespace? I mean internally how this is achieved.
You launch the executable via vrf exec, like ip vrf exec THE_VRF dhclient .... They are launching the executable via systemd, thus they need to override the template to allow this - and that's what they do: they use this override template to prefix the dhclient command with ip vrf exec THE_VRF, and that's done before the systemctl start of dhclient.
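As a rough illustration of the mechanism (hypothetical interface and VRF names; the real file is generated from a Jinja2 override template in vyos-1x, not written by hand), such a systemd drop-in might look like:

```ini
# hypothetical drop-in: /run/systemd/system/dhclient@eth0.service.d/override.conf
[Service]
# An empty ExecStart= first clears the ExecStart inherited from the template
ExecStart=
ExecStart=/usr/sbin/ip vrf exec red /sbin/dhclient -4 -d eth0
```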
EDIT2: What happens if you "just" put a sleep 60 seconds or something like it after the commit?
I know how to solve it - you just retry to find the PID and you will find it eventually:
import time

retries = 3
while True:
    try:
        dhclient_pid = process_named_running(dhclient_process_name, cmdline=interface)
        self.assertTrue(dhclient_pid)

        # .. inside the appropriate VRF instance
        vrf_pids = cmd(f'ip vrf pids {vrf_name}')
        self.assertIn(str(dhclient_pid), vrf_pids)

        # and the commandline has the appropriate options
        cmdline = read_file(f'/proc/{dhclient_pid}/cmdline')
        self.assertIn('-e\x00IF_METRIC=210', cmdline)  # 210 is the default value
        break
    except AssertionError:
        retries -= 1
        if retries <= 0:
            raise
        time.sleep(5)
This seems to be enough, second try does find it.
I don't like to solve issues like this though - the systemctl stop case is clear, but for start I don't see where the race condition is, and you don't want to fix the test so much that it stops seeing actual errors... If you know the race condition and you know it doesn't affect real-world use cases, then you could implement an assert-with-retry in the test with a comment about why it's justified. But I feel like this is easily abused in a way that will hide actual errors.
If the commit really needs a wait or retry, then the commit should do it for the specific tasks where the race condition exists, right? You can always write scripts for a real use case where you would see these race conditions after commit, so adding workarounds for commit race conditions in the tests isn't really a solution.
I don't like sleep - if you added 60 seconds after each commit, you would wait a really sweet time for all those commits to complete. I think even sleep 5 slows things down way too much.
BTW: /usr/libexec/vyos/tests/smoke/cli/test_interfaces_bonding.py BondingInterfaceTest.test_dhcp_vrf eventually fails too, in multiple ways actually, including the dhclient PID mismatch - so the single test alone triggers this; you don't need the other tests in the bonding suite.
I don't like sleep - if you added 60 seconds after each commit, you would wait a really sweet time for all those commits to complete. I think even sleep 5 slows things down way too much.
I just meant to figure out if it would solve it, not saying we should solve it that way ...
My question was: does the commit command in fact not wait until everything is finished configuring, and return beforehand?
This seems to be where it is defined?! https://github.com/vyos/vyos-1x/blob/equuleus/smoketest/scripts/cli/base_vyostest_shim.py#L73C1-L77C29
def cli_commit(self):
    self._session.commit()
    # during a commit there is a process opening commit_lock, and run() returns 0
    while run(f'sudo lsof | grep -q {commit_lock}') == 0:
        sleep(0.250)
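For comparison only - a generic poll-until-released helper with an explicit timeout (a sketch, not actual VyOS code; `is_locked` stands in for the `lsof` check on `commit_lock`):

```python
import time

def wait_for_lock_release(is_locked, interval=0.25, timeout=60.0):
    """Poll is_locked() until it returns False; raise if it never does."""
    deadline = time.monotonic() + timeout
    while is_locked():
        if time.monotonic() >= deadline:
            raise TimeoutError('commit lock still held after timeout')
        time.sleep(interval)

# usage: simulate a lock that is released after the second poll
state = {'polls_left': 2}

def fake_lock_check():
    state['polls_left'] -= 1
    return state['polls_left'] > 0

wait_for_lock_release(fake_lock_check, interval=0.01)
print('lock released')
```

The difference from the cli_commit above is only the timeout: an endless `while` on `lsof` hangs the whole test run if the lock is never released.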
ConfigSession(os.getpid()) seems to be the type of object used for self._session. https://github.com/vyos/vyos-1x/blob/equuleus/python/vyos/configsession.py#L88
Its commit(): https://github.com/vyos/vyos-1x/blob/equuleus/python/vyos/configsession.py#L164C1-L166C19
It runs /opt/vyatta/sbin/my_commit via __run_command: https://github.com/vyos/vyos-1x/blob/equuleus/python/vyos/configsession.py#L131C1-L138C22
That is waiting for the completion of the command. What does my_commit do?
In the end it would perhaps still be a good idea to add a sleep of 60 seconds and see whether you can still reproduce the race condition. If not, then whatever is in flux due to the commit has not settled, and the race condition originates from some process not finishing before the commit function returns, apparently successfully?!
Is the return code actually checked? Or could it fail and the test doesn't notice?
My question was, does in fact the commit command not wait until everything is finished configuring and returns beforehand?
It's slow as hell - it needs to do something right?
That is waiting for the completion of the command. What does my_commit do?
The my_commit is THE COMMIT - if commit releases the lock before things are done, then it will break any script, including the tests or production scripts. Thus the test shouldn't try to fix this by implementing workarounds.
The binary is defined here. The callback is here and here. And this leads to doCommit() in commit-algorithm.cpp.
It does wait for the commit to complete - so yes, it does wait, and thus when commit releases the lock, it should be done.
Thus the test shouldn't try to fix this by implementing workarounds.
I'm not really familiar with the internals of VyOS. I also said I don't want to implement workarounds - you read that, right? All I wanted to test was whether whatever is invoked as a consequence of the commit command does, contrary to what you and I would expect, not finish before returning.
Does perhaps the exec systemd start type not wait for the modified ExecStart command in the way that we would need? I don't think dhclient forks, but does ip vrf ... do so? Is dhclient perhaps then a child process of ip, and would we need type simple instead?
https://github.com/vyos/vyos-1x/blob/equuleus/data/templates/dhcp-client/override.conf.j2
I mean this here:
ExecStart={{ vrf_command }}/sbin/dhclient -4 {{ dhclient_options }} {{ ifname }}
vyos/vyos-build@ee9b7396b2d83d19a782186369427aded05c2b82R14
This upstream change introduces a host name dependency on
local.deb.vyos.io. Perhaps two ways to solve:
Freexian does request you mirror their ELTS repository should you hit it frequently.