Closed stgraber closed 4 years ago
Hi, another UT Austin student checking in. I'd like to claim this issue.
@DBaum1 assigned to you now, thanks!
I have been having trouble running the tests for lxd.
My information:

- OS: Ubuntu 18.04
- Kernel: 4.15.0-69-generic
- LXD version: 3.18
- LXC version: 3.18

I have successfully generated the lxc and lxd binaries.
Whenever I try to run the integration tests with `sudo -E ./main.sh` from the test directory, I receive the message `Missing dependencies: lxd lxc`. Running `sudo -E make check` from the repository root fails with:
```
--- PASS: TestVersionTestSuite/TestString (0.00s)
PASS
ok github.com/lxc/lxd/shared/version (cached)
? github.com/lxc/lxd/test/deps [no test files]
? github.com/lxc/lxd/test/macaroon-identity [no test files]
Makefile:142: recipe for target 'check' failed
make: *** [check] Error 1
```
I have not made any modifications to the LXD source; this is a clean copy.
You might need to add `~/go/bin` to your `$PATH`. At least that's what I have.
`sudo` can be a weird beast; `sudo -E abc` doesn't behave the same as running `sudo -E -s` and then running `abc`, at least it doesn't for me.
I usually run `sudo -E -s` and then run `make check` from there. It also makes it easier for you to check whether the environment variables are indeed properly applied.
Thank you for your suggestions, I am now able to run the tests. However, I am still experiencing problems with `sudo -E ./main.sh`, since the result is:
```
==> TEST DONE: static analysis
==> Test result: failure
```
I also get these warnings:

```
WARN[11-13|22:23:32] Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored.
WARN[11-13|22:23:32] CGroup memory swap accounting is disabled, swap limits will be ignored.
```
And notifications that various things are undefined:
```
# _/home/dinah/lxd/lxd/db_test
lxd/db/containers_test.go:453:14: tx.Tx undefined (type *db.ClusterTx has no field or method Tx, but does have db.tx)
lxd/db/migration_test.go:124:38: too many errors
...
lxd/cluster/gateway_test.go:130:17: undefined: cluster.TLSClientConfig
lxd/cluster/gateway_test.go:160:23: gateway.RaftNodes undefined (type *cluster.Gateway has no field or method RaftNodes)
lxd/cluster/heartbeat_test.go:150:43: target.Cert undefined (type *cluster.Gateway has no field or method Cert, but does have cluster.cert)
lxd/cluster/heartbeat_test.go:164:28: gateway.IsLeader undefined (type *cluster.Gateway has no field or method IsLeader, but does have cluster.isLeader)
lxd/cluster/raft_test.go:20:19: too many errors
# _/home/dinah/lxd/lxd/db/query_test
lxd/db/query/dump_test.go:54:12: undefined: query.DumpParseSchema
lxd/db/query/dump_test.go:56:15: undefined: query.DumpTable
lxd/db/query/dump_test.go:169:19: undefined: query.DumpSchemaTable
# _/home/dinah/lxd/lxd/endpoints_test
lxd/endpoints/cluster_test.go:18:30: endpoints.Up undefined (type *endpoints.Endpoints has no field or method Up, but does have endpoints.up)
lxd/endpoints/cluster_test.go:20:40: not enough arguments in call to httpGetOverTLSSocket
lxd/endpoints/cluster_test.go:20:50: endpoints.NetworkAddressAndCert undefined (type *endpoints.Endpoints has no field or method NetworkAddressAndCert)
```
Ah yeah, this tends to happen if you're not storing your code in `~/go/src/github.com/lxc/lxd`; Go is weirdly picky about it.
In most cases it's really much easier to give up on storing things where you want, and instead keep your working copy of LXD at the expected spot in the Go path (`~/go/src/github.com/lxc/lxd`) and run the tests from there too.
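For reference, a minimal sketch of getting a working copy into the expected spot. This is an illustration, not part of the thread; the clone line is commented out so you can substitute your own fork:

```shell
# Create the expected Go workspace layout for the LXD source tree.
mkdir -p "$HOME/go/src/github.com/lxc"

# Clone (or move) your working copy into the expected spot, e.g.:
# git clone https://github.com/lxc/lxd "$HOME/go/src/github.com/lxc/lxd"

# Make sure Go tools and the generated binaries are found.
export GOPATH="$HOME/go"
export PATH="$GOPATH/bin:$PATH"

echo "$GOPATH"
```

Running the testsuite from inside that directory then avoids the "undefined" errors caused by Go resolving the wrong copy of the packages.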
Thank you, I am now running it from `~/go/src/github.com/lxc/lxd` and my GOPATH is `/home/dinah/go`. I have also added `~/go/bin` to my `$PATH`, and that solved the problems with undefined symbols. However, I am still getting an error from:
```
ok github.com/lxc/lxd/shared (cached)
? github.com/lxc/lxd/shared/api [no test files]
? github.com/lxc/lxd/shared/cancel [no test files]
? github.com/lxc/lxd/shared/cmd [no test files]
? github.com/lxc/lxd/shared/containerwriter [no test files]
? github.com/lxc/lxd/shared/dnsutil [no test files]
? github.com/lxc/lxd/shared/eagain [no test files]
? github.com/lxc/lxd/shared/generate [no test files]
...
? github.com/lxc/lxd/test/deps [no test files]
? github.com/lxc/lxd/test/macaroon-identity [no test files]
make: *** [Makefile:152: check] Error 1
```
It seems the script is not able to locate some test files, even though I have confirmed that there are files (I guess not test ones?) located at those paths. It also fails `TestConvertNetworkConfig`, where it receives "unexpected error: creating the container failed." `github.com/lxc/lxd/lxc-to-lxd` also fails. Is there a certain LXD configuration I should be using before running the tests?
@DBaum1 can you show the full output for that? What you listed above looks correct other than Makefile failing.
Thanks, I'm getting the same result here. We should probably fix that test to be skipped when non-root.
Anyway, your best bet is to do:
```
sudo -E -s
make check
```
Which will then run that test as root and should go fine (it does here anyway).
@stgraber The result is the same when run as root.
That's weird. Anyway, I'd say don't worry about that too much; what's more interesting for this change is going to be the system tests.
Do those work for you if you do:
```
sudo -E -s
cd test
LXD_TMPFS=1 LXD_VERBOSE=1 ./main.sh
```
That fails as well.
I am not running it in a VM or container.
What OS and kernel is that system running?
The error above looks suspiciously like a kernel bug from the 5.1 kernel.
That might be the problem - I updated my system a couple days ago, so now I'm running Ubuntu 19.04, kernel v5.0.0-36-generic. I can revert to an earlier version.
Hmm, no, that should be fine, it's very similar to what we run on Jenkins actually.
Ah, I wonder if it's just path traversal being a problem. Can you try running `chmod +x /home/dinah` and see if that takes care of the problem?
`chmod +x /home/dinah` seems to help a lot - it is getting past basic usage and failing on `container devices - nic - bridged`.
Do you have a lxdbr0 bridge on your system? We'd normally expect the testsuite to work without it but it may have regressed in that regard without us noticing.
On the upside you're way past the clustering tests :)
I do not have an lxdbr0 bridge. After commenting out some tests, I've found that I am able to run all tests except `container devices - nic - bridged` (as stated above), `id mapping`, `migration`, and `attaching storage volumes`.
I'm not sure how much of a deal breaker not being able to run those tests is.
I'd just like to say thank you so much for your help and responsiveness - my machine has been causing no end of problems.
idmap and attaching storage volumes are both caused by subuid/subgid issues. You can fix that by just deleting `/etc/subuid` and `/etc/subgid`.
The migration tests are failing because of CRIU being its usual unreliable self; we usually do not have CRIU installed on our test systems, so I'd say just `apt-get remove criu` and that should fix it.
I finally got all the tests to work! Your advice, plus changing the entry for lxd/root in `/etc/sub{g,u}id` to:

```
root:1000000:1000000000
lxd:1000000:1000000000
```

fixed everything! Thanks!
Excellent!
Registering the native architecture of each server in the `nodes` database table and then running `make update-schema` results in `make` failing on `lxd-p2c` with `undefined: sqlite3.ErrLocked` and `undefined: sqlite3.ErrBusy`.
It doesn't seem to be caused by my changes, since running `make update-schema` on a clean copy causes `make` to fail in the same fashion.
It also causes `make lxd-agent` to fail with:
```
test@liopleurodon:~/go/src/github.com/lxc/lxd$ make lxd-agent
go install -v -tags agent ./lxd-agent
github.com/lxc/lxd/lxd/db/cluster
github.com/lxc/lxd/lxd/db
# github.com/lxc/lxd/lxd/db
lxd/db/instances.mapper.go:205:10: undefined: ClusterTx
lxd/db/instances.mapper.go:205:41: undefined: InstanceFilter
lxd/db/instances.mapper.go:205:60: undefined: Instance
make: *** [Makefile:32: lxd-agent] Error 2
```
Rebase on master, you're missing a small fix I sent yesterday to avoid this.
Hello @DBaum1, I've just pushed a PR that will make implementing this feature a bit easier.
Essentially, I did the groundwork for adding a new database table, so now there's a new internal API `NodeAddWithArch()` that can be used to insert a new node into the database, specifying its architecture. The architecture must be an integer value among the ones we support; see `shared/osarch/architectures.go` for a detailed list of them.
You'll want to look at the `cluster.Accept()` function in `lxd/cluster/membership.go`, around line 184. That's the point where `NodeAddWithArch()` is used. At the moment any new node will be accepted using the same architecture as the accepting node (which is already part of the cluster). What needs to change is that the joining node should communicate its architecture to the accepting node, so we add the node to the database using its actual architecture. In order to do that you'll need to extend the REST API, adding a new `Arch` field to the `ClusterPut` structure in `shared/api/cluster.go`. The joining node will need to fill the `Arch` field when posting the accept request to the accepting node.
Please feel free to ask questions.
@freeekanayaka
I'm looking to clarify a few things:
I've been looking at `lxd/api_cluster.go`, and it seems that the `Arch` field of the `ClusterPut` structure would best be filled with `ArchitectureGetLocalID()` in either `clusterPut()` or `clusterPutJoin()`, which joins a new node to an existing cluster. `clusterAcceptMember()` is then invoked, and `internalClusterPostAccept()` is called before the new node is accepted by the accepting node in `cluster.Accept()`.
I'm also wondering about the best way to communicate the joining node's architecture to the accepting node. The parameters of `cluster.Accept()` include a pointer to a `state.State` struct which (among other things) has `Node *db.Node`, `Cluster *db.Cluster`, and `OS *sys.OS`. When the `Arch` field of the `ClusterPut` struct is filled when posting the accept request, is that field accessible (or could it be made accessible by adding a helper function) through `db.Node`, `db.Cluster`, or `OS`? Should I add it as a parameter?
@DBaum1 I think my previous comment was a bit inaccurate.
The new `Arch` field should be added to the `internalClusterPostAcceptRequest` struct, not to `ClusterPut`.
Then inside `clusterAcceptMember()`, which is invoked on the joining node, you can call `ArchitectureGetLocalID()` to fill the `Arch` field of the `internalClusterPostAcceptRequest` object that is about to be serialized and sent to the accepting node.
Finally, inside `internalClusterPostAccept()`, which is invoked on the accepting node, you have to pass `req.Arch` as a new parameter of `cluster.Accept()`, which in turn will pass it to `tx.NodeAddWithArch()` when creating the new node database object.
Hope that's clearer.
Thank you for the clarification!
Hi @stgraber @freeekanayaka
Would it be best to add an option to `lxc launch` that allows the user to specify the architecture they want to deploy their image to?
Also, any suggestions for testing this feature?
@DBaum1 there's no need for an additional option, because the architecture is implied by the image you choose. Probably the best way to test it is to use a couple of VMs of different architectures. I don't think we're going to have a way to perform automated testing of this (under `test/suite`), so for now we'll be happy with manual testing only.
Right, testing would effectively be:
```
lxc launch ubuntu:18.04/arm64 c1
lxc launch ubuntu:18.04/arm64 c2
lxc launch ubuntu:18.04/amd64 c3
lxc launch ubuntu:18.04/amd64 c4
lxc launch ubuntu:18.04 c5
lxc launch ubuntu:18.04 c6
```
Given our normal balance algorithm this should land us with the aarch64 server running c1, c2 and c5 and the x86_64 running c3, c4 and c6.
For your local testing, you may be able to do this by using a cluster made of a normal x86_64 laptop and a Raspberry Pi or similar ARM development board. If you can't find an aarch64 Ubuntu image for whatever board you have, you could do the same on armv7l by using `/armhf` rather than `/arm64`.
Alternatively if you have easy access to AWS instances, they provide aarch64 instances on there too which should let you run both an x86_64 and aarch64 instance on the same virtual network and test clustering on there.
Or if neither of those options are readily available to you, just send a pull request and after we've done the first bit of review on it, I'll test it on some hardware I have around here.
@freeekanayaka
Should I use the image alias to extract the image's architecture? An image's architecture doesn't seem to be available until after `containersPost()` and `NodeWithLeastContainers()` are called, but its alias is. My thought was to create a modified version of `nodes()` which only returns the cluster nodes of a certain architecture.
@DBaum1 correct, you can use the image alias to get the image's architecture (the image alias is the "name" column in the "images_aliases" table, which also contains an "image_id" pointing to a row in the "images" table, which in turn contains the architecture).
Modifying `nodes()` to filter by architecture seems reasonable to me.
Seeking clarification: the images table has been initialized by the time `containersPost()` is called. So, at that point, could I query the `images_aliases` table for the architecture with one of the query functions and pass the result to `getNodeWithLeastContainers()`?
Hmm, so this is a bit more complicated than I expected.
In `containersPost` you may not have `req.Architecture` set already; in fact, in most cases it will not be set (our CLI never seems to set it).
So I think that your initial implementation was right, but we need to make it more global.
We need to remove the logic from `containersPost`, keeping only the part which handles an already set `targetNode`.
Then add target selection logic to each of:

- `createFromImage` => select based on the architecture of the image
- `createFromNone` => if `Architecture` is set in `req`, then go there, otherwise pick the least occupied node
- `createFromMigration` => this one will have the `Architecture` in `req`; if not, we should fail
- `createFromCopy` => place on the same node as the source (so we don't waste bandwidth and disk)

It's not ideal for the `createFromImage` case, as an ambiguous `ubuntu:18.04` will resolve to something different based on the architecture of the server doing the resolving, but it will at least make `ubuntu:18.04/amd64` and `ubuntu:18.04/arm64` work properly.
What we'll eventually want, I suspect, is an extra simplestreams function which lets us know what architectures a particular alias is valid for, so we could feed it `18.04` and it would give us a slice of architectures; we can then use that to select a suitable node based on least busy. But it's fine to ignore that part for now, it's an optimization we can sort out later.
So "select based on the architecture of the image" is still a bit of a problem, as you don't know the architecture of the image from `req` and don't get to query it until after the image has been retrieved, so you may end up with an image of the wrong architecture that then needs to be internally transferred to a suitable node.
Focus on the others for now, as those should be easy, and I'll try to send a branch later today which adds a LXD function you can hit with a `req.Source` and be told what architectures are compatible with it.
So the function I'm writing will actually be suitable for being called directly from `containersPost`.
It takes a `req` and figures out where it can go for all types of sources.
Got everything covered except for the interesting remote image types; will sort that out after lunch.
@DBaum1 I'm not quite done with this yet but I've pushed what I have in a draft PR here: https://github.com/lxc/lxd/pull/6585
This at least shows you the new function that LXD will have, so you can integrate it with your work.
As this isn't used anywhere in LXD, it's quite untested so may well need some fixes before it behaves.
I'm now doing the simplestreams side of this, making sure that for aliases that point to multiple images I do get a properly filled map. Then I'll port our caching code across and rebase `daemon_images.go` to use that simpler client, but none of that should impact you.
I've got the simplestreams side sorted now in my branch, looking at caching next, but that's really just an optimization as far as you're concerned.
I've now ported the built-in cache over to the new mechanism, waiting for jenkins to be happy with my branch then it can land.
LXD clustering can be used to turn multiple LXD servers into one large instance. Right now, this assumes that all servers in the cluster are of the same architecture.
While that's certainly the common case, there are times where it would be useful to have a single LXD cluster which supports multiple architectures, usually a mix of Intel and Arm hardware.
To make this possible, we'd need to:

- Record the native architecture of each server in the `nodes` database table