cloud-init initalisation

rowlandwatkins commented 7 years ago

Hi folks,

I'm getting some interesting errors on startup on AWS:

[[[ cloud-init failed: connect failed Operation canceled java.so: Starting JVM app using: io/osv/isolated/RunIsolatedJvmApp java.so: Setting Java system classloader to IsolatingOsvSystemClassLoader and logging manager to IsolatingLogManager ]]]

With the following commandline:

[[[ /java.so -Dlogback.configurationFile=/conf/logging/logback.xml -Dconfig.file=/conf/application.conf -cp /lib/* akka.kernel.Main com.blah.Main ]]]

If I include remote JMX remote, I'd get the following:

[[[ Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: : : Name does not resolve ]]]

cloud-init is initialised as:

/init/00-cmdline: [[[ /usr/mgmt/cloud-init.so --force-probe ]]]

I'm wondering if my approach to initialising cloud-init is now out of date... Any tips to getting cloud-init back up?

Cheers,

Rowland

nyh commented 7 years ago

The "Operation Canceled" (ECANCELED) message is fishy, because nothing uses this errno. Looking at modules/cloud-init/client.cc, connect_with_timeout(), maybe this is what we see in case of timeout? Maybe you can add printouts to confirm this. I'm not familiar with the Boost ASIO library to tell.

I seem to remember discussing with @justinc1 something about the initialization order of cloud-init and httpserver - I think this code tries to communicate with httpserver, which might not be alive (yet?). Maybe this has something to do with this?

Can you run the kernel with --verbose (e.g., run.py -V) and maybe see warnings about missing libraries? Does your image include httpserver?

rowlandwatkins commented 7 years ago

@nyh Thanks for the info!

OK, while I'm adding printouts, here's the build settings:

[[[ ./scripts/build image=httpserver,cloud-init,java,cli fs_size_mb=1024 jdkbase=/opt/tools/jdk1.8.0_131 ]]]

So I think httpserver is there, but not sure beyond that. It looks like there's been a load of changes late last year that could have moved things around.

Can I do anything with my AWS kernel to increase verbosity, like from capstan?

Cheers,

Rowland

nyh commented 7 years ago

Yes, it does look like you have httpserver there. https://github.com/cloudius-systems/osv/issues/830 discussed the order issue, but I think it should work - I don't think we changed any of this in the last year.

Regarding the empty host name error you saw - @justinc1 did change this area recently - see for example https://github.com/cloudius-systems/osv/issues/866 - so maybe there's still a bug? OSv has a silly default hostname ("osv") and we shouldn't override it with an empty string (that we supposedly got from cloud-init? or what?).

If you run your kernel locally once with "run.py -V" (before you convert it to the format you need for AWS) it will change the command line in the image to include "--verbose" for you. I am not familiar enough with Capstan to tell you how to do it in Capstan. But anyway, it seems like you do have httpserver there, so you won't see missing libraries or someting like I thought you might see.

justinc1 commented 7 years ago

There was no code commited to "fix" #830 - the problem was only that I didn't understood how can cloud-init use httpserver API, if httpserver is started after cloud-init. Nadav explained in https://github.com/cloudius-systems/osv/issues/830#issuecomment-272814748, that cloud-init links with httpserver library, so it has "embedded httpserver".

Empty hostname: I see two places where cloud-init and/or dhpc uses sethostname: https://github.com/cloudius-systems/osv/blob/master/modules/cloud-init/cloud-init.cc#L29 https://github.com/cloudius-systems/osv/blob/master/core/dhcp.cc#L724 Both do check for string length, so new name should not be empty. But, it could still be invalid DNS name. On second reading of the issue - I guess when @nyh said "empty hostname", he was referring to "Local host name unknown" and "Name does not resolve" error?

rowlandwatkins commented 7 years ago

Hi @nyh ,

The plot thickens - here's output from scripts/run.py -V:

[[[ [osv-daily] Running shell script

./scripts/run.py -V warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5] warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5] warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5] warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5] OSv v0.24-362-gf29dba0 4 CPUs detected Firmware vendor: SeaBIOS bsd: initializing - done VFS: mounting ramfs at / VFS: mounting devfs at /dev net: initializing - done vga: Add VGA device instance eth0: ethernet address: 52:54:00:12:34:56 virtio-blk: Add blk device instances 0 as vblk0, devsize=1073741824 random: virtio-rng registered as a source. random: <Software, Yarrow> initialized VFS: unmounting /dev VFS: mounting zfs at /zfs zfs: mounting osv/zfs from device /dev/vblk0.1 random: device unblocked. VFS: mounting devfs at /dev VFS: mounting procfs at /proc program zpool.so returned 1 BSD shrinker: event handler list found: 0xffffa00001716c80 BSD shrinker found: 1 BSD shrinker: unlocked, running [I/42 dhcp]: Broadcasting DHCPDISCOVER message with xid: [431505883] [I/244 dhcp]: Received DHCPOFFER message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.15 [I/244 dhcp]: Broadcasting DHCPREQUEST message with xid: [431505883] to SELECT offered IP: 192.168.122.15 [I/244 dhcp]: Received DHCPACK message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.15 [I/244 dhcp]: Server acknowledged IP 192.168.122.15 for interface eth0 with time to lease in seconds: 86400 eth0: 192.168.122.15 [I/244 dhcp]: Configuring eth0: ip 192.168.122.15 subnet mask 255.255.255.0 gateway 192.168.122.1 MTU 1500 [I/42 dhcp]: Waiting for IP... Running from /init/00-cmdline: /usr/mgmt/cloud-init.so;

Running from /init/30-auto-00: /libhttpserver.so &! could not load libicui18n.so.55 could not load libicuuc.so.55 could not load libicui18n.so.55 could not load libicuuc.so.55 [6ncould not load libssl.so.1.0.0 could not load libcrypto.so.1.0.0 ]]]

it then hangs.

I don't know if these "could not loads" are the issue but there might be something here...

Cheers,

Rowland

rowlandwatkins commented 7 years ago

OK, slightly misleading since it appears jenkins can't cope with scripts/run.py entering an interactive prompt, please ignore my last comment!

rowlandwatkins commented 7 years ago

So, running my app image in QEMU + scripts/ec2-simulator.py on my local machine, yields the following:

[[[ capstan run ]]]

[[[ eth0: 192.168.122.15 eth0: can't handle af168 cloud-init failed: connect failed Address family not supported by protocol [snip] ]]]

This is obviously a different environment from AWS, so this could easily be due to capstan/QEMU networking - I've with and without bridging and get the same result: the VM can talk to the EC2 simulator (performs a GET), but nothing else happens.

Running on our build server yields the same result - I can't seem to get the verbose-ness to remain when running this image as an instance on AWS...

Reading up on the "af168" and "Address family not supported by protocol", this suggests an IPv6 issue :/

Cheers,

Rowland

nyh commented 7 years ago

The missing libicu* libraries are fine - this is issue #600. The missing ssl is stranger, because httpserver does include it - but doesn't really explain any of the problem we're seeing. Address familiy "168" is not IPv6 - it's not anything reasonable. Looks like random junk? Perhaps you can figure out which code in cloud-init fills it, maybe it has a bug?

Are you running vanilla OSv, or something with a lot of your own patches? If it's the latter maybe you can try the clean OSv to see if you can reproduce this?

By the way, as a hack, you can edit loader.cc and initialize opt_verbose to true, so you don't need to set it manually.

nyh commented 7 years ago

@justinc1 about the "empty hostname" I probably misread the error - @rowlandwatkins reported the following error:

Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: : : Name does not resolve

I thought all these colons with nothing between them mean that we got an empty name, and that empty name "does not resolve". But maybe the problem isn't an empty name, but just some arbitrary name (I think we used "osv.local" or something), which, unsurprisingly, cannot be resolved in the DNS. However, I don't understand how this is a new error that didn't happen before? Would be good to check what this failing Java code is actually trying to do.

justinc1 commented 7 years ago

While I can compile and run scala-example, this doesn't reproduce the problem. The akka app is not there, and I was running on local machine, not on AWS. I imagine that adding printf into gethostname() might show if some temporal invalid value is there. There is no HOSTNAME env var in OSv, so we don't need to check that (well, that could be a problem, but it should not be a new problem).

Greping trough scala/akka code for pieces of error message didn't show much to me:

justin_cinkelj@jcpc:~/devel/mikelangelo/osv-fc25$ grep -RnI 'Local host name unknown' akka/ scala-2.10.6/
justin_cinkelj@jcpc:~/devel/mikelangelo/osv-fc25$ grep -RnI 'java.net.UnknownHostExceptio' akka/ scala-2.10.6/
akka/akka-cluster/src/multi-jvm/scala/akka/cluster/ClusterDeathWatchSpec.scala:52:      system.eventStream.publish(Mute(EventFilter[java.net.UnknownHostException]()))
scala-2.10.6/test/disabled/presentation/akka/src/com/eaio/uuid/UUIDGen.java:38:import java.net.UnknownHostException;

rowlandwatkins commented 7 years ago

@justinc1 the MalformedURLException is related when adding JMX, so I've disabled that for now...

I'm still chasing the option IPv6 is being used to open a connection to the cloud-init IP address. I've made some progress, it's happening in modules/cloud-init/client.cc#connect_with_timeout. There looks to be something going on with how the _endpoint is being initialised. Unfortunately, I'm not certain how C++ templating works so I'm having trouble verifying the _endpoint.protocol() (InternetProtocol::v6() or InternetProtocol::v6()). Any tips on how to reference InternetProtocol would be highly appreciated!

Cheers,

Rowland

nyh commented 7 years ago

Hi @rowlandwatkins I'm not familiar myself with boost::asio, but I would assume that if the IP address is a numeric IPv4 address (a so-called "dotted quad"), it would assume it is an IPv4. I also assume that you didn't give it an IPv6 numeric address (you would have noticed that). Perhaps you gave it a name, and your name server returned an IPv6 address? If you can reproduce this problem, you can add printouts inside cloud-init/client.cc and inside OSv to see what address is being connected and why, to hopefully understand what's going on. Good luck!

rowlandwatkins commented 7 years ago

@nyh all logs suggest the IP address resolved is IPv4. There hasn't been any changes to the OSv codebase except adding verbosity and cout statements. It's just a little odd that cloud-init refuses to work locally or on AWS. I did wonder if there's been major changes in boost since my last image from last year. I guess no one else has had an issue with cloud-init on AWS?

Cheers,

Rowland

justinc1 commented 7 years ago

refuses to work locally

Is it possible to reproduce this without scala/akka app? Can you send commands required to reproduce, from "git clone https://github.com/cloudius-systems/osv" onward?

rowlandwatkins commented 7 years ago

@justinc1 Yes, having the commandline just running Java without any arguments produces the same result. Here's my build procedure:

1) "git clone https://github.com/cloudius-systems/osv" 2) "git submodule update --init --recursive" 3) "./scripts/build image=httpserver,cloud-init,java,cli fs_size_mb=1024 jdkbase=/opt/tools/jdk1.8.0_131" 4) "capstan import ${version} build/release.x64/usr.img"

If I don't have the EC2 sim, then I'll get:

[[[ cloud-init failed: Failed getting cloud-init information No data source matched ]]]

With simulator running:

[[[ eth0: can't handle af168 cloud-init failed: connect Address family not supported by protocol ]]]

Cheers,

Rowland

nyh commented 7 years ago

The IPv6 address family is 10, not 168. If we're trying address family 168, it looks like we have a memory overflow, or use-before-set bug or something of this sort. Perhaps you can insert an abort() in the place in OSv which prints the "can't handle af" message, and in the debugger see what code path reached this, and try to figure out how.

rowlandwatkins commented 7 years ago

OK, digging further:

eth0: can't handle af168 is called in bsd/sys/net/if_ethersubr.cc#ether_output

There's a switch over bsd_sockaddr sa_family which doesn't seem to get matched at all:

[[[ default: if_print(ifp, "can't" handle af%d\n", dst->sa_family); senderr(EAFNOSUPPORT); ]]]

Successful matches would include INET and INET6. So the symptom here appears to be the address not having a valid sa family.

So, in all this, cloud-init.so --force-probe will probe then fallover when getting the metadata over HTTP due to this sockaddr family issue. bizarre and will keep digging.

Cheers, Rowland

nyh commented 7 years ago

Yes, I know that that code prints this message (this is the only "grep" match for this message), but of course the more interesting question is who set up this broken address, or what broke it. I'm starting to worry that maybe we're seeing here some bizarre case of version mismatch of Boost. Could you please try "scripts/build clean" (to clean up all the apps and modules, not just the OSv kernel) and then build again, to see if maybe that would help?

rowlandwatkins commented 7 years ago

OK, so I've narrowed this down to bsd/sys/netinet/ip_output.cc#ip_output - somehow the dst->sin_family gets overwritten from 2 (AF_INET) to 168.

Funnily enough, with some strategic print statements I've had cloud-init working on some boots, but it's now failing again inconsistently.

I've narrowed it down to the following code (in the sense that dst->sin_family gets mutated from 2 to 168:

287-292 [[[ if (rte->rt_flags & RTF_GATEWAY) dst = (struct bsd_sockaddr_in *)rte->rt_gateway; if (rte->rt_flags & RTF_HOST) isbroadcast = (rte->rt_flags & RTF_BROADCAST); else isbroadcast = in_broadcast(dst->sin_addr, ifp); ]]]

rte->rt_gateway->sa_family is 168. Soooo, I'm not sure how that's happening......

Cheers,

Rowland

nyh commented 7 years ago

Vilmos Nebehaj thinks this issue may have been solved by commit d9a87715c8ce545b4399853088eb3767565f43a7 - He said on the mailing list:

I think it's not the destination address, but the interface address in rtentry->rt_ifa. It's a pointer in struct rtentry, so memcpy() in route_cache::lookup() will copy the address, but then when the interface address is released (via SIOCDIFADDR), the pointer in the route_cache entry will become stale. But I only tested that invalidating the route_cache after SIOCDIFADDR fixed the connection problem in cloud-init, have not debugged rt_ifa any further.

justinc1 commented 7 years ago

I was just reading comments at start of routecache.hh - I got independently complain about "can't handle afX" and cloud-init not working. I agree the routeing cache is part of problem, as I can avoid problem if either:

dhcp.cc doesn't call osv::start_if for a second time (when it gets a second DHCPACK as part of DHCP renew), or
I comment out cache search in route_cache::lookup(), so that only 'slow lookup' is used.

Now I see I don't need to implement the invalidate(), as @ldx already did it. Huge thank to both of you, it was literary just in time. I will test it on our openstack now :)

justinc1 commented 7 years ago

Yes, cloud-init works now.

nyh commented 7 years ago

Excellent! So I'm closing this issue (please reopen if there's still a problem - there was a lot of discussion on this issue before we focused on the routing table thing, I'm not sure if anything in the beginning of the discussion is still relevant). Thanks Vilmos for fixing it.

yieldone commented 7 years ago

@justinc1 @nyh this is awesome, thanks so much! Will test this out.

cloudius-systems / osv

cloud-init initalisation #873