Closed rowlandwatkins closed 7 years ago
The "Operation Canceled" (ECANCELED) message is fishy, because nothing uses this errno. Looking at modules/cloud-init/client.cc, connect_with_timeout(), maybe this is what we see in case of timeout? Maybe you can add printouts to confirm this. I'm not familiar with the Boost ASIO library to tell.
I seem to remember discussing with @justinc1 something about the initialization order of cloud-init and httpserver - I think this code tries to communicate with httpserver, which might not be alive (yet?). Maybe this has something to do with this?
Can you run the kernel with --verbose (e.g., run.py -V) and maybe see warnings about missing libraries? Does your image include httpserver?
@nyh Thanks for the info!
OK, while I'm adding printouts, here's the build settings:
[[[ ./scripts/build image=httpserver,cloud-init,java,cli fs_size_mb=1024 jdkbase=/opt/tools/jdk1.8.0_131 ]]]
So I think httpserver is there, but not sure beyond that. It looks like there's been a load of changes late last year that could have moved things around.
Can I do anything with my AWS kernel to increase verbosity, like from capstan?
Cheers,
Rowland
Yes, it does look like you have httpserver there. https://github.com/cloudius-systems/osv/issues/830 discussed the order issue, but I think it should work - I don't think we changed any of this in the last year.
Regarding the empty host name error you saw - @justinc1 did change this area recently - see for example https://github.com/cloudius-systems/osv/issues/866 - so maybe there's still a bug? OSv has a silly default hostname ("osv") and we shouldn't override it with an empty string (that we supposedly got from cloud-init? or what?).
If you run your kernel locally once with "run.py -V" (before you convert it to the format you need for AWS) it will change the command line in the image to include "--verbose" for you. I am not familiar enough with Capstan to tell you how to do it in Capstan. But anyway, it seems like you do have httpserver there, so you won't see missing libraries or someting like I thought you might see.
There was no code commited to "fix" #830 - the problem was only that I didn't understood how can cloud-init use httpserver API, if httpserver is started after cloud-init. Nadav explained in https://github.com/cloudius-systems/osv/issues/830#issuecomment-272814748, that cloud-init links with httpserver library, so it has "embedded httpserver".
Empty hostname: I see two places where cloud-init and/or dhpc uses sethostname: https://github.com/cloudius-systems/osv/blob/master/modules/cloud-init/cloud-init.cc#L29 https://github.com/cloudius-systems/osv/blob/master/core/dhcp.cc#L724 Both do check for string length, so new name should not be empty. But, it could still be invalid DNS name. On second reading of the issue - I guess when @nyh said "empty hostname", he was referring to "Local host name unknown" and "Name does not resolve" error?
Hi @nyh ,
The plot thickens - here's output from scripts/run.py -V:
[[[ [osv-daily] Running shell script
Running from /init/30-auto-00: /libhttpserver.so &! could not load libicui18n.so.55 could not load libicuuc.so.55 could not load libicui18n.so.55 could not load libicuuc.so.55 [6ncould not load libssl.so.1.0.0 could not load libcrypto.so.1.0.0 ]]]
it then hangs.
I don't know if these "could not loads" are the issue but there might be something here...
Cheers,
Rowland
OK, slightly misleading since it appears jenkins can't cope with scripts/run.py entering an interactive prompt, please ignore my last comment!
So, running my app image in QEMU + scripts/ec2-simulator.py on my local machine, yields the following:
[[[ capstan run ]]]
[[[ eth0: 192.168.122.15 eth0: can't handle af168 cloud-init failed: connect failed Address family not supported by protocol [snip] ]]]
This is obviously a different environment from AWS, so this could easily be due to capstan/QEMU networking - I've with and without bridging and get the same result: the VM can talk to the EC2 simulator (performs a GET), but nothing else happens.
Running on our build server yields the same result - I can't seem to get the verbose-ness to remain when running this image as an instance on AWS...
Reading up on the "af168" and "Address family not supported by protocol", this suggests an IPv6 issue :/
Cheers,
Rowland
The missing libicu* libraries are fine - this is issue #600. The missing ssl is stranger, because httpserver does include it - but doesn't really explain any of the problem we're seeing. Address familiy "168" is not IPv6 - it's not anything reasonable. Looks like random junk? Perhaps you can figure out which code in cloud-init fills it, maybe it has a bug?
Are you running vanilla OSv, or something with a lot of your own patches? If it's the latter maybe you can try the clean OSv to see if you can reproduce this?
By the way, as a hack, you can edit loader.cc and initialize opt_verbose to true, so you don't need to set it manually.
@justinc1 about the "empty hostname" I probably misread the error - @rowlandwatkins reported the following error:
Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: : : Name does not resolve
I thought all these colons with nothing between them mean that we got an empty name, and that empty name "does not resolve". But maybe the problem isn't an empty name, but just some arbitrary name (I think we used "osv.local" or something), which, unsurprisingly, cannot be resolved in the DNS. However, I don't understand how this is a new error that didn't happen before? Would be good to check what this failing Java code is actually trying to do.
While I can compile and run scala-example, this doesn't reproduce the problem. The akka app is not there, and I was running on local machine, not on AWS. I imagine that adding printf into gethostname() might show if some temporal invalid value is there. There is no HOSTNAME env var in OSv, so we don't need to check that (well, that could be a problem, but it should not be a new problem).
Greping trough scala/akka code for pieces of error message didn't show much to me:
justin_cinkelj@jcpc:~/devel/mikelangelo/osv-fc25$ grep -RnI 'Local host name unknown' akka/ scala-2.10.6/
justin_cinkelj@jcpc:~/devel/mikelangelo/osv-fc25$ grep -RnI 'java.net.UnknownHostExceptio' akka/ scala-2.10.6/
akka/akka-cluster/src/multi-jvm/scala/akka/cluster/ClusterDeathWatchSpec.scala:52: system.eventStream.publish(Mute(EventFilter[java.net.UnknownHostException]()))
scala-2.10.6/test/disabled/presentation/akka/src/com/eaio/uuid/UUIDGen.java:38:import java.net.UnknownHostException;
@justinc1 the MalformedURLException is related when adding JMX, so I've disabled that for now...
I'm still chasing the option IPv6 is being used to open a connection to the cloud-init IP address. I've made some progress, it's happening in modules/cloud-init/client.cc#connect_with_timeout. There looks to be something going on with how the _endpoint is being initialised. Unfortunately, I'm not certain how C++ templating works so I'm having trouble verifying the _endpoint.protocol() (InternetProtocol::v6() or InternetProtocol::v6()). Any tips on how to reference InternetProtocol would be highly appreciated!
Cheers,
Rowland
Hi @rowlandwatkins I'm not familiar myself with boost::asio, but I would assume that if the IP address is a numeric IPv4 address (a so-called "dotted quad"), it would assume it is an IPv4. I also assume that you didn't give it an IPv6 numeric address (you would have noticed that). Perhaps you gave it a name, and your name server returned an IPv6 address? If you can reproduce this problem, you can add printouts inside cloud-init/client.cc and inside OSv to see what address is being connected and why, to hopefully understand what's going on. Good luck!
@nyh all logs suggest the IP address resolved is IPv4. There hasn't been any changes to the OSv codebase except adding verbosity and cout statements. It's just a little odd that cloud-init refuses to work locally or on AWS. I did wonder if there's been major changes in boost since my last image from last year. I guess no one else has had an issue with cloud-init on AWS?
Cheers,
Rowland
refuses to work locally
Is it possible to reproduce this without scala/akka app? Can you send commands required to reproduce, from "git clone https://github.com/cloudius-systems/osv" onward?
@justinc1 Yes, having the commandline just running Java without any arguments produces the same result. Here's my build procedure:
1) "git clone https://github.com/cloudius-systems/osv" 2) "git submodule update --init --recursive" 3) "./scripts/build image=httpserver,cloud-init,java,cli fs_size_mb=1024 jdkbase=/opt/tools/jdk1.8.0_131" 4) "capstan import ${version} build/release.x64/usr.img"
If I don't have the EC2 sim, then I'll get:
[[[ cloud-init failed: Failed getting cloud-init information No data source matched ]]]
With simulator running:
[[[ eth0: can't handle af168 cloud-init failed: connect Address family not supported by protocol ]]]
Cheers,
Rowland
The IPv6 address family is 10, not 168. If we're trying address family 168, it looks like we have a memory overflow, or use-before-set bug or something of this sort. Perhaps you can insert an abort() in the place in OSv which prints the "can't handle af" message, and in the debugger see what code path reached this, and try to figure out how.
OK, digging further:
eth0: can't handle af168 is called in bsd/sys/net/if_ethersubr.cc#ether_output
There's a switch over bsd_sockaddr sa_family which doesn't seem to get matched at all:
[[[ default: if_print(ifp, "can't" handle af%d\n", dst->sa_family); senderr(EAFNOSUPPORT); ]]]
Successful matches would include INET and INET6. So the symptom here appears to be the address not having a valid sa family.
So, in all this, cloud-init.so --force-probe will probe then fallover when getting the metadata over HTTP due to this sockaddr family issue. bizarre and will keep digging.
Cheers, Rowland
Yes, I know that that code prints this message (this is the only "grep" match for this message), but of course the more interesting question is who set up this broken address, or what broke it. I'm starting to worry that maybe we're seeing here some bizarre case of version mismatch of Boost. Could you please try "scripts/build clean" (to clean up all the apps and modules, not just the OSv kernel) and then build again, to see if maybe that would help?
OK, so I've narrowed this down to bsd/sys/netinet/ip_output.cc#ip_output - somehow the dst->sin_family gets overwritten from 2 (AF_INET) to 168.
Funnily enough, with some strategic print statements I've had cloud-init working on some boots, but it's now failing again inconsistently.
I've narrowed it down to the following code (in the sense that dst->sin_family gets mutated from 2 to 168:
287-292 [[[ if (rte->rt_flags & RTF_GATEWAY) dst = (struct bsd_sockaddr_in *)rte->rt_gateway; if (rte->rt_flags & RTF_HOST) isbroadcast = (rte->rt_flags & RTF_BROADCAST); else isbroadcast = in_broadcast(dst->sin_addr, ifp); ]]]
rte->rt_gateway->sa_family is 168. Soooo, I'm not sure how that's happening......
Cheers,
Rowland
Vilmos Nebehaj thinks this issue may have been solved by commit d9a87715c8ce545b4399853088eb3767565f43a7 - He said on the mailing list:
I think it's not the destination address, but the interface address in rtentry->rt_ifa. It's a pointer in struct rtentry, so memcpy() in route_cache::lookup() will copy the address, but then when the interface address is released (via SIOCDIFADDR), the pointer in the route_cache entry will become stale. But I only tested that invalidating the route_cache after SIOCDIFADDR fixed the connection problem in cloud-init, have not debugged rt_ifa any further.
I was just reading comments at start of routecache.hh - I got independently complain about "can't handle afX" and cloud-init not working. I agree the routeing cache is part of problem, as I can avoid problem if either:
Now I see I don't need to implement the invalidate(), as @ldx already did it. Huge thank to both of you, it was literary just in time. I will test it on our openstack now :)
Yes, cloud-init works now.
Excellent! So I'm closing this issue (please reopen if there's still a problem - there was a lot of discussion on this issue before we focused on the routing table thing, I'm not sure if anything in the beginning of the discussion is still relevant). Thanks Vilmos for fixing it.
@justinc1 @nyh this is awesome, thanks so much! Will test this out.
Hi folks,
I'm getting some interesting errors on startup on AWS:
[[[ cloud-init failed: connect failed Operation canceled java.so: Starting JVM app using: io/osv/isolated/RunIsolatedJvmApp java.so: Setting Java system classloader to IsolatingOsvSystemClassLoader and logging manager to IsolatingLogManager ]]]
With the following commandline:
[[[ /java.so -Dlogback.configurationFile=/conf/logging/logback.xml -Dconfig.file=/conf/application.conf -cp /lib/* akka.kernel.Main com.blah.Main ]]]
If I include remote JMX remote, I'd get the following:
[[[ Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException:: : Name does not resolve
]]]
cloud-init is initialised as:
/init/00-cmdline: [[[ /usr/mgmt/cloud-init.so --force-probe ]]]
I'm wondering if my approach to initialising cloud-init is now out of date... Any tips to getting cloud-init back up?
Cheers,
Rowland