mik373 opened 9 years ago
Update:
We managed to get SSH to the public IP of the default eth0 working consistently by setting eth0's default gateway with a unit:
- name: restart_network.service
  command: start
  content: |
    [Unit]
    Description=Set the Gateways
    After=network-online.target
    Wants=network-online.target
    Before=docker.service
    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/bin/route del default eth1
    ExecStart=/usr/bin/route add default gw 172.20.0.1 eth0
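For reference, the same gateway swap can be written with iproute2 instead of the legacy route tool; this is only a sketch and assumes, as above, that 172.20.0.1 is the VPC router for eth0's subnet:
# drop the default route learned on eth1
/usr/bin/ip route del default dev eth1
# make eth0 carry the only default route
/usr/bin/ip route replace default via 172.20.0.1 dev eth0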
The only issue we have right now is that, 2 out of 3 times, the Datadog agent can't talk out to the Datadog service:
2015-11-20 23:54:01 UTC | ERROR | dd.forwarder | forwarder(ddagent.py:267) | Response: HTTPResponse(_body=None,buffer=None,code=599,effective_url='https://5-5-1-app.agent.datadoghq.com/intake/?api_key=d9c988c950eb837f5583e676509734a9',error=HTTPError('HTTP 599: Timeout',),headers={},reason='Unknown',request=<tornado.httpclient.HTTPRequest object at 0x7f579c699fd0>,request_time=20.00114893913269,time_i
The Datadog agent is started by this unit:
- name: datadog-agent.service
  command: start
  content: |
    [Unit]
    Description=Datadog
    After=docker.service
    Requires=docker.service
    [Service]
    Restart=always
    EnvironmentFile=/etc/etcd-environment
    ExecStartPre=-/usr/bin/docker kill dd-agent
    ExecStartPre=-/usr/bin/docker rm dd-agent
    ExecStartPre=/usr/bin/sleep 30
    ExecStart=/usr/bin/docker run -h %H --name dd-agent \
      --add-host=etcd:$${ETCD_LOCAL_HOST} \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /sys/fs/cgroup:/host/sys/fs/cgroup:ro \
      -v /proc/:/host/proc/:ro \
      -e API_KEY=key \
      registry:${dd_agent_version}
    ExecStop=/usr/bin/docker stop dd-agent
    [X-Fleet]
    MachineMetadata=role=worker
    Global=True
Interestingly enough, every time the container comes up with this route:
core@ip-172-20-0-11 ~ $ docker exec -it dd-agent ip route
default via 172.17.42.1 dev eth0
the agent can't talk out to the Datadog service, but if the IP issued to the container is 172.17.42.2, the service is reachable.
The most appropriate method for configuring those interfaces is to provide your own .network configs. The "Match" section can be used to selectively apply configs to the various interfaces.
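A minimal sketch of what such configs could look like (the file names and MAC addresses below are placeholders, not taken from this issue); the [Match] section decides which link each file applies to, here by MAC address so the result does not depend on interface naming order:
# /etc/systemd/network/10-public.network
[Match]
MACAddress=0a:de:ad:be:ef:01
[Network]
DHCP=ipv4
# /etc/systemd/network/20-private.network
[Match]
MACAddress=0a:de:ad:be:ef:02
[Network]
DHCP=ipv4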
@mik373 Were you able to get this working with the networkd configs?
I can't use static IP configs for two reasons:
You should be able to define .network configs for each interface which enables DHCP. For the public interface gateway, use a lower routing metric to ensure egress packets deterministically use that interface.
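For example (a sketch only; the interface names and metric values here are assumptions, not from this issue), enable DHCP on both links and give the public interface the lower metric so its default route wins:
# /etc/systemd/network/10-eth0-public.network
[Match]
Name=eth0
[Network]
DHCP=ipv4
[DHCP]
RouteMetric=512
# /etc/systemd/network/20-eth1-private.network
[Match]
Name=eth1
[Network]
DHCP=ipv4
[DHCP]
RouteMetric=2048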
So my etcd cluster with my config works about 80% of the time. The other 20% of the time the interfaces are initialized in an order that creates asymmetric IP routes, and the cluster members can't dial each other. It seems that my issue might have to do with eth0 and eth1 coming from the same subnet, which confuses the routes. I am trying to use a different subnet now for the launched instances, but SSH times out when that's the case. Anything special I have to do on the CoreOS level for SSH to work? The ingress rules are configured correctly.
@mik373 Sorry, I just noticed there was an open question from you. No, nothing special is needed on CoreOS for SSH to work. Are you still having trouble with this?
I am having the same or a very similar issue.
My setup is fairly similar: I have a bunch of instances with a single network interface to start with, and then there is a daemon which attaches an additional ENI (eth1).
I found that systemd-networkd fails to bring up eth1 properly. I believe I am hitting this issue: https://github.com/systemd/systemd/issues/1784
So I have the following hack to make sure that eth1 comes up:
[Unit]
Description=Brings up eth1 when networkd fails to bring it up
[Service]
ExecStart=/usr/bin/bash -c 'while true; do ip -o -4 link show | grep -q "eth1:.*state DOWN" && ip link set up dev eth1; sleep 60; done'
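For context, a quick way to compare networkd's view of a link with the kernel's (standard tooling, not specific to this workaround):
# networkd's view; the SETUP column shows configured/failed/unmanaged
networkctl list
# kernel's view; look for "state DOWN"
ip -o link show eth1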
The other problem that I just noticed is that if I reboot an instance which has two ENIs (eth0 and eth1), the instance comes up with no working network apart from eth1, which only works because of the above hack.
This is quite a serious problem, because it prevents us from using CoreOS with more than one network interface on EC2.
I don't know if this can help anyone, but I have instances on AWS with two interfaces. I was having the same problem: when eth1 became active and the machine rebooted, I would lose network connectivity. The second interface adds another default route and it messes with your eth0 setup. I added this to my /etc/systemd/network:
[Match]
Name=eth1
[Network]
DHCP=ipv4
[DHCP]
UseDNS=false
SendHostname=true
UseRoutes=false
RouteMetric=2000
I believe that if you use static IPs with a higher route metric, that can also help you avoid losing connectivity.
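A sketch of that static variant for the secondary interface (the address, gateway, and metric are placeholders); with a static configuration the metric is set in a [Route] section rather than under [DHCP]:
[Match]
Name=eth1
[Network]
Address=172.20.0.21/24
[Route]
Gateway=172.20.0.1
Metric=2048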
@vaijab can you give this another shot with the latest Alpha? That ships with a much newer version of systemd. @marcovnyc's suggestion to set the route metric is also interesting and might help out. I haven't had a chance to look into this yet.
Thanks @crawford. This is what I have in my user-data to make it work:
# This is a dirty workaround hack until this has been fixed: https://github.com/systemd/systemd/issues/1784
- name: networkd-restart.service
  command: start
  enable: true
  content: |
    [Unit]
    Description=Restart systemd-networkd when DOWN interface is found
    [Service]
    ExecStart=/usr/bin/bash -c 'while true; do ip -o -4 link show | grep -q "eth[0-1]:.*state DOWN" && systemctl restart systemd-networkd; sleep 60; done'
    Restart=always
    RestartSec=10
- name: 20-eth1.network
  runtime: false
  content: |
    [Match]
    Name=eth1
    [Network]
    DHCP=ipv4
    [DHCP]
    UseDNS=false
    SendHostname=true
    UseRoutes=false
    RouteMetric=2048
Is this issue still present with systemd 231?
Closing due to inactivity.
This is still an issue in 1911.4.0 as far as I can tell.
We've found this is an issue when using CoreOS (1911.3.0 at time of writing) with https://github.com/aws/amazon-vpc-cni-k8s/ in EC2.
When enough pods are scheduled onto an instance, additional interfaces/ENIs are created. Pod IPs are drawn from a pool of secondary IPs attached to each interface as an implementation detail of the Amazon VPC CNI. These new interfaces learn default routes via DHCP with a metric of 1024. After a reboot, the order of the default routes is undetermined, and the node is then unreachable via the eth0 IP address if a non-eth0 default is "first" in the kernel's route table (ip route show | grep default or similar to check).
We are currently working around this by lowering the metric for the eth0 default route with an /etc/systemd/network/10-eth0-default-pref.network systemd-networkd unit file like:
[Match]
Name=eth0
[Network]
DHCP=ipv4
[DHCP]
RouteMetric=512
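After a reboot the effect can be verified with standard tooling (a verification sketch, not part of the original report); the eth0 default route should now win because of its lower metric:
# default routes; the one with the lowest metric is preferred
ip route show default
# confirm networkd applied the DHCP settings to eth0
networkctl status eth0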
Is systemd-udevd running in the system?
(I've asked the same question in https://github.com/systemd/systemd/issues/1784. Sorry for multi-posting.)
@yuwata systemd-udevd does run on Container Linux.
@bgilbert Thanks. I'd like to ask one more thing: please provide the results of systemd-detect-virt and systemd-detect-virt --container.
BTW, if you think this is a bug in networkd or udevd, then please open a new issue in systemd and provide debugging logs of the daemons: booting with systemd.log_level=debug udev.log_priority=debug and collecting journalctl -b -u systemd-networkd.service -u systemd-udevd.service --no-hostname may be sufficient. Thank you.
Not sure, but https://github.com/systemd/systemd/pull/11881 may fix this issue.
Hi experts, I am using CoreOS-stable-2135.5.0-hvm (ami-049ed451bb483d4be) and found that this issue still exists. Is there a corresponding solution or bug-fix plan?
Seems to still be a problem in CoreOS-stable-2191.5.0-hvm (ami-038cea5071a5ee580).
Scenario: