canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

MAC address already in use #11053

Closed: redno2 closed this issue 1 year ago

redno2 commented 2 years ago

Issue description

I use GitHub Actions to spawn and destroy containers. After a few runs and some job cancellations, a setup that had been working well now fails: I cannot start the instance because the MAC address still appears to be in use, though not by any existing container. The MAC seems to be kept in the database, and when I recreate the same container with the same MAC address, I get a "MAC address already in use" error when starting it.

Steps to reproduce

  1. Create the container with a specific cloud-init configuration and MAC address.
  2. Delete and recreate it multiple times, as the GitHub Actions job does.
  3. This worked well until, probably after cancelling and restarting the job (which deletes and recreates the same container repeatedly), starting the container fails with the error below (a sketch of the loop follows the error):

    Error: Failed to start device "eth0": Failed to set the MAC address: Failed to run: ip link set dev macb905583c address 00:16:3e:12:12:12: exit status 2 (RTNETLINK answers: Address already in use)
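
For reference, a minimal sketch of the loop the job runs (the image alias and instance name are placeholders here, not the exact pipeline):

    # Sketch: create a VM, pin its MAC, start it, then force-delete it,
    # roughly what a cancelled CI run does repeatedly.
    for i in 1 2 3; do
      lxc init ubuntu:jammy repro-vm --vm
      # assumes eth0 is inherited from a profile, so it can be overridden
      lxc config device override repro-vm eth0 hwaddr=00:16:3e:12:12:12
      lxc start repro-vm
      lxc delete repro-vm --force
    done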

Information to attach

    • [ ] Any relevant kernel output (dmesg)
    • [ ] Container log (lxc info NAME --show-log)
      $ lxc info pc-efkolos-desktop-jammy --show-log
      Name: pc-efkolos-desktop-jammy
      Status: STOPPED
      Type: virtual-machine
      Architecture: x86_64
      Created: 2022/10/24 17:59 CEST
      Last Used: 2022/10/25 09:28 CEST
      Error: open /var/snap/lxd/common/lxd/logs/pc-efkolos-desktop-jammy/qemu.log: no such file or directory
    • [ ] Container configuration (lxc config show NAME --expanded)

      architecture: x86_64
      config:
        boot.autostart: "false"
        image.architecture: amd64
        image.description: Ubuntu jammy amd64 (desktop) (20220801_09:23)
        image.name: ubuntu-jammy-amd64-desktop-20220801_09:23
        image.os: ubuntu
        image.release: jammy
        image.serial: "20220801_09:23"
        image.variant: desktop
        limits.cpu: "4"
        limits.memory: 4GB
        limits.memory.swap: "false"
        user.user-data: |
          #cloud-config
          # User and Group Management
          users:
            - name: action-runner
              groups: sudo
              primary_group: action-runner
              shell: /bin/bash
              sudo: ['ALL=(ALL) NOPASSWD:ALL']
              lock_passwd: false
              passwd: "HASH"
              ssh-authorized-keys:
                - ssh-rsa 1234
            - name: ubuntu
              groups: sudo
              shell: /bin/bash
              sudo: ['ALL=(ALL) NOPASSWD:ALL']
              lock_passwd: false
              passwd: "HASH"
              ssh-authorized-keys:
                - ssh-rsa 1234
          ssh_authorized_keys:
            - ssh-rsa 1234
          # Run apt upgrade
          package_upgrade: true
          # Install arbitrary packages
          packages:
            - openssh-server
            - curl
          # Add apt repositories
          apt_mirror: http://mirror.local
          # Run Arbitrary Commands for More Control
          runcmd:
            - chown -R action-runner:action-runner /home/action-runner
            - cp /etc/skel/.bashrc /home/action-runner/
            - touch /home/action-runner/.bash_profile
            - echo "https_proxy=http://127.0.0.1:3129" | tee -a /etc/environment /etc/bash.bashrc
            - echo "http_proxy=http://127.0.0.1:3129" | tee -a /etc/environment /etc/bash.bashrc
            - echo "no_proxy=127.0.0.1,localhost,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255" | tee -a /etc/environment /etc/bash.bashrc
            - sed -i "/.*_proxy/s/^/export\ /" /etc/bash.bashrc
            - apt purge -y unattended-upgrades
          power_state:
            delay: now
            mode: poweroff
            message: Bye Bye
            timeout: 5
            condition: True
        volatile.base_image: a0d63a06d6d5eb55cc999b2a6b5578273526b61dde9b60333a70ef4af9b0ec34
        volatile.cloud-init.instance-id: f6b68b1d-4352-4302-a5b9-c1a4ca1d987d
        volatile.eth0.hwaddr: 00:16:3e:12:12:12
        volatile.last_state.power: STOPPED
        volatile.last_state.ready: "false"
        volatile.uuid: f9a84a5a-ae53-4fba-a7af-1fcb49c52430
        volatile.vsock_id: "357"
      devices:
        eth0:
          name: eth0
          nictype: macvlan
          parent: lxdbr121
          type: nic
        root:
          path: /
          pool: lxd_pool
          size: 16GiB
          type: disk
        sharedHomeDir:
          path: /opt/pc-efkolos
          readonly: "True"
          source: /home/action-runner/actions-runner-2/_work/pc-efkolos/pc-efkolos
          type: disk
      ephemeral: false
      profiles:
      - pc-efkolos
      stateful: false
      description: ""
    • [ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)

    cat /var/snap/lxd/common/lxd/logs/lxd.log

    time="2022-10-24T18:14:39+02:00" level=warning msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
    time="2022-10-24T18:14:39+02:00" level=warning msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
    time="2022-10-24T18:14:44+02:00" level=error msg="Invalid configuration key: unknown key" key=storage.lvm_fstype
    time="2022-10-24T18:14:44+02:00" level=error msg="Invalid configuration key: unknown key" key=storage.lvm_mount_options
    time="2022-10-24T18:14:44+02:00" level=error msg="Invalid configuration key: unknown key" key=storage.lvm_thinpool_name
    time="2022-10-24T18:14:44+02:00" level=error msg="Invalid configuration key: unknown key" key=storage.lvm_volume_size
    time="2022-10-24T18:14:44+02:00" level=warning msg="Failed to initialize fanotify, falling back on fsnotify" err="Failed to initialize fanotify: invalid argument"
    time="2022-10-24T18:14:45+02:00" level=warning msg="Failed to update instance types: Get \"https://uk.lxd.images.canonical.com/meta/instance-types/.yaml\": Forbidden"
    time="2022-10-25T09:43:07+02:00" level=error msg="Failed writing error for HTTP response" err="open /var/snap/lxd/common/lxd/logs/pc-efkolos-desktop-jammy/qemu.log: no such file or directory" url="/1.0/instances/{name}/logs/{file}" writeErr=""
    time="2022-10-25T10:09:22+02:00" level=error msg="Failed writing error for HTTP response" err="open /var/snap/lxd/common/lxd/logs/pc-efkolos-desktop-jammy/qemu.log: no such file or directory" url="/1.0/instances/{name}/logs/{file}" writeErr=""

    
    
    • [ ] Output of the client with --debug
    • [ ] Output of the daemon with --debug (alternatively output of `lxc monitor` while reproducing the issue)
    location: none
    metadata:
    context:
    class: task
    description: Starting instance
    err: 'Failed to start device "eth0": Failed to set the MAC address: Failed to
      run: ip link set dev mac218dfc5f address 00:16:3e:12:12:12: exit status 2 (RTNETLINK
      answers: Address already in use)'
    operation: 6ca9cbfb-abb1-42f7-93af-6f99e7207748
    project: default
    level: debug
    message: Failure for operation
    timestamp: "2022-10-25T10:58:54.919550399+02:00"
    type: logging
tomponline commented 2 years ago

Can you get the output of ip l on the system when you get the error?

tobiaspal commented 2 years ago

I seem to have stumbled on the same issue.

# lxc start production-jenkinsmaster02
Error: Failed to start device "eth0": Failed to set the MAC address: Failed to run: ip link set dev mac6289d050 address aa:00:00:c2:48:94: exit status 2 (RTNETLINK answers: Address already in use)
Try `lxc info --show-log production-jenkinsmaster02` for more info
# lxc info --show-log production-jenkinsmaster02
Name: production-jenkinsmaster02
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: osg-node3
Created: 2022/04/24 03:42 +08
Last Used: 2022/10/29 06:17 +08
Error: open /var/snap/lxd/common/lxd/logs/production-jenkinsmaster02/qemu.log: no such file or directory

I ran ip l and saw that there was already a device with the same MAC:

# ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 08:94:ef:10:83:44 brd ff:ff:ff:ff:ff:ff
    altname enp22s0f0
3: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master trunk state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff permaddr 08:94:ef:10:83:45
    altname enp22s0f1
4: eno3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master trunk state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff permaddr 08:94:ef:10:83:46
    altname enp22s0f2
5: eno4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master trunk state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff permaddr 08:94:ef:10:83:47
    altname enp22s0f3
6: enx0a94ef10834b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 0a:94:ef:10:83:4b brd ff:ff:ff:ff:ff:ff
7: trunk: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff
8: trunk.41@trunk: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff
10: macdc1379a0@trunk.41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP mode DEFAULT group default qlen 500
    link/ether aa:00:00:5d:0d:a9 brd ff:ff:ff:ff:ff:ff
22: trunk.252@trunk: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7e:8e:a3:99:bb:0a brd ff:ff:ff:ff:ff:ff
24: maca6aee821@trunk.41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP mode DEFAULT group default qlen 500
    link/ether aa:00:00:c2:48:94 brd ff:ff:ff:ff:ff:ff
25: mac6f081606@trunk.41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP mode DEFAULT group default qlen 500
    link/ether aa:00:00:4d:cf:ef brd ff:ff:ff:ff:ff:ff
29: mac1907947c@trunk.41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP mode DEFAULT group default qlen 500
    link/ether aa:00:00:5e:47:f8 brd ff:ff:ff:ff:ff:ff

So I deleted that device and that allowed me to start the instance:

# ip l del dev maca6aee821
# lxc start production-jenkinsmaster02
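
A more general form of that workaround, should it recur (a sketch, assuming the instance is stopped so the only link carrying its MAC is the stale one; the instance name is an example):

# Look up the MAC LXD pinned for the instance, find the host-side link
# still carrying it, and delete that link.
MAC=$(lxc config get production-jenkinsmaster02 volatile.eth0.hwaddr)
LINK=$(ip -o link | awk -v mac="$MAC" '$0 ~ mac { sub(/@.*/, "", $2); sub(/:$/, "", $2); print $2; exit }')
[ -n "$LINK" ] && ip link del dev "$LINK"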

Here is some info about my environment:

# uname -a
Linux osg-node3 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# lxc --version
5.7
# lxd --version
5.7
# lxc config show production-jenkinsmaster02 --expanded
architecture: x86_64
config:
  limits.cpu: "48"
  limits.memory: 16GB
  security.nesting: "true"
  volatile.cloud-init.instance-id: 16b013a6-f75e-473a-a4c7-b01c7320404a
  volatile.eth0.host_name: mac71c65c73
  volatile.eth0.hwaddr: aa:00:00:c2:48:94
  volatile.eth0.last_state.created: "true"
  volatile.last_state.power: RUNNING
  volatile.uuid: b1e5ea52-5f5d-4149-b13e-d07bc777a34b
  volatile.vsock_id: "123"
devices:
  eth0:
    name: eth0
    nictype: macvlan
    parent: trunk
    type: nic
    vlan: "41"
  root:
    path: /
    pool: zfs-rpool
    type: disk
ephemeral: false
profiles:
- default
stateful: false

The only log entry from that day in /var/snap/lxd/common/lxd/logs/lxd.log is this:

time="2022-10-31T09:15:08+08:00" level=error msg="Failed writing error for HTTP response" err="open /var/snap/lxd/common/lxd/logs/production-jenkinsmaster02/qemu.log: no such file or directory" url="/1.0/instances/{name}/logs/{file}" writeErr="<nil>"

Unfortunately I don't have monitor logs.

redno2 commented 2 years ago

I confirm the link was still there after deleting the container (which was a VM).

ip l

....
183: macae0400be@lxdbr121: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 500
    link/ether 00:16:3e:12:12:12 brd ff:ff:ff:ff:ff:ff

ip l del macae0400be

That fixed my issue; I can start the container again.

tomponline commented 2 years ago

This can occur if the instance didn't stop cleanly and LXD never got a chance to remove the old interface.

Can you try to capture the contents of /var/snap/lxd/common/lxd/logs/lxd.log the next time this happens?

This can also be caused by a kernel issue, if for some reason the kernel does not release the interface.
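
One way to capture that while reproducing (a sketch; the instance name is an example):

# Stream LXD log events to a file while triggering the failure.
lxc monitor --type=logging --pretty > lxc-monitor.log 2>&1 &
MONPID=$!
lxc start production-jenkinsmaster02    # reproduce the error
kill $MONPID
# Then attach lxc-monitor.log along with the daemon log:
cat /var/snap/lxd/common/lxd/logs/lxd.log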

tomponline commented 1 year ago

Closing this for now. If you are able to get the logs as requested, please post them over at https://discuss.linuxcontainers.org/, as we prefer to handle support cases there. Thanks