application-research / ehi-proxmaas

Integrates Proxmox, MAAS and Technitium to provide one-touch deployment of many machines

Internal Server Error: unable to create VM 206 - close (rename) atomic file #21

Closed. PC-Admin closed this issue 1 year ago.

PC-Admin commented 1 year ago

This is a strange one:

TASK [Create the requested VMs] **********************************************************************************************************************
changed: [localhost] => (item=21)
failed: [localhost] (item=22) => {"ansible_loop_var": "item", "changed": false, "item": 22, "msg": "Reached timeout while waiting for creating VM. Last line in task before timeout: [{'n': 1, 't': '/dev/rbd34'}]"}
failed: [localhost] (item=23) => {"ansible_loop_var": "item", "changed": false, "item": 23, "msg": "creation of qemu VM prod-phos-k8s-w23 with vmid 150 failed with exception=HTTPSConnectionPool(host='proxmox.estuary.tech', port=8006): Read timed out. (read timeout=5)", "vmid": "150"}
changed: [localhost] => (item=24)
changed: [localhost] => (item=25)
changed: [localhost] => (item=26)
changed: [localhost] => (item=27)
changed: [localhost] => (item=28)
changed: [localhost] => (item=29)
changed: [localhost] => (item=30)
failed: [localhost] (item=31) => {"ansible_loop_var": "item", "changed": false, "item": 31, "msg": "creation of qemu VM prod-phos-k8s-w31 with vmid 200 failed with exception=HTTPSConnectionPool(host='proxmox.estuary.tech', port=8006): Read timed out. (read timeout=5)", "vmid": "200"}
failed: [localhost] (item=32) => {"ansible_loop_var": "item", "changed": false, "item": 32, "msg": "creation of qemu VM prod-phos-k8s-w32 with vmid 200 failed with exception=500 Internal Server Error: unable to create VM 200 - close (rename) atomic file '/etc/pve/nodes/altair/qemu-server/200.conf' failed: File exists", "vmid": "200"}
failed: [localhost] (item=33) => {"ansible_loop_var": "item", "changed": false, "item": 33, "msg": "creation of qemu VM prod-phos-k8s-w33 with vmid 201 failed with exception=HTTPSConnectionPool(host='proxmox.estuary.tech', port=8006): Read timed out. (read timeout=5)", "vmid": "201"}
failed: [localhost] (item=34) => {"ansible_loop_var": "item", "changed": false, "item": 34, "msg": "creation of qemu VM prod-phos-k8s-w34 with vmid 201 failed with exception=HTTPSConnectionPool(host='proxmox.estuary.tech', port=8006): Read timed out. (read timeout=5)", "vmid": "201"}
changed: [localhost] => (item=35)
changed: [localhost] => (item=36)
changed: [localhost] => (item=37)
changed: [localhost] => (item=38)
failed: [localhost] (item=39) => {"ansible_loop_var": "item", "changed": false, "item": 39, "msg": "creation of qemu VM prod-phos-k8s-w39 with vmid 206 failed with exception=HTTPSConnectionPool(host='proxmox.estuary.tech', port=8006): Read timed out. (read timeout=5)", "vmid": "206"}
failed: [localhost] (item=40) => {"ansible_loop_var": "item", "changed": false, "item": 40, "msg": "creation of qemu VM prod-phos-k8s-w40 with vmid 206 failed with exception=500 Internal Server Error: unable to create VM 206 - close (rename) atomic file '/etc/pve/nodes/altair/qemu-server/206.conf' failed: File exists", "vmid": "206"}
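
The duplicate-vmid failures above (items 31/32 and 39/40 both racing for vmids 200 and 206) suggest the create task auto-allocates the next free vmid, so a retry after a read timeout collides with the half-created config file of its predecessor. A minimal hardening sketch, assuming the community.general.proxmox_kvm module and the parameter names visible in the module_args later in this thread; vmid_base and worker_items are hypothetical:

- name: Create the requested VMs
  community.general.proxmox_kvm:
    api_host: proxmox.estuary.tech
    api_user: root@pam
    api_token_id: ansible-maas
    api_token_secret: "{{ proxmox_api_token }}"  # hypothetical secret variable
    node: altair
    name: "prod-phos-k8s-w{{ item }}"
    vmid: "{{ vmid_base + item }}"  # pinning the vmid makes retries idempotent; vmid_base is hypothetical
    state: present
  loop: "{{ worker_items }}"  # hypothetical list of worker numbers, e.g. range(21, 41) | list
  register: create_result
  retries: 3    # absorb transient 5s read timeouts instead of failing the item
  delay: 15
  until: create_result is not failed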
PC-Admin commented 1 year ago

Interestingly, I also got a new error with destroy.yml today:

failed: [localhost] (item=25) => {"ansible_loop_var": "item", "changed": false, "item": 25, "msg": "Reached timeout while waiting for removing VM. Last line in task before timeout: [{'t': \"Could not remove disk 'vm-storage:vm-192-disk-1', check manually: cfs-lock 'storage-vm-storage' error: no quorum!\", 'n': 1}]"}

This seems to be related to the recent networking failures.
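
The cfs-lock "no quorum!" message means pmxcfs is refusing writes because the node dropped out of corosync quorum mid-run. One way to fail fast is a quorum guard before any destructive tasks; a minimal sketch, assuming SSH access to a PVE node in inventory (the pve_node alias is hypothetical) and matching against the stock pvecm status output:

- name: Check cluster quorum before removing VMs
  ansible.builtin.command: pvecm status
  register: quorum_check
  changed_when: false  # read-only status query
  delegate_to: pve_node  # hypothetical inventory alias for a Proxmox host

- name: Abort if the cluster is not quorate
  ansible.builtin.assert:
    that: quorum_check.stdout is search('Quorate:\s+Yes')
    fail_msg: "No quorum; pmxcfs will reject cfs-lock operations."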

PC-Admin commented 1 year ago

Moved the playbooks over to FDI in hopes of making them more reliable. (The network is really being hammered at the moment.)

Ended up seeing this new bug. VM 200 definitely existed, but I'm guessing we might need a poll here as well:

TASK [Start the requested VMs] ***********************************************************************************************************************
failed: [localhost] (item={'changed': True, 'msg': 'VM prod-phos-k8s-w29 with vmid 200 deployed', 'mac': {'net0': 'CA:D6:D2:34:A4:FB'}, 'devices': {'scsi0': 'vm-storage:vm-200-disk-2'}, 'vmid': 200, 'invocation': {'module_args': {'api_user': 'root@pam', 'api_token_id': 'ansible-maas', 'api_token_secret': 'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER', 'api_host': 'sirius.estuary.tech', 'timeout': 60, 'name': 'prod-phos-k8s-w29', 'memory': 65536, 'balloon': 16384, 'cores': 6, 'agent': 'True', 'description': 'Production Phosphophyllite Kubernetes Worker node', 'onboot': False, 'boot': 'nc', 'bootdisk': 'scsi0', 'cpu': 'host', 'node': 'sirius', 'scsihw': 'virtio-scsi-single', 'scsi': {'scsi0': 'vm-storage:100,format=raw,discard=on,ssd=1'}, 'net': {'net0': 'bridge=vmbr0,virtio,mtu=1,firewall=1'}, 'bios': 'ovmf', 'efidisk0': {'format': 'raw', 'efitype': '4m', 'pre_enrolled_keys': False}, 'validate_certs': False, 'full': True, 'state': 'present', 'update': False, 'proxmox_default_behavior': 'no_defaults', 'api_password': None, 'archive': None, 'acpi': None, 'args': None, 'autostart': None, 'cicustom': None, 'cipassword': None, 'citype': None, 'ciuser': None, 'clone': None, 'cpulimit': None, 'cpuunits': None, 'delete': None, 'digest': None, 'force': None, 'format': None, 'freeze': None, 'hostpci': None, 'hotplug': None, 'hugepages': None, 'ide': None, 'ipconfig': None, 'keyboard': None, 'kvm': None, 'localtime': None, 'lock': None, 'machine': None, 'migrate_downtime': None, 'migrate_speed': None, 'newid': None, 'numa': None, 'numa_enabled': None, 'ostype': None, 'parallel': None, 'pool': None, 'protection': None, 'reboot': None, 'revert': None, 'sata': None, 'serial': None, 'shares': None, 'skiplock': None, 'smbios': None, 'snapname': None, 'sockets': None, 'sshkeys': None, 'startdate': None, 'startup': None, 'storage': None, 'tablet': None, 'tags': None, 'target': None, 'tdf': None, 'template': None, 'vcpus': None, 'vga': None, 'virtio': None, 'vmid': None, 'watchdog': None}}, 'failed': False, 'item': 29, 'ansible_loop_var': 'item'}) => {"ansible_loop_var": "item", "changed": false, "item": {"ansible_loop_var": "item", "changed": true, "devices": {"scsi0": "vm-storage:vm-200-disk-2"}, "failed": false, "invocation": {"module_args": {"acpi": null, "agent": "True", "api_host": "sirius.estuary.tech", "api_password": null, "api_token_id": "ansible-maas", "api_token_secret": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER", "api_user": "root@pam", "archive": null, "args": null, "autostart": null, "balloon": 16384, "bios": "ovmf", "boot": "nc", "bootdisk": "scsi0", "cicustom": null, "cipassword": null, "citype": null, "ciuser": null, "clone": null, "cores": 6, "cpu": "host", "cpulimit": null, "cpuunits": null, "delete": null, "description": "Production Phosphophyllite Kubernetes Worker node", "digest": null, "efidisk0": {"efitype": "4m", "format": "raw", "pre_enrolled_keys": false}, "force": null, "format": null, "freeze": null, "full": true, "hostpci": null, "hotplug": null, "hugepages": null, "ide": null, "ipconfig": null, "keyboard": null, "kvm": null, "localtime": null, "lock": null, "machine": null, "memory": 65536, "migrate_downtime": null, "migrate_speed": null, "name": "prod-phos-k8s-w29", "net": {"net0": "bridge=vmbr0,virtio,mtu=1,firewall=1"}, "newid": null, "node": "sirius", "numa": null, "numa_enabled": null, "onboot": false, "ostype": null, "parallel": null, "pool": null, "protection": null, "proxmox_default_behavior": "no_defaults", "reboot": null, "revert": null, "sata": null, "scsi": {"scsi0": 
"vm-storage:100,format=raw,discard=on,ssd=1"}, "scsihw": "virtio-scsi-single", "serial": null, "shares": null, "skiplock": null, "smbios": null, "snapname": null, "sockets": null, "sshkeys": null, "startdate": null, "startup": null, "state": "present", "storage": null, "tablet": null, "tags": null, "target": null, "tdf": null, "template": null, "timeout": 60, "update": false, "validate_certs": false, "vcpus": null, "vga": null, "virtio": null, "vmid": null, "watchdog": null}}, "item": 29, "mac": {"net0": "CA:D6:D2:34:A4:FB"}, "msg": "VM prod-phos-k8s-w29 with vmid 200 deployed", "vmid": 200}, "msg": "VM with name = prod-phos-k8s-w29 does not exist in cluster"}
PC-Admin commented 1 year ago

I believe all of these bugs were the result of Apollo being hammered and Proxmox's corosync failing. Since Apollo has had its Proxmox hosts disabled, proxmaas now seems more reliable.