canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Unable to import containers with multiple storage pools attached #6914

Closed · chong601 closed this issue 4 years ago

chong601 commented 4 years ago


Issue description

To avoid issue #6913, I decided to downgrade from the latest LXD to 3.20. However, when I attempted to import the existing container after uninstalling LXD 3.21 and installing LXD 3.20, I got the following error:

```
Error: Create container: Invalid devices: Device validation failed "osm-postgres-index": The "secondary-fio-cache" storage pool doesn't exist
```

Attempted fixes:

The only alternative is to lose the dataset and recreate it from scratch.

Let me know if you require more information.

Steps to reproduce

  1. Install LXD 3.21
  2. Initialize LXD with ZFS pool
  3. Create at least one storage pool
  4. Create custom volume
  5. Create a container
  6. Attach the created custom volume to the container
  7. Shut down (or stop) container
  8. snap remove lxd
  9. snap install lxd --channel=3.20/stable
  10. Mount all datasets used by the container
  11. sudo lxd import <container> (a condensed shell version of these steps follows the list)
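
A condensed shell version of these steps, as referenced above. The pool, volume, container, and dataset names here are hypothetical stand-ins (the reporter's real names appear later in the thread), and the image alias is just an example:

```sh
# 1-2: install LXD 3.21 and initialize it with a ZFS pool
sudo snap install lxd --channel=3.21/stable
sudo lxd init --auto --storage-backend=zfs

# 3-4: a second storage pool backed by an existing dataset, plus a custom volume on it
lxc storage create secondary zfs source=tank/lxd-secondary
lxc storage volume create secondary cachevol

# 5-7: create a container, attach the custom volume, stop the container
lxc launch ubuntu:18.04 c1
lxc config device add c1 cachevol disk pool=secondary source=cachevol path=/mnt/cache
lxc stop c1

# 8-9: swap the snap for 3.20
sudo snap remove lxd
sudo snap install lxd --channel=3.20/stable

# 10-11: mount the backing ZFS datasets, then import
sudo zfs mount -a
sudo lxd import c1
```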

Information to attach

 - [ ] Output of the daemon with --debug (alternatively output of `lxc monitor` while reproducing the issue)

```
location: none
metadata:
  context: {}
  level: dbug
  message: 'New event listener: 2707f803-1c1d-41a5-9644-194e4c0a5b97'
timestamp: "2020-02-21T21:35:37.914805666+08:00"
type: logging

location: none
metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0
    user: ""
  level: dbug
  message: Handling
timestamp: "2020-02-21T21:35:56.24631439+08:00"
type: logging

location: none
metadata:
  context:
    ip: '@'
    method: POST
    url: /internal/containers
    user: ""
  level: dbug
  message: Handling
timestamp: "2020-02-21T21:35:56.249043934+08:00"
type: logging

location: none
metadata:
  context: {}
  level: dbug
  message: 'Database error: &errors.errorString{s:"No such object"}'
timestamp: "2020-02-21T21:35:56.251948143+08:00"
type: logging
```

stgraber commented 4 years ago

Yes, that's normal. lxd import only knows about instances; it doesn't know about custom storage volumes, networks, images, ...

Did that lxd import attempt re-create the storage pool itself (as in, visible in lxc storage list)?

chong601 commented 4 years ago

The storage pool containing the container did get recreated, but not the other one (secondary-fio-cache):

```
+-----------------------+-------------+--------+--------------------------------+---------+
|         NAME          | DESCRIPTION | DRIVER |             SOURCE             | USED BY |
+-----------------------+-------------+--------+--------------------------------+---------+
| secondary-fio-storage |             | zfs    | secondary-fio-storage/lxd-area | 0       |
+-----------------------+-------------+--------+--------------------------------+---------+
```

stgraber commented 4 years ago

Yeah, that makes sense; the instance wasn't on the cache pool, so it didn't get re-created. Is the secondary disk on that instance on the cache pool or on the storage pool?

chong601 commented 4 years ago

Here's the layout of the container

```
secondary-fio-cache:
    - secondary-fio-cache/lxd-area/custom/osm-postgres-cache
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-cache/custom/osm-postgres-cache
    - secondary-fio-cache/lxd-area/custom/osm-postgres-index
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-cache/custom/osm-postgres-index
secondary-fio-storage:
    - secondary-fio-storage/lxd-area/containers/osm-postgres
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-storage/containers/osm-postgres
    - secondary-fio-storage/lxd-area/custom/osm-postgres-db-storage
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-storage/custom/osm-postgres-db-storage
```

(This is based on the zfs list output)
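
For reference, that layout can be pulled with a standard zfs invocation along these lines:

```sh
# Dataset names and mountpoints for both pools, recursively
zfs list -r -o name,mountpoint secondary-fio-cache secondary-fio-storage
```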

stgraber commented 4 years ago

Ok, so yeah, there's no way to automatically import this. Your best bet is to either restore a partial database backup or manually re-add secondary-fio-cache to the database, along with the custom volumes in the storage_volumes table.
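
To make "manually re-add" concrete, here is a hedged sketch using LXD's built-in database shell (lxd sql). The exact table and column names, and the type=2 code for custom volumes, are assumptions about the 3.x schema rather than confirmed details: dump the schema first and back up /var/snap/lxd/common/lxd/database before writing anything.

```sh
# Inspect the real schema; the INSERTs below assume column names taken from it.
lxd sql global .schema

# Assumed sketch: re-create the missing pool row, then one custom volume row.
# (Pool config such as the zfs "source" lives in a separate *_config table.)
lxd sql global "INSERT INTO storage_pools (name, driver, description, state)
                VALUES ('secondary-fio-cache', 'zfs', '', 1);"
lxd sql global "INSERT INTO storage_volumes (name, storage_pool_id, node_id, type, description)
                VALUES ('osm-postgres-index',
                        (SELECT id FROM storage_pools WHERE name = 'secondary-fio-cache'),
                        1, 2, '');"  # type=2 is assumed to mean 'custom'
```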

Once those are sorted, the containers themselves should import just fine.

I did put import/export infrastructure for custom storage volumes on our backlog, so in the near future there should be a better way both to handle disaster recovery for those and to export them as tarballs, similar to what containers support today.

chong601 commented 4 years ago

Ah well, I guess I have to pray that the snapshotted snap data for LXD is salvageable (this server got hit by issue #6913 and was recently recovered by nuking everything). I had planned to just re-import, but that doesn't work because lxd init --preseed doesn't allow creating storage pools from existing datasets.
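
For context, the preseed in question would look roughly like this: a minimal sketch using the pool and dataset names from this thread. On LXD 3.x, initialization fails when the zfs source dataset is already populated, which is the limitation described above.

```yaml
# fed to: lxd init --preseed < preseed.yaml
storage_pools:
- name: secondary-fio-cache
  driver: zfs
  config:
    source: secondary-fio-cache/lxd-area
```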

The main take-away of this issue: lxd import only recovers instances, so additional storage pools and the custom volumes on them have to be re-added to the database manually.

This issue is okay to close as of now.

stgraber commented 4 years ago

In this case, a simple snap revert lxd would have gotten you back on the previous working revision; that's usually worth a shot when something like this happens.

If something weird happens with the database, there's also always the option to revert it, either to the pre-upgrade state (that's what the .bak files are for) or by reverting just a few segments in the current database.
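
A quick sketch of both escape hatches. The snap command is standard; the database path assumes the snap package, and the exact backup file names are an assumption that may vary by release:

```sh
# Roll the snap back to the previously installed revision (and its data snapshot)
sudo snap revert lxd

# The LXD database lives here under the snap; pre-upgrade backups sit alongside it
ls -l /var/snap/lxd/common/lxd/database/
# e.g. global/ (dqlite segments), local.db, and .bak copies made before an upgrade
```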

It's incredibly rare (as in, I've never seen it) that we can't recover a database. We've usually been able to provide a pretty quick fix, and at least one of those revert methods usually works.