canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Unable to import containers with multiple storage pools attached #6914

Closed · chong601 closed this issue 4 years ago

chong601 commented 4 years ago


Issue description

To avoid issue #6913, I decided to downgrade from the latest LXD to 3.20. However, when I attempted to import the existing container after uninstalling LXD 3.21 and installing LXD 3.20, I got the following error:

```
Error: Create container: Invalid devices: Device validation failed "osm-postgres-index": The "secondary-fio-cache" storage pool doesn't exist
```

Attempted fixes:

The only alternative is to lose the dataset and recreate it from scratch.

Let me know if you require more information.

Steps to reproduce

  1. Install LXD 3.21
  2. Initialize LXD with ZFS pool
  3. Create at least one storage pool
  4. Create custom volume
  5. Create a container
  6. Attach the created custom volume to the container
  7. Shut down (or stop) container
  8. snap remove lxd
  9. snap install lxd --channel=3.20/stable
  10. Mount all datasets used by the container
  11. sudo lxd import <container> (a condensed shell version of these steps follows the list)
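
A condensed shell version of these steps, as referenced above. The pool, volume, container, and dataset names here are hypothetical stand-ins (the reporter's real names appear later in the thread), and the image alias is just an example:

```sh
# 1-2: install LXD 3.21 and initialize it with a ZFS pool
sudo snap install lxd --channel=3.21/stable
sudo lxd init --auto --storage-backend=zfs

# 3-4: a second storage pool backed by an existing dataset, plus a custom volume on it
lxc storage create secondary zfs source=tank/lxd-secondary
lxc storage volume create secondary cachevol

# 5-7: create a container, attach the custom volume, stop the container
lxc launch ubuntu:18.04 c1
lxc config device add c1 cachevol disk pool=secondary source=cachevol path=/mnt/cache
lxc stop c1

# 8-9: swap the snap for 3.20
sudo snap remove lxd
sudo snap install lxd --channel=3.20/stable

# 10-11: mount the backing ZFS datasets, then import
sudo zfs mount -a
sudo lxd import c1
```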

Information to attach

 - [ ] Output of the daemon with --debug (alternatively output of `lxc monitor` while reproducing the issue)

```
location: none
metadata:
  context: {}
  level: dbug
  message: 'New event listener: 2707f803-1c1d-41a5-9644-194e4c0a5b97'
timestamp: "2020-02-21T21:35:37.914805666+08:00"
type: logging

location: none
metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0
    user: ""
  level: dbug
  message: Handling
timestamp: "2020-02-21T21:35:56.24631439+08:00"
type: logging

location: none
metadata:
  context:
    ip: '@'
    method: POST
    url: /internal/containers
    user: ""
  level: dbug
  message: Handling
timestamp: "2020-02-21T21:35:56.249043934+08:00"
type: logging

location: none
metadata:
  context: {}
  level: dbug
  message: 'Database error: &errors.errorString{s:"No such object"}'
timestamp: "2020-02-21T21:35:56.251948143+08:00"
type: logging
```

stgraber commented 4 years ago

Yes, that's normal. lxd import only knows about instances; it doesn't know about custom storage volumes, networks, images, ...

Did that lxd import attempt re-create the storage pool itself (as in, visible in lxc storage list)?

chong601 commented 4 years ago

The storage pool containing the container did get recreated, but not the other one (secondary-fio-cache):

```
+-----------------------+-------------+--------+--------------------------------+---------+
|         NAME          | DESCRIPTION | DRIVER |             SOURCE             | USED BY |
+-----------------------+-------------+--------+--------------------------------+---------+
| secondary-fio-storage |             | zfs    | secondary-fio-storage/lxd-area | 0       |
+-----------------------+-------------+--------+--------------------------------+---------+
```

stgraber commented 4 years ago

Yeah, that makes sense; the instance wasn't on the cache pool, so it didn't get re-created. Is the secondary disk on that instance on the cache pool or on the storage pool?

chong601 commented 4 years ago

Here's the layout of the container

```
secondary-fio-cache:
    - secondary-fio-cache/lxd-area/custom/osm-postgres-cache
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-cache/custom/osm-postgres-cache
    - secondary-fio-cache/lxd-area/custom/osm-postgres-index
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-cache/custom/osm-postgres-index
secondary-fio-storage:
    - secondary-fio-storage/lxd-area/containers/osm-postgres
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-storage/containers/osm-postgres
    - secondary-fio-storage/lxd-area/custom/osm-postgres-db-storage
        mounted on /var/snap/lxd/common/lxd/storage-pools/secondary-fio-storage/custom/osm-postgres-db-storage
```

(This is based on the zfs list output)
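
For reference, that layout can be pulled with a standard zfs invocation along these lines:

```sh
# Dataset names and mountpoints for both pools, recursively
zfs list -r -o name,mountpoint secondary-fio-cache secondary-fio-storage
```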

stgraber commented 4 years ago

Ok, so yeah, there's no way to automatically import this. Your best bet is to either restore a partial database backup or manually re-add secondary-fio-cache to the database, along with the custom volumes in the storage_volumes table.
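
To make "manually re-add" concrete, here is a hedged sketch using LXD's built-in database shell (lxd sql). The exact table and column names, and the type=2 code for custom volumes, are assumptions about the 3.x schema rather than confirmed details: dump the schema first and back up /var/snap/lxd/common/lxd/database before writing anything.

```sh
# Inspect the real schema; the INSERTs below assume column names taken from it.
lxd sql global .schema

# Assumed sketch: re-create the missing pool row, then one custom volume row.
# (Pool config such as the zfs "source" lives in a separate *_config table.)
lxd sql global "INSERT INTO storage_pools (name, driver, description, state)
                VALUES ('secondary-fio-cache', 'zfs', '', 1);"
lxd sql global "INSERT INTO storage_volumes (name, storage_pool_id, node_id, type, description)
                VALUES ('osm-postgres-index',
                        (SELECT id FROM storage_pools WHERE name = 'secondary-fio-cache'),
                        1, 2, '');"  # type=2 is assumed to mean 'custom'
```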

Once those are sorted, the containers themselves should import just fine.

I did put import/export infrastructure for custom storage volumes on our backlog, so in the near future there should be a better way both to handle disaster recovery for those and to export them as tarballs, similar to what containers support today.

chong601 commented 4 years ago

Ah well, I guess I have to pray that the snapshotted snap data for LXD is salvageable (this server got hit by issue #6913 and was recently recovered by nuking everything). I had planned to just re-import, but that doesn't work because lxd init --preseed doesn't allow creating storage pools from existing datasets.
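
For context, the preseed in question would look roughly like this: a minimal sketch using the pool and dataset names from this thread. On LXD 3.x, initialization fails when the zfs source dataset is already populated, which is the limitation described above.

```yaml
# fed to: lxd init --preseed < preseed.yaml
storage_pools:
- name: secondary-fio-cache
  driver: zfs
  config:
    source: secondary-fio-cache/lxd-area
```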

The main take-away of this issue: lxd import only recovers instances, so additional storage pools and the custom volumes on them have to be re-added to the database manually.

This issue is okay to close as of now.

stgraber commented 4 years ago

In this case, a simple snap revert lxd would have gotten you back on the previous working revision; that's usually worth a shot when something like this happens.

If something weird happens with the database, there's also always the option to revert it, either to the pre-upgrade state (that's what the .bak files are for) or by reverting just a few segments in the current database.
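
A quick sketch of both escape hatches. The snap command is standard; the database path assumes the snap package, and the exact backup file names are an assumption that may vary by release:

```sh
# Roll the snap back to the previously installed revision (and its data snapshot)
sudo snap revert lxd

# The LXD database lives here under the snap; pre-upgrade backups sit alongside it
ls -l /var/snap/lxd/common/lxd/database/
# e.g. global/ (dqlite segments), local.db, and .bak copies made before an upgrade
```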

It's incredibly rare (as in, I've never seen it) that we can't recover a database. We've usually been able to provide a pretty quick fix, and at least one of those revert methods usually works.