canonical / pylxd

Python module for LXD
https://pylxd.readthedocs.io/en/latest/
Apache License 2.0

Not able to issue migration #315

Open gabrik opened 6 years ago

gabrik commented 6 years ago

Hi everyone,

I'm trying to live-migrate a very simple container (an empty alpine/edge one) between two different machines. If I use the LXD CLI the migration works fine, but if I try the pylxd API I get errors. The code is pretty simple:

from pylxd import Client

# client_local = Client()

client_local = Client(endpoint='https://127.0.0.1:8443',
                      cert=('lxd.crt', 'lxd.key'), verify=False)
client_local.authenticate('fog')
client_remote = Client(endpoint='https://192.168.1.106:8443',
                       cert=('lxd.crt', 'lxd.key'), verify=False)
print('Trusted {}'.format(client_remote.trusted))
client_remote.authenticate('fog')
print('Trusted {}'.format(client_remote.trusted))

cont = client_local.containers.get('testm')
cont.migrate(client_remote, wait=True)

I'm using an HTTPS endpoint for localhost as well, because otherwise I get

ValueError: Cannot migrate from a local client connection

The error that I get is this one:

Traceback (most recent call last):
  File "migration.py", line 13, in <module>
    cont.migrate(client_remote,wait=True)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pylxd/models/container.py", line 267, in migrate
    self.generate_migration_data(), wait=wait)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pylxd/models/container.py", line 278, in generate_migration_data
    operation = self.client.operations.get(response.json()['operation'])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pylxd/models/operation.py", line 38, in get
    return cls(_client=client, **response.json()['metadata'])
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pylxd/models/operation.py", line 43, in __init__
    setattr(self, key, value)
AttributeError: 'Operation' object has no attribute 'description'

Am I misusing the API?

ajkavanagh commented 6 years ago

Hi @gabrik

You're not misusing the API; unfortunately the version of pylxd you are using is not compatible with the version of lxd you are using. The AttributeError issue is fixed in pylxd 2.0.7 (for xenial) and 2.2.5+ for bionic.

Which version of lxd are you using (lxd --version), and which version of pylxd (pylxd.__version__)?
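
For reference, a quick way to print both from Python (just a minimal sketch, assuming the lxd binary is on the PATH):

import subprocess

import pylxd

# pylxd library version
print('pylxd', pylxd.__version__)

# LXD server version (equivalent to running `lxd --version` in a shell)
print('lxd', subprocess.check_output(['lxd', '--version']).decode().strip())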

gabrik commented 6 years ago

Hi @ajkavanagh

I'm using lxd 3.2 and pylxd 2.2.4 on Xenial. I think I have to update pylxd to fix this.

ajkavanagh commented 6 years ago

Yes, I'm pretty sure that 2.2.5 has a fix for this; the latest is 2.2.7, but it has an execute issue that I'm looking into, which became much more apparent with lxd 3.0.2. lxd 3.2 is less affected, but if you use lots of 'execute' calls from pylxd, then 2.2.6 might be better until I fix the issue and get 2.2.8 out.

gabrik commented 6 years ago

Thanks @ajkavanagh. Now I see the migration starting, but then I get an error on the websockets:

Traceback (most recent call last):
  File "migration.py", line 13, in <module>
    cont.migrate(client_remote,wait=True)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 318, in migrate
    self.generate_migration_data(), wait=wait)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 184, in create
    client.operations.wait_for_operation(response.json()['operation'])
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 32, in wait_for_operation
    operation.wait()
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 63, in wait
    raise exceptions.LXDAPIException(response)
pylxd.exceptions.LXDAPIException: Error transferring container data: websocket: bad handshake

Both nodes have each other in their list of remotes, but I don't think that is mandatory.

ajkavanagh commented 6 years ago

@gabrik The error is actually coming from LXD (i.e. it is being passed through pylxd directly from LXD). However, it could be because pylxd is sending the wrong data, or because lxd is not happy talking to the other lxd server.

Can you look at versions on the two servers? Are they mostly compatible?

gabrik commented 6 years ago

Thanks @ajkavanagh. The version of LXD is the same on both servers (3.2, installed using snap), and so is pylxd (2.2.6 on both). I can try using different versions of pylxd.

In /var/snap/lxd/common/lxd/lxd.log I see this when I issue the migration

destination node:

ephemeral=false lvl=info msg="Creating container" name=testm t=2018-07-06T12:55:08+0000
ephemeral=false lvl=info msg="Created container" name=testm t=2018-07-06T12:55:08+0000
lvl=warn msg="Unable to update backup.yaml at this time" name=testm t=2018-07-06T12:55:08+0000
lvl=warn msg="Unable to update backup.yaml at this time" name=testm t=2018-07-06T12:55:08+0000
err="websocket: bad handshake" lvl=eror msg="Error during migration sink" t=2018-07-06T12:55:08+0000
created=2018-07-06T12:55:08+0000 ephemeral=false lvl=info msg="Deleting container" name=testm t=2018-07-06T12:55:08+0000 used=1970-01-01T00:00:00+0000
created=2018-07-06T12:55:08+0000 ephemeral=false lvl=info msg="Deleted container" name=testm t=2018-07-06T12:55:08+0000 used=1970-01-01T00:00:00+0000

Nothing is written on the source node.

Obviously, if I try to sniff the traffic between the two nodes to see the messages, I cannot get anything because it is encrypted.

gabrik commented 6 years ago

@ajkavanagh I have found a way to make this work: I have to use the correct IP address when connecting to the source node as well, and that address has to be reachable from the destination node. So I have updated my code accordingly:

from pylxd import Client

# client_local = Client()

client_local = Client(endpoint='https://192.168.1.104:8443',
                      cert=('lxd.crt', 'lxd.key'), verify=False)
client_local.authenticate('fog')
client_remote = Client(endpoint='https://192.168.1.106:8443',
                       cert=('lxd.crt', 'lxd.key'), verify=False)
print('Trusted {}'.format(client_remote.trusted))
client_remote.authenticate('fog')
print('Trusted {}'.format(client_remote.trusted))

cont = client_local.containers.get('testm')
cont.migrate(client_remote, wait=True)

However, I still get this:

Traceback (most recent call last):
  File "migration.py", line 13, in <module>
    cont.migrate(client_remote,wait=True)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 426, in migrate
    self.generate_migration_data(), wait=wait)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 265, in create
    client.operations.wait_for_operation(response.json()['operation'])
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 32, in wait_for_operation
    operation.wait()
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 63, in wait
    raise exceptions.LXDAPIException(response)
pylxd.exceptions.LXDAPIException: The container is already running

and the container on the source node remains stopped; I think that is because the migrate method ends due to the exception above.

I think this should be well documented, can I add this example to the documentation?

ajkavanagh commented 6 years ago

I think this should be well documented, can I add this example to the documentation?

@gabrik yes, please! Please do submit a PR for any changes/additions to the documentation that you think might be helpful!

gabrik commented 6 years ago

@ajkavanagh

Even though the migration works, I still get this error:

Traceback (most recent call last):
  File "migration.py", line 13, in <module>
    cont.migrate(client_remote,wait=True)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 426, in migrate
    self.generate_migration_data(), wait=wait)
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/container.py", line 265, in create
    client.operations.wait_for_operation(response.json()['operation'])
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 32, in wait_for_operation
    operation.wait()
  File "/usr/local/lib/python3.5/dist-packages/pylxd/models/operation.py", line 63, in wait
    raise exceptions.LXDAPIException(response)
pylxd.exceptions.LXDAPIException: The container is already running

I think that comes, for some reason, from here: https://github.com/lxc/pylxd/blob/master/pylxd/models/container.py#L425

From what I understand, if the container is migrated while it is running, the create tries to start it again and so it returns this error. Am I correct? Do you think that adding a try...except could solve this?
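
Something like this is what I had in mind (just a rough sketch; matching on the error message is a guess, and the client/trust setup is the same as in my earlier snippets):

from pylxd import Client
from pylxd.exceptions import LXDAPIException

client_local = Client(endpoint='https://192.168.1.104:8443',
                      cert=('lxd.crt', 'lxd.key'), verify=False)
client_remote = Client(endpoint='https://192.168.1.106:8443',
                       cert=('lxd.crt', 'lxd.key'), verify=False)

cont = client_local.containers.get('testm')
try:
    cont.migrate(client_remote, wait=True)
except LXDAPIException as e:
    # LXD reports "The container is already running" even though the
    # container has actually been created and started on the destination,
    # so swallow only that specific error and re-raise anything else.
    if 'already running' not in str(e):
        raise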

ajkavanagh commented 5 years ago

I'm closing this as I think it is largely answered/covered by #319. Please feel free to re-open it if there is still an issue NOT related to the actual migration code.

Oussalvatore commented 5 years ago

Hi, I'm still facing this issue with LXD 3.0.3 LTS and the latest version of pylxd, running on Ubuntu 18.04.

ajkavanagh commented 5 years ago

@Oussalvatore Please could you:

a) Indicate the exact version of pylxd
b) Paste the error message/traceback here, please.

Thanks very much.

Oussalvatore commented 5 years ago

@ajkavanagh Sorry for the late reply,

a) Since I thought that the new "live" parameter would solve the problem, I'm currently using pylxd==2.2.10.dev17; however, I had the same problem with version 2.2.9, and I've just tried the newly released version 2.2.10, but the problem is still there.

b) 1- The error message is:

Common start logic: The container is already running

2- The execution traceback is:

Traceback (most recent call last):
  File "/home/ubuntu/drones-controllers-manager/lxd_manager.py", line 137, in migrate_ca
    c.migrate(destination,live=True, wait=True)
  File "/usr/local/lib/python3.6/dist-packages/pylxd/models/container.py", line 525, in migrate
    raise e
  File "/usr/local/lib/python3.6/dist-packages/pylxd/models/container.py", line 519, in migrate
    self.generate_migration_data(live), wait=wait)
  File "/usr/local/lib/python3.6/dist-packages/pylxd/models/container.py", line 279, in create
    client.operations.wait_for_operation(response.json()['operation'])
  File "/usr/local/lib/python3.6/dist-packages/pylxd/models/operation.py", line 47, in wait_for_operation
    operation.wait()
  File "/usr/local/lib/python3.6/dist-packages/pylxd/models/operation.py", line 88, in wait
    raise exceptions.LXDAPIException(response)
pylxd.exceptions.LXDAPIException: Common start logic: The container is already running

Also, below you'll find the output of "lxc monitor" running on the destination host:

metadata:
  context: {}
  level: dbug
  message: 'New event listener: 94ad2e05-f594-4fbf-840e-1242478114cf'
timestamp: "2019-06-05T08:47:29.999465304Z"
type: logging

metadata:
  context:
    k: "1"
  level: dbug
  message: Found cert
timestamp: "2019-06-05T08:47:47.734396336Z"
type: logging

metadata:
  context:
    ip: 195.148.125.92:46864
    method: GET
    url: /1.0
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:47.734477485Z"
type: logging

metadata:
  context:
    k: "1"
  level: dbug
  message: Found cert
timestamp: "2019-06-05T08:47:47.735176399Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Responding to container create
timestamp: "2019-06-05T08:47:47.945866934Z"
type: logging

metadata:
  context:
    k: "1"
  level: dbug
  message: Found cert
timestamp: "2019-06-05T08:47:47.945806315Z"
type: logging

metadata:
  context:
    ip: 195.148.125.92:46874
    method: POST
    url: /1.0/containers
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:47.945851191Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: No valid storage pool in the container's local root disk device and profiles
    found
timestamp: "2019-06-05T08:47:47.947067499Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: 'Database error: &errors.errorString{s:"sql: no rows in result set"}'
timestamp: "2019-06-05T08:47:47.947529002Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: 'Database error: &errors.errorString{s:"sql: no rows in result set"}'
timestamp: "2019-06-05T08:47:47.948167633Z"
type: logging

metadata:
  context:
    ephemeral: "false"
    name: first
  level: info
  message: Creating container
timestamp: "2019-06-05T08:47:47.984940899Z"
type: logging

metadata:
  action: container-updated
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:48.102538242Z"
type: lifecycle

metadata:
  action: container-updated
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:48.202111093Z"
type: lifecycle

metadata:
  context:
    ephemeral: "false"
    name: first
  level: info
  message: Created container
timestamp: "2019-06-05T08:47:48.205872819Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Creating empty ZFS storage volume for container "first" on storage pool
    "default"
timestamp: "2019-06-05T08:47:48.205899131Z"
type: logging

metadata:
  action: container-created
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:48.205888015Z"
type: lifecycle

metadata:
  context: {}
  level: dbug
  message: Mounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.608795363Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Created empty ZFS storage volume for container "first" on storage pool
    "default"
timestamp: "2019-06-05T08:47:48.608738233Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.61076933Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.719152991Z"
type: logging

metadata:
  action: container-updated
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:48.719140025Z"
type: lifecycle

metadata:
  context:
    name: first
  level: warn
  message: Unable to update backup.yaml at this time
timestamp: "2019-06-05T08:47:48.719118048Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.739067287Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.739037258Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.741146888Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.741026052Z"
type: logging

metadata:
  context:
    name: first
  level: warn
  message: Unable to update backup.yaml at this time
timestamp: "2019-06-05T08:47:48.741129721Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:48.763349767Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: 'Started task operation: 9c18a13b-31e4-4743-8a1b-498b7a53d88d'
timestamp: "2019-06-05T08:47:48.825045802Z"
type: logging

metadata:
  class: task
  created_at: "2019-06-05T08:47:48.801416962Z"
  description: Creating container
  err: ""
  id: 9c18a13b-31e4-4743-8a1b-498b7a53d88d
  may_cancel: false
  metadata: null
  resources:
    containers:
    - /1.0/containers/first
  status: Running
  status_code: 103
  updated_at: "2019-06-05T08:47:48.801416962Z"
timestamp: "2019-06-05T08:47:48.825054065Z"
type: operation

metadata:
  class: task
  created_at: "2019-06-05T08:47:48.801416962Z"
  description: Creating container
  err: ""
  id: 9c18a13b-31e4-4743-8a1b-498b7a53d88d
  may_cancel: false
  metadata: null
  resources:
    containers:
    - /1.0/containers/first
  status: Pending
  status_code: 105
  updated_at: "2019-06-05T08:47:48.801416962Z"
timestamp: "2019-06-05T08:47:48.825018853Z"
type: operation

metadata:
  context: {}
  level: dbug
  message: 'New task operation: 9c18a13b-31e4-4743-8a1b-498b7a53d88d'
timestamp: "2019-06-05T08:47:48.824995306Z"
type: logging

metadata:
  context:
    k: "1"
  level: dbug
  message: Found cert
timestamp: "2019-06-05T08:47:48.857158296Z"
type: logging

metadata:
  context:
    ip: 195.148.125.92:46876
    method: GET
    url: /1.0/operations/9c18a13b-31e4-4743-8a1b-498b7a53d88d
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:48.857176342Z"
type: logging

metadata:
  context:
    ip: 195.148.125.92:46878
    method: GET
    url: /1.0/operations/9c18a13b-31e4-4743-8a1b-498b7a53d88d/wait
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:48.889217113Z"
type: logging

metadata:
  context:
    k: "1"
  level: dbug
  message: Found cert
timestamp: "2019-06-05T08:47:48.889199496Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: About to receive rsync
timestamp: "2019-06-05T08:47:48.908377944Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: got message barrier
timestamp: "2019-06-05T08:47:49.337505622Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: sending write barrier
timestamp: "2019-06-05T08:47:49.506205757Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Got message barrier, resetting stream
timestamp: "2019-06-05T08:47:49.506939152Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Done receiving from rsync
timestamp: "2019-06-05T08:47:49.506976482Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: About to receive header
timestamp: "2019-06-05T08:47:49.507187591Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: About to receive rsync
timestamp: "2019-06-05T08:47:49.508017479Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: sending write barrier
timestamp: "2019-06-05T08:47:49.645647764Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Done receiving from rsync
timestamp: "2019-06-05T08:47:49.646216976Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Got message barrier, resetting stream
timestamp: "2019-06-05T08:47:49.646177408Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: About to receive header
timestamp: "2019-06-05T08:47:49.64640338Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: sending write barrier
timestamp: "2019-06-05T08:47:51.091683528Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Got message barrier, resetting stream
timestamp: "2019-06-05T08:47:51.092271262Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: got message barrier
timestamp: "2019-06-05T08:47:51.203550825Z"
type: logging

metadata:
  context:
    actionscript: "false"
    created: 2019-06-05 08:47:47 +0000 UTC
    ephemeral: "false"
    features: "0"
    name: first
    predumpdir: ""
    statedir: /tmp/lxd_restore_918693111
    stop: "false"
    used: 1970-01-01 00:00:00 +0000 UTC
  level: info
  message: Migrating container
timestamp: "2019-06-05T08:47:51.638953976Z"
type: logging

metadata:
  action: container-updated
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:51.677889197Z"
type: lifecycle

metadata:
  context: {}
  level: dbug
  message: Mounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.679747414Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.679553975Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.682401055Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.798056689Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.831314572Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.833476654Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.83733693Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Unmounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.930825879Z"
type: logging

metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:51.951167443Z"
type: logging

metadata:
  context:
    ip: '@'
    method: GET
    url: /internal/containers/78/onstart
  level: dbug
  message: handling
timestamp: "2019-06-05T08:47:51.953908613Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounting ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.957812509Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: Mounted ZFS storage volume for container "first" on storage pool "default"
timestamp: "2019-06-05T08:47:51.96214717Z"
type: logging

metadata:
  context: {}
  level: dbug
  message: 'Scheduler: container first started: re-balancing'
timestamp: "2019-06-05T08:47:52.122993177Z"
type: logging

metadata:
  context:
    actionscript: "false"
    created: 2019-06-05 08:47:47 +0000 UTC
    ephemeral: "false"
    features: "0"
    name: first
    predumpdir: ""
    statedir: /tmp/lxd_restore_918693111
    stop: "false"
    used: 1970-01-01 00:00:00 +0000 UTC
  level: info
  message: Migrated container
timestamp: "2019-06-05T08:47:52.708049284Z"
type: logging

metadata:
  action: container-updated
  source: /1.0/containers/first
timestamp: "2019-06-05T08:47:52.801985601Z"
type: lifecycle

metadata:
  context: {}
  level: dbug
  message: 'Failure for task operation: 9c18a13b-31e4-4743-8a1b-498b7a53d88d: Common
    start logic: The container is already running'
timestamp: "2019-06-05T08:47:52.802075959Z"
type: logging

metadata:
  class: task
  created_at: "2019-06-05T08:47:48.801416962Z"
  description: Creating container
  err: 'Common start logic: The container is already running'
  id: 9c18a13b-31e4-4743-8a1b-498b7a53d88d
  may_cancel: false
  metadata: null
  resources:
    containers:
    - /1.0/containers/first
  status: Failure
  status_code: 400
  updated_at: "2019-06-05T08:47:48.801416962Z"
timestamp: "2019-06-05T08:47:52.802094625Z"
type: operation

Thank you :)

ajkavanagh commented 5 years ago

I've checked the code, and it's not obvious what the error might be. Which version of LXD are you using (on both source and target)? Are they clustered (i.e. via LXD clustering) or standalone?

The trace seems to indicate that the container was already started on the target. Is it a naming issue (i.e. could there already have been a container with the same name on the target)?
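
If it helps, something like this would show whether the name is already taken on the target before you migrate (just a sketch; the endpoint and certificate paths are placeholders for your setup):

from pylxd import Client

# Placeholders: point this at the destination host with your own certs.
target = Client(endpoint='https://target-host:8443',
                cert=('lxd.crt', 'lxd.key'), verify=False)

# List the container names already present on the target.
existing = [c.name for c in target.containers.all()]
print('first' in existing, existing)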

Oussalvatore commented 5 years ago

@ajkavanagh I'm using LXD 3.0.3 LTS in standalone mode. I'm sure that the container doesn't exist on the target host. It would be great if you could reproduce the error with the LTS version of LXD.

ajkavanagh commented 5 years ago

@Oussalvatore thanks for the additional info. I'll have to sort out a lab for testing this (it's re-purposed at present), so any additional debug you can offer would be useful, e.g. does it work with straight lxc commands at the CLI, etc.?