NixOS / nixops-aws

GNU Lesser General Public License v3.0

nixops destroy frequently fails #58

Open andy-dean opened 6 years ago

andy-dean commented 6 years ago

I occasionally get an error when running "nixops destroy", in roughly 10-20% of runs. When the error occurs, I am left with a volume that is no longer attached to any EC2 instance.

Here's the command I use:

nixops destroy -d some-deploy --confirm

And here is the console output I see:

warning: are you sure you want to destroy EC2 machine ‘machine’? (y/N) y
machine> destroying EC2 machine... [shutting-down] [shutting-down] [shutting-down] [shutting-down] Traceback (most recent call last):
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/bin/..nixops-wrapped-wrapped", line 951, in <module>
    args.op()
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/bin/..nixops-wrapped-wrapped", line 400, in op_destroy
    wipe=args.wipe)
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/deployment.py", line 1073, in destroy_resources
    self._destroy_resources(include, exclude, wipe)
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/deployment.py", line 1067, in _destroy_resources
    nixops.parallel.run_tasks(nr_workers=-1, tasks=self.resources.values(), worker_fun=worker)
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/parallel.py", line 41, in thread_fun
    result_queue.put((worker_fun(t), None))
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/deployment.py", line 1060, in worker
    if m.destroy(wipe=wipe): self.delete_resource(m)
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/backends/ec2.py", line 1257, in destroy
    instance = self._get_instance(update=True)
  File "/nix/store/0h2c0k9mr8y5pvjd3ml30ms5rdf4kia1-nixops-1.5.1/lib/python2.7/site-packages/nixops/backends/ec2.py", line 285, in _get_instance
    assert instance_id
AssertionError

coretemp commented 6 years ago

From what I can see, self.vm_id is probably None, which causes the "assert instance_id" check in _get_instance to fail. As I see it, the root cause is that vm_id is undocumented. Additionally, nixops appears to assume that AWS API calls always return an answer, which is not the case.
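
To illustrate the suspected failure mode, here is a minimal sketch. It is not the nixops code: the Machine class, the MissingInstanceId exception, and both method names are hypothetical, and it assumes that instance_id is derived from self.vm_id. It only shows why a bare assert yields an AssertionError with no message, and how an explicit check could report the problem more clearly.

```python
class MissingInstanceId(Exception):
    """Hypothetical error raised when no instance ID is recorded."""


class Machine:
    """Hypothetical stand-in for a machine resource; not the nixops class."""

    def __init__(self, name, vm_id=None):
        self.name = name
        self.vm_id = vm_id  # assumed to hold the EC2 instance ID, or None

    def get_instance_strict(self):
        # Same pattern as the failing line in the traceback:
        # a bare assert that raises AssertionError without a message.
        instance_id = self.vm_id
        assert instance_id
        return {"id": instance_id}

    def get_instance_checked(self):
        # Explicit check that explains what went wrong instead.
        if not self.vm_id:
            raise MissingInstanceId(
                "no instance ID recorded for machine %r" % self.name
            )
        return {"id": self.vm_id}
```

Whether a missing ID should abort the destroy or be treated as "already gone" is a separate policy question that this sketch does not answer.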

If your nixops deployment grows towards thousands of machines, the probability of a failed deployment approaches 1.
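
The arithmetic behind that claim can be sketched as follows, assuming each API call fails independently with the same small probability. The 0.1% per-call failure rate is a made-up figure for illustration, not a measured AWS number, and prob_any_failure is a hypothetical helper.

```python
def prob_any_failure(p_single, n_calls):
    """Probability that at least one of n_calls independent calls fails."""
    return 1.0 - (1.0 - p_single) ** n_calls


# Illustrative only: 0.001 is an assumed per-call failure rate.
for n in (1, 10, 100, 1000, 10000):
    print(n, prob_any_failure(0.001, n))
```

Under these assumptions the chance of a fully clean run shrinks geometrically with the number of calls, which is why a tool that never retries becomes unreliable at scale.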

AWS APIs are rate-limited, but nothing in nixops tries to cope with failed calls. Its many failure modes currently limit how much automation nixops can provide.
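
One common way to cope with rate limiting is to retry with exponential backoff and jitter. The following is a generic sketch of that technique, not something nixops does or an AWS SDK API: TransientError and call_with_retries are hypothetical names, and the attempt count and delays are arbitrary defaults.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a throttling or timeout error."""


def call_with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Upper bound doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter so parallel workers do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

A caller would wrap each remote call, for example call_with_retries(lambda: describe(instance_id)), where describe is whatever function performs the request. Only errors known to be transient should be retried; anything else should propagate immediately.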