azukiapp / azk

azk is a lightweight open source development environment orchestration tool. Instantly & safely run any environment on your local machine.
http://azk.io
Apache License 2.0

Azk agent not running #525

Closed teodor-pripoae closed 8 years ago

teodor-pripoae commented 8 years ago

I have a few running services inside azk vm and a few azk shells started, which I can access.

The HTTP load balancer stopped working (connection reset by peer), and azk status says that the agent is not running. However, the containers are up and running. Any ideas? In the last few days I have had a lot of problems with azk (the last 2 versions, 0.14.4 and 0.15.0); the VM stopped completely on system sleep twice.

Before, on azk 0.12.1 everything was fine.

Are there any logs I can paste here?

$ azk status
? The agent is not running, would you like to start it? No
azk: azk agent is required but is not running (try `azk agent status`)

fearenales commented 8 years ago

Hi @teodor-pripoae ! Thanks for your feedback!

Sorry about this. The root cause is probably a routine inside azk that checks whether the Docker daemon is up and, if it is not, stops the azk agent.

This was included in #479 and #493.

The goal of those PRs was exactly the opposite: once the agent's components weren't working properly, the agent should shut itself down and give the user a chance to bring it up again, instead of leaving them with weird error messages.

For now, we recommend running azk agent with the --no-daemon option in a separate terminal tab. That way, if the agent stops, you will notice and can start it again.

We'll prioritize fixing this.

teodor-pripoae commented 8 years ago

Hi,

Thanks for your fix! What is the recommended way to do it?

# Before I was doing:
$ azk agent start
$ azk start ....

# Now like this ?
$ azk agent --no-daemon # this will keep daemon in foreground ?
$ azk start ...

Is this related to the bug that caused high CPU usage in the VM when monitoring Docker? That one seems fixed: since upgrading, my VM uses under 20% CPU with 31 services running, but it keeps stopping.

Could this issue be related to running a lot of services? Does it hit some timeout when checking each service from Docker?

nuxlli commented 8 years ago

@fearenales in fact the current approach isn't the best. There are situations in which VirtualBox can suspend the VM for a while or the Docker service can go down.

The flow should be modified to:

Important:

teodor-pripoae commented 8 years ago

@nuxlli Is there a place where I can patch out these checks until a fix is released?

I never restart docker service inside the vm, and I can wait a little after I wake the system from sleep.

I don't know if these checks existed back in 0.12.1, but I never had a problem with Docker restarting suddenly. I guess these checks are needed for the Linux version of azk, but where can I remove them temporarily until the next release?

fearenales commented 8 years ago

@teodor-pripoae Yes, --no-daemon will keep the agent in the foreground, so you'll need to use another terminal to run azk start.

Yes, we fixed the high CPU usage in azk v0.14.5, but the main issue is the poor check strategy pointed out by @nuxlli.
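
To make the difference between the two check strategies concrete, here is a minimal sketch. It is not azk's actual code: `createWatchdog`, `report` and the failure threshold are hypothetical names and values chosen for illustration. It compares a watchdog that stops on the first failed Docker check with one that tolerates a few consecutive failures, for example while the VM is briefly suspended.

```javascript
// Hypothetical sketch; none of these names come from azk's code base.
// A watchdog that tolerates a number of consecutive failed health checks
// before deciding that the agent should stop.
function createWatchdog(maxConsecutiveFailures) {
  let failures = 0;
  let stopped = false;
  return {
    // Record the result of one health check (true = Docker answered).
    // Returns true while the agent should keep running.
    report(healthy) {
      if (stopped) return false;
      failures = healthy ? 0 : failures + 1;
      if (failures >= maxConsecutiveFailures) stopped = true;
      return !stopped;
    },
  };
}

// Simulated check results: two failures in a row, then recovery.
const checks = [true, false, false, true];

const strict = createWatchdog(1);   // stops on the first failed check
const tolerant = createWatchdog(3); // stops only after three failures in a row

const strictAlive = checks.map((ok) => strict.report(ok));
const tolerantAlive = checks.map((ok) => tolerant.report(ok));

console.log(strictAlive);   // [ true, false, false, false ]
console.log(tolerantAlive); // [ true, true, true, true ]
```

With a threshold of 1 the agent is gone after the first hiccup and stays down even once Docker answers again; with a higher threshold a short outage is absorbed. The right threshold and polling interval would depend on how long a suspended VM typically takes to come back.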

fearenales commented 8 years ago

@teodor-pripoae you can do the following patch:

After doing this, you should be able to run make and use the azk executable placed in the bin directory of the azk project dir (it's a good idea to create a new alias for it and use that instead of the normal azk in the meantime).

This should work, and it would be awesome if you could send your solution as a Pull Request!

Any issue or concern, please let me know!

teodor-pripoae commented 8 years ago

Cool ! Thank you, I will try this in a few hours and submit a PR if it works :)

fearenales commented 8 years ago

Hey @teodor-pripoae , just checked out your PR. Great job! Did that solve your problem?

teodor-pripoae commented 8 years ago

Yes, azk hasn't stopped the VM yet, so it seems to work. :)

fearenales commented 8 years ago

I've just run my batch of tests and everything seems ok.

teodor-pripoae commented 8 years ago

@fearenales

Btw, do you know why the test suite gives me this error? Do I need Linux?

$ azk nvm npm test

> azk@0.15.0 test /Users/toni/code/gh/azk
> make test

task: test
/Users/toni/code/gh/azk/bin/azk nvm gulp test  ""
[16:48:30] Using gulpfile ~/code/gh/azk/gulpfile.js
make: *** [test] Error 1
npm ERR! Test failed.  See above for more details.

fearenales commented 8 years ago

@teodor-pripoae Use this to run the test suite:

$ azk nvm gulp test --slow

I'm sorry for not telling you before.

teodor-pripoae commented 8 years ago

Thanks!

Everything is green except port binding (I already had the agent service bound to that port) and file syncing. But I guess those problems lie elsewhere. I will investigate later; the main bug hasn't happened for 12 hours, so I guess it's OK.

  375 passing (2m)
  1 pending
  4 failing

  1) Azk docker module, run method @slow should support bind ports:
     Error: HTTP code is 500 which indicates error: server error - Cannot start container 9d1ebc5102895ccfdc2dbef55883b986b1e7ffdd19d67449baa3fdb4bf46de40: Error starting userland proxy: listen tcp 0.0.0.0:32777: bind: address already in use

      at _stream_readable.js:944:16

2) Azk sync, Worker module should not include content patterns files from except_from option:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492228zn1s/bar/Fred.txt' not to match /bar\/Fred.txt/
      at Test.callee$1$1$ (/azk:0.15.0/spec/sync/worker_spec.js:216:27)

  3) Azk sync, Worker module should exclude the .gitignore content for default:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492n1w3zrk/ignored/Fred.txt' not to match /ignored\/Fred.txt/
      at Test.callee$1$2$ (/azk:0.15.0/spec/sync/worker_spec.js:241:27)

  4) Azk sync, Worker module should exclude the .syncignore content for default in preference to .gitignore:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492cqjv6qo/foo/Moe.txt' to match /ignored\/Fred.txt/
      at Test.callee$1$3$ (/azk:0.15.0/spec/sync/worker_spec.js:267:23)

fearenales commented 8 years ago

Hmm... which version of rsync are you using? Those errors are odd.

teodor-pripoae commented 8 years ago

Running on OS X. I didn't stop my services or VM when running the tests. Is this required?

$ rsync --version
rsync  version 3.1.1  protocol version 31
Copyright (C) 1996-2014 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, no prealloc, file-flags

fearenales commented 8 years ago

The same thing happened on our CI box. I'm going to dig deeper into it and get back to you ASAP.

teodor-pripoae commented 8 years ago

I ran it again and now there is only one error. It seems to be intermittent.

1) Azk sync, Worker module should not include content patterns files from except_from option:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-562383t7d56o/ignored/Fred.txt' not to match /ignored\/Fred.txt/
      at Test.callee$1$1$ (/azk:0.15.0/spec/sync/worker_spec.js:215:27)

fearenales commented 8 years ago

Yes, I've run it again on the CI box and everything passed... It's an intermittent error :/ Well, don't worry, I don't think your changes introduced it, but I'll take a closer look at this when I have a chance.

Thank you very much for your PR, we really appreciate it.

teodor-pripoae commented 8 years ago

Thank you for your help, too :)