flynn / flynn

[UNMAINTAINED] A next generation open source platform as a service (PaaS)
https://flynn.io
BSD 3-Clause "New" or "Revised" License

host: Running jobs stuck in "starting" state #907

Closed: titanous closed this issue 8 years ago

titanous commented 9 years ago
++ 23:11:26.811 /home/ubuntu/go/src/github.com/flynn/flynn/test/../cli/flynn-cli -a starlings-humble-arnprior scale --no-wait env=1
scaling env: 0=>1

++ 23:11:26.904 waiting for job events: map[env:map[up:1]]
++ 23:11:26.928 got job event: env d9adc29f-e8c86151cc70472c9b195a77fd4ab885 starting
test_cli.go:533:
    _, jobID := app.waitFor(jobEvents{"env": {"up": 1}})
test_scheduler.go:101:
    t.Fatal("timed out waiting for job events: ", expected)
... Error: timed out waiting for job events: map[env:map[up:1]]

https://s3.amazonaws.com/flynn-ci-logs/20150130231002-b003fe21-build-376e150ab4de7365232b447e3333080047a3d17c-2015-01-30-23-15-07.txt
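
For context, the failing assertion boils down to a wait-for-job-events loop with a deadline: the test expects an "up" event for the env process and fails if only "starting" ever arrives. Below is a minimal Go sketch of that pattern; the JobEvent type and waitForUp helper are hypothetical names for illustration, not the actual test harness code.

package main

import (
	"fmt"
	"time"
)

// JobEvent is a simplified stand-in for the events the test harness
// receives from the controller (hypothetical type, for illustration only).
type JobEvent struct {
	Type  string // process type, e.g. "env"
	State string // "starting", "up", ...
	JobID string
}

// waitForUp blocks until a job of the given type reports "up", or gives up
// after the timeout, mirroring the "timed out waiting for job events"
// failure in the CI log above.
func waitForUp(events <-chan JobEvent, typ string, timeout time.Duration) (string, error) {
	deadline := time.After(timeout)
	for {
		select {
		case e := <-events:
			if e.Type == typ && e.State == "up" {
				return e.JobID, nil
			}
			// A job stuck in "starting" never emits "up", so the loop
			// keeps waiting here until the deadline fires.
		case <-deadline:
			return "", fmt.Errorf("timed out waiting for job events: %s up", typ)
		}
	}
}

func main() {
	events := make(chan JobEvent, 1)
	// Simulate the failure mode seen above: only a "starting" event arrives.
	events <- JobEvent{Type: "env", State: "starting", JobID: "e8c86151cc70472c9b195a77fd4ab885"}

	if _, err := waitForUp(events, "env", 2*time.Second); err != nil {
		fmt.Println("Error:", err)
	}
}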

titanous commented 9 years ago

Looks like e8c86151cc70472c9b195a77fd4ab885 got stuck in the starting state. containerinit Resume was called (or at least attempted), but there are no log messages after that.
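
To make the handoff clearer: the host resumes the container's init process and then waits for the job's process to report that it is running before the state can move past "starting". The sketch below is hypothetical (initClient and startJob are illustrative names, not Flynn's actual containerinit API); it just shows where a silent hang after Resume leaves a job parked in "starting".

package main

import (
	"fmt"
	"time"
)

// initClient is a hypothetical stand-in for the host's connection to a
// container's init process; it is not Flynn's actual containerinit API.
type initClient struct {
	resumed chan struct{} // closed once Resume has been delivered
	running chan struct{} // closed once the job's process reports it is up
}

func (c *initClient) Resume() error {
	close(c.resumed)
	return nil
}

// startJob resumes the container and then waits for confirmation that the
// process is running. If the container goes silent after Resume, as in the
// log above, nothing ever closes c.running, and the job never leaves
// "starting" unless something enforces a deadline like the one below.
func startJob(c *initClient, jobID string) error {
	if err := c.Resume(); err != nil {
		return fmt.Errorf("resume %s: %w", jobID, err)
	}
	select {
	case <-c.running:
		return nil // the job can now be marked "up"
	case <-time.After(3 * time.Second):
		return fmt.Errorf("job %s stuck in starting state", jobID)
	}
}

func main() {
	c := &initClient{resumed: make(chan struct{}), running: make(chan struct{})}
	// Simulate the failure: the container never signals that it is running.
	fmt.Println(startJob(c, "e8c86151cc70472c9b195a77fd4ab885"))
}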

titanous commented 9 years ago

Looks similar: https://s3.amazonaws.com/flynn-ci-logs/20150204072119-5e0e9d49-build-e74cfcb4b41b5525bd667cfba376e9387b98f9c2-2015-02-04-07-26-33.txt

titanous commented 9 years ago

Might be similar: https://s3.amazonaws.com/flynn-ci-logs/20150204183814-94124f24-build-5638a35ac18798e82fc95aeed0e1d98d2f4ab26b-2015-02-04-18-44-32.txt

lmars commented 9 years ago

Some similar failures:

In both cases, the jobs have output, so they are actually running.

lmars commented 9 years ago

Another example: https://ci.flynn.io/builds/20150311205929-abfc609f

jvatic commented 9 years ago

https://ci.flynn.io/builds/20150314014931-8761e211

titanous commented 9 years ago

Closing; this test failure does not appear in the CI logs dating back to May 1st.

lmars commented 8 years ago

This happened in the following build: https://ci.flynn.io/builds/20150928233104-066e9837

See job 99246a34-fd75e9ef-4988-4b9a-934f-34590f161fa9, which was stuck in the starting state, causing TestKeyRotation to time out waiting for a deployment.

lmars commented 8 years ago

Although the underlying issue is not fixed, I've added a workaround in #2325.

Will re-open if I see this again.

dottodot commented 8 years ago

I'm suffering from this issue too and am not sure if it's what is preventing me from fixing my cluster. I've tried stopping the stuck jobs, but it has no effect:

hosting2-3f1cff2c-dee4-4cae-8532-df8115c55b00  running   29 minutes ago  postgres            postgres
hosting2-10f6864f-f44b-4328-b76e-03569f227924  running   30 minutes ago  discoverd           app
hosting2-07c27155-a8bc-4a2f-816a-fbfc42b6c80d  running   30 minutes ago  flannel             app
hosting3-fcdfb15c-e13e-4f39-a847-e0c8c7d465ea  running   30 minutes ago  discoverd           app
hosting3-aa8f6864-5b9c-4be5-a3c0-1411608ed471  running   30 minutes ago  flannel             app
hosting1-7c89111f-debb-4d27-af60-e4ad0d4311ac  running   2 days ago      dottodot-prerender  web
hosting1-c63b5d0c-d69f-4510-86a9-92d1238c237a  running   2 days ago      controller          web
hosting1-5b6cb230-3a54-43b9-b2ce-75be5baafd47  running   2 days ago      dashboard           web
hosting2-79671455-737a-4307-add5-45dbe97a8fb0  running   2 days ago      router              app
hosting1-6dc45290-b9ab-4bc3-a98b-bcd38f4f2ccd  running   2 days ago      router              app
hosting3-acf80ff9-8fd6-46c2-8caa-cdd98ed91d04  running   2 days ago      router              app
hosting3-87bbc5c2-0495-48c2-9f22-15506335c4ee  running   2 days ago      controller          web
hosting1-a4d9a3ed-822c-4ba8-8c1c-24a321e273c0  starting                  postgres            postgres
hosting1-7df67f67-c3fc-4b99-8ded-4b4c48bd9f7c  starting                  postgres            postgres
hosting1-91caf687-b275-4b05-a013-6072685ab9e2  starting                  flannel             app
hosting1-48034bf6-70d9-4b38-90df-c9fc8803f5d8  starting                  postgres            postgres
hosting1-01b87eed-d71d-4415-8298-e41d072fa919  starting                  flannel             app
hosting1-5c91b206-979f-4c32-afb3-8dbeb1328943  starting                  postgres            postgres
hosting1-74dc555b-c382-47a1-b374-575507d6d24b  starting                  postgres            postgres
hosting1-b054022e-ceac-4c53-801e-6d5d1921c0a4  starting                  controller          scheduler
hosting1-3abe6980-d8dc-4a8f-bd31-e7f0370e9540  starting                  postgres            postgres
hosting1-43a3c4bb-989a-4931-9e66-e7605bd72b8a  starting                  postgres            postgres
hosting1-99ce7e93-ccd2-45be-99d7-4e85396b39f7  starting                  postgres            postgres
hosting1-66817d44-6732-4826-8f2c-1cdd9d3f8dcf  starting                  controller          scheduler
hosting1-6fc55d13-60a7-42e6-907e-c6752e0b5019  starting                  controller          scheduler
hosting1-c34831e8-9f8e-4fd9-a627-91632d1e3764  starting                  controller          scheduler
hosting1-c666b024-55ff-41b1-8c9c-25081a581364  starting                  controller          scheduler
hosting1-d54a72a9-91f0-4817-ada8-91236c483936  starting                  flannel             app
hosting1-aabf8bc9-a460-4d4d-b89f-a536d6eaedce  starting                  flannel             app
hosting1-31493324-0cd5-48f7-aeab-a60701d21af4  starting                  postgres            postgres
hosting1-34b10d88-a9d8-4512-99ec-a724c8f922e2  starting                  controller          scheduler
hosting1-97a46b49-862b-42d4-81b9-b02a96c5870b  starting                  postgres            postgres
lmars commented 8 years ago

@dottodot that looks like a potential scheduler issue. Can you please run sudo flynn-host collect-debug-info on each node in your cluster, then open a new issue with your comment above and the resulting gist links?
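
For anyone hitting the same thing: a small Go helper like the one below could run that command across the cluster and gather the gist links in one place. The node names are placeholders for this example (a plain ssh loop in a shell works just as well).

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder node names matching the job prefixes above; substitute
	// the real SSH addresses of each cluster member.
	nodes := []string{"hosting1", "hosting2", "hosting3"}

	for _, node := range nodes {
		// collect-debug-info gathers logs on the node and prints a gist
		// link that can be attached to the new issue.
		out, err := exec.Command("ssh", node, "sudo flynn-host collect-debug-info").CombinedOutput()
		fmt.Printf("=== %s ===\n%s", node, out)
		if err != nil {
			fmt.Println("error:", err)
		}
	}
}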