celery tasks on beta are having the connection reset by peer

btbonval commented 9 years ago

Staging system is getting connection reset by peer on tweet_note and update-scoreboard.

I thought twitter was causing problems, but update-scoreboard does not reach out to any external servers besides the database.

For update-scoreboard and tweet_note, the problem occurs in Kombu/connection which occurs in AMQP/connection. This would seem to indicate that the AMQP cloud connection isn't working.

[2015-01-13 14:43:22,006: INFO/MainProcess] Scheduler: Sending due task update-scoreboard (fix_note_counts)
[2015-01-13 14:43:34,880: ERROR/MainProcess] Message Error: Couldn't apply scheduled task update-scoreboard: [Errno 104] Connection reset by peer
['  File "manage.py", line 14, in <module>\n    execute_from_command_line(sys.argv)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 453, in execute_from_command_line\n    utility.execute()\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute\n    self.fetch_command(subcommand).run_from_argv(self.argv)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/djcelery/management/commands/celery.py", line 22, in run_from_argv\n    [\'%s %s\' % (argv[0], argv[1])] + argv[2:],\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 901, in execute_from_commandline\n    super(CeleryCommand, self).execute_from_commandline(argv)))\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/base.py", line 187, in execute_from_commandline\n    return self.handle_argv(prog_name, argv[1:])\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 893, in handle_argv\n    return self.execute(command, argv)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 868, in execute\n    return cls(app=self.app).run_from_argv(self.prog_name, argv)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 148, in run_from_argv\n    return self(*args, **options)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 118, in __call__\n    ret = self.run(*args, **kwargs)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celery.py", line 220, in run\n    return self.target.run(*args, **kwargs)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/bin/celerybeat.py", line 67, in run\n    return beat().run()\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/apps/beat.py", line 72, in run\n    self.start_scheduler()\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/apps/beat.py", line 101, in start_scheduler\n    beat.start()\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/beat.py", line 429, in start\n    interval = self.scheduler.tick()\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/beat.py", line 199, in tick\n    next_time_to_run = self.maybe_due(entry, self.publisher)\n', '  File "/app/.heroku/python/lib/python2.7/site-packages/celery/beat.py", line 185, in maybe_due\n    exc, traceback.format_stack(), exc_info=True)\n']
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/beat.py", line 182, in maybe_due
    result = self.apply_async(entry, publisher=publisher)
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/beat.py", line 226, in apply_async
    **entry.options)
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/app/task.py", line 474, in apply_async
    **options)
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/app/amqp.py", line 250, in publish_task
    **kwargs
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 164, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 396, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/utils/__init__.py", line 217, in retry_over_time
    return fun(*args, **kwargs)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 246, in connect
    return self.connection
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 761, in connection
    self._connection = self._establish_connection()
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 720, in _establish_connection
    conn = self.transport.establish_connection()
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 115, in establish_connection
    conn = self.Connection(**opts)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 154, in __init__
    return self._x_open(virtual_host)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 579, in _x_open
    (10, 41),    # Connection.open_ok
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
    self.channel_id, allowed_methods)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 204, in _wait_method
    self.method_reader.read_method()
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method
    raise m
SchedulingError: Couldn't apply scheduled task update-scoreboard: [Errno 104] Connection reset by peer

[2015-01-13 14:44:27,448: INFO/MainProcess] Got task from broker: tweet_note[436bf8f0-28ab-4109-9ccb-a7c994f62389]
[2015-01-13 14:44:28,268: CRITICAL/MainProcess] Task karmaworld.apps.document_upload.tasks.process_raw_document[e80407d0-f339-4ff4-8c25-4ccc2309a2b0] INTERNAL ERROR: error(104, 'Connection reset by peer')
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/task/trace.py", line 268, in trace_task
    store_result(uuid, retval, SUCCESS)
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/backends/base.py", line 282, in store_result
    self._store_result(task_id, result, status, traceback, **kwargs)
  File "/app/.heroku/python/lib/python2.7/site-packages/celery/backends/amqp.py", line 134, in _store_result
    declare=[self._create_binding(task_id)])
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/messaging.py", line 164, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 396, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/utils/__init__.py", line 217, in retry_over_time
    return fun(*args, **kwargs)
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 246, in connect
    return self.connection
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 761, in connection
    self._connection = self._establish_connection()
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/connection.py", line 720, in _establish_connection
    conn = self.transport.establish_connection()
  File "/app/.heroku/python/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 115, in establish_connection
    conn = self.Connection(**opts)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 154, in __init__
    return self._x_open(virtual_host)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 579, in _x_open
    (10, 41),    # Connection.open_ok
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
    self.channel_id, allowed_methods)
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/connection.py", line 204, in _wait_method
    self.method_reader.read_method()
  File "/app/.heroku/python/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method
    raise m
error: [Errno 104] Connection reset by peer

btbonval commented 9 years ago

Staging and Production both use the same remote CloudAMQP URL. Production does not seem to have any issues. Staging does.

They run against a Tiger CloudAMQP instance.

btbonval commented 9 years ago

Might have to do with the concurrent connections, by tweaking BROKER_POOL_LIMIT, which is not set for either of Staging or Prod. https://devcenter.heroku.com/articles/cloudamqp#celery

Tiger should support up to 12 concurrent connections, though. It seems unlikely prod is using all 12 connections and beta can't even get in there for a moment. https://www.cloudamqp.com/plans.html

I do not have account access to Finals Club's CloudAMQP to check any further.

btbonval commented 9 years ago

Possibly a permissions problem. http://stackoverflow.com/a/26629093/1867779

Possibly a domain name change problem. In this case, if I understand the problem, CloudAMQP might have changed the connection information, but production has no issue because its environment hasn't been reset? I don't know why that would be the case. http://community.spiceworks.com/topic/683143-zenoss-4-2-5-jobs-never-move-beyond-pending

btbonval commented 9 years ago

It looks like we've hit max connections. According to CloudAMQP:

Open connections    13 of 12

When you've reached the maximum concurrent connections further connections will be prohibited. You can connect again when you're under the limit.

Unfortunately when you've reached the maximum concurrent connections you can't access the management interface either.

So I cannot access the management console, but the problem seems quite clear.

I'll have to try to limit the number of connections from prod to 8 or 10.

btbonval commented 9 years ago

CloudAMQP actually let me connect to the management interface anyway. This is the interface supplied: https://www.rabbitmq.com/management.html

13 connections, 14 channels, 0 exchanges and 0 queues. From the graphs, it looks like there have been no messages being passed for awhile.

I think that our plan used to allow more connections (like 15 or 18). I'm also thinking we've held 13 connections since before they changed it to 12.

It looks like you can click on a connection's name for more info and then close it in that interface. Going to close up some connections and see if more connections are created automatically by the production server.

btbonval commented 9 years ago

Beta's queue suddenly showed up on the management console, so that seems to be capable of acting now. 9000 messages on the queue waiting to be processed (wow!).

I closed a few more connections so beta can get in there and parse that stuff more quickly.

This is the documentation for Celery 3.0.25. Looks like we've locked python-celery to 3.0.21, so that's close. http://docs.celeryproject.org/en/3.0/configuration.html

Looks like we're configuring Celery here: https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/settings/common.py#L323-L331

and here with some values overridden: https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/settings/prod.py#L58-L97

Strange that we had 13 connections and beta could not connect at all: "The pool is enabled by default since version 2.5, with a default limit of ten connections." Furthermore, we define BROKER_POOL_LIMIT in the configuration to just 1 connection. No reason we should have hit 13 connections!

I notice mongo and redis celery backends have specific connection limits, but the amqp backend does not.

Someone else having this problem, here's an SO with no answer: http://stackoverflow.com/questions/23249850/celery-cloudamqp-creates-new-connection-for-each-task

Someone mentioned BROKER_POOL_LIMIT not working for them with CloudAMQP, but setting it to 0 was alright. The question specifically asked about Redis, though. http://stackoverflow.com/a/23563018/1867779

So it looks like we might want to set BROKER_POOL_LIMIT=0 and then connections will be made and lost as necessary. It seems like the utility of the connection is transient if watching the stats has been any indication.

btbonval commented 9 years ago

I'm running celery. I see connections being created from beta's celery now. Some are refused due to the max limit being reached, but connections are definitely happening. I'm watching CloudAMQP and seeing between 11 and 13 connections (up from 9 left by production). I don't see the queue size decreasing much from 9000some.

I'm seeing literally thousands of weirdly named queues being created and I don't know why. 2500 and increasing uniquely named queues. Before, there were 4: 2 weird names and 2 human names.

btbonval commented 9 years ago

Ah but the score board just updated, adding a new school. That would seem to indicate the scoreboard update celery task is running, for #377. It would also seem to indicate celery is working well enough to close this ticket.

btbonval commented 9 years ago

This issue remains closed, but as I investigate some AMQP stuff, I'll add notes to this ticket for reference.

There are now only 12 queues running.

3 have names of the form UUID.celery.pidbox. It looks like each queue with that name is a "fan exchange" created by each worker. or something. Rarely they switch from idle to running. http://hustoknow.blogspot.com/2011/11/how-celeryctl-command-works-in-celery.html

2 are named karmanotes-[beta|prod] as expected. Occasionally they switch from idle to running.

The others are named with long hexadecimal strings. These appear to always be idle.

FinalsClub / karmaworld

celery tasks on beta are having the connection reset by peer #392