Open bowlofeggs opened 5 years ago
So after a lot of staring at the backtrace and the code, I believe what's happening is publish
is being called in the composer consumer, but no exceptions are being handled so when the connection error occurs this is bubbled up into the consumer loop and crashes the consumer.
The publish API is raising the proper "ConnectionException" in response to "pika.exceptions.StreamLostError" so this isn't exactly a bug in fedora-messaing, but what it's not doing is automatically retrying before giving up because it's entering the generic AMQPError handler. We should fix this, especially because the blocking connection times out regularly as it cannot heartbeat. This will cause users to hit it less frequently, but they still need to handle exceptions where the automatic retry doesn't work.
Hmm, I was able to reproduce this again with fedora-messaging-1.7.0:
-- Logs begin at Wed 2018-10-31 23:49:56 UTC, end at Wed 2019-05-22 21:23:39 UTC. --
2019-05-22 21:17:49,912 ERROR [bodhi][f29-updates] Pungi exited with exit code 1
2019-05-22 21:17:50,020 ERROR [bodhi][f29-updates] Stderr: Traceback (most recent call last):
File "/usr/bin/pungi-koji", line 482, in <module>
main()
File "/usr/bin/pungi-koji", line 261, in main
run_compose(compose, create_latest_link=create_latest_link, latest_link_status=latest_link_status)
File "/usr/bin/pungi-koji", line 381, in run_compose
essentials_phase.stop()
File "/usr/lib/python3.7/site-packages/pungi/phases/weaver.py", line 53, in stop
self.pool.stop()
File "/usr/lib/python3.7/site-packages/kobo/threads.py", line 120, in stop
six.reraise(exc_info[0], exc_info[1], exc_info[2])
File "/usr/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3.7/site-packages/kobo/threads.py", line 67, in run
self.process(item, num)
File "/usr/lib/python3.7/site-packages/pungi/phases/weaver.py", line 69, in process
phase.start()
File "/usr/lib/python3.7/site-packages/pungi/phases/base.py", line 64, in start
self.run()
File "/usr/lib/python3.7/site-packages/pungi/phases/gather/__init__.py", line 111, in run
manifest=self.manifest)
File "/usr/lib/python3.7/site-packages/pungi/phases/gather/link.py", line 122, in link_files
pool.stop()
File "/usr/lib/python3.7/site-packages/kobo/threads.py", line 120, in stop
six.reraise(exc_info[0], exc_info[1], exc_info[2])
File "/usr/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3.7/site-packages/kobo/threads.py", line 67, in run
self.process(item, num)
File "/usr/lib/python3.7/site-packages/pungi/linker.py", line 63, in process
self.pool.linker.link(src, dst, link_type=self.pool.link_type)
File "/usr/lib/python3.7/site-packages/pungi/linker.py", line 214, in link
self._link_file(src, dst, link_type)
File "/usr/lib/python3.7/site-packages/pungi/linker.py", line 207, in _link_file
self.copy(src, dst)
File "/usr/lib/python3.7/site-packages/pungi/linker.py", line 182, in copy
shutil.copy2(src, dst)
File "/usr/lib64/python3.7/shutil.py", line 263, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/usr/lib64/python3.7/shutil.py", line 121, in copyfile
with open(dst, 'wb') as fdst:
OSError: [Errno 28] No space left on device: '/mnt/koji/compose/updates/Fedora-29-updates-20190522.1/compose/Everything/armhfp/debug/tree/Packages/n/NetworkManager-team-debuginfo-1.12.6-5.fc29.armv7hl.rpm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/pungi-koji", line 489, in <module>
COMPOSE.write_status("DOOMED")
File "/usr/lib/python3.7/site-packages/pungi/compose.py", line 326, in write_status
self.notifier.send('status-change', status=stat_msg)
File "/usr/lib/python3.7/site-packages/pungi/notifier.py", line 85, in send
self._run_script(cmd, msg, workdir, kwargs)
File "/usr/lib/python3.7/site-packages/pungi/notifier.py", line 107, in _run_script
logfile=logfile)
File "/usr/lib/python3.7/site-packages/kobo/shortcuts.py", line 288, in run
log = open(logfile, mode)
OSError: [Errno 28] No space left on device: '/mnt/koji/compose/updates/Fedora-29-updates-20190522.1/logs/global/notifications/notification-2019-05-22_21-17-48.log'
2019-05-22 21:17:50,272 ERROR [bodhi][f29-updates] Exception in ComposerThread(f29-updates)
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
2019-05-22 21:17:50,495 INFO [bodhi][f29-updates] Compose object updated.
2019-05-22 21:17:50,495 INFO [bodhi][f29-updates] Thread(f29-updates) finished. Success: False
2019-05-22 21:17:50,498 ERROR [pika.adapters.utils.io_services_utils][f29-updates] _AsyncBaseTransport._consume() failed, aborting connection: error=SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2488)'); sock=<ssl.SSLSocket fd=24, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.5.128.79', 33970)>; Caller's stack:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
super(_AsyncSSLTransport, self)._consume()
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
return sock.recv(max_bytes)
File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
return self.read(buflen)
File "/usr/lib64/python3.7/ssl.py", line 913, in read
return self._sslobj.read(len)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
super(_AsyncSSLTransport, self)._consume()
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
return sock.recv(max_bytes)
File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
return self.read(buflen)
File "/usr/lib64/python3.7/ssl.py", line 913, in read
return self._sslobj.read(len)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
2019-05-22 21:17:50,499 ERROR [pika.adapters.utils.io_services_utils][f29-updates] Wrapped func exited with exception. Caller's stack:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pika/diagnostic_utils.py", line 53, in log_exception_func_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
super(_AsyncSSLTransport, self)._consume()
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
return sock.recv(max_bytes)
File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
return self.read(buflen)
File "/usr/lib64/python3.7/ssl.py", line 913, in read
return self._sslobj.read(len)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/pika/diagnostic_utils.py", line 53, in log_exception_func_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
super(_AsyncSSLTransport, self)._consume()
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
return func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
return sock.recv(max_bytes)
File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
return self.read(buflen)
File "/usr/lib64/python3.7/ssl.py", line 913, in read
return self._sslobj.read(len)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
2019-05-22 21:17:50,499 ERROR [pika.adapters.base_connection][f29-updates] connection_lost: StreamLostError: ("Stream connection lost: SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2488)')",)
2019-05-22 21:17:50,500 ERROR [pika.adapters.blocking_connection][f29-updates] Unexpected connection close detected: StreamLostError: ("Stream connection lost: SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2488)')",)
2019-05-22 21:17:50,500 INFO [fedora_messaging._session][f29-updates] Resetting connection to rabbitmq.stg.fedoraproject.org
2019-05-22 21:17:50,610 INFO [bodhi][f29-updates] Compose object updated.
2019-05-22 21:17:50,610 ERROR [bodhi][f29-updates] ComposerThread failed. Transaction rolled back.
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 316, in run
self.work()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 386, in work
self._compose_updates()
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 925, in _compose_updates
self._wait_for_pungi(pungi_process)
File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/composer.py", line 1211, in _wait_for_pungi
raise Exception('Pungi exited with status %d' % pungi_process.returncode)
Exception: Pungi exited with status 1
2019-05-22 21:17:50,710 INFO [bodhi][f29-updates] Released semaphore
2019-05-22 21:17:50,710 INFO [bodhi][PoolThread-twisted.internet.reactor-0] Push complete! Summary follows:
2019-05-22 21:17:50,710 INFO [bodhi][PoolThread-twisted.internet.reactor-0] name: f29-updates success: False
2019-05-22 21:17:50,710 INFO [fedora_messaging.twisted.protocol][MainThread] Successfully consumed message from topic org.fedoraproject.stg.bodhi.composer.start (message id c1f3c409-2050-4524-aa95-7e0f4247bf6a)
It's not immediately obvious to me by looking at the tracebacks whether this is due to my publishing code, or whether this is due to my consuming code. Since this is the composer, we are both publishing and consuming in the same process. And it is likely that I attempted to publish a message about the failure due to being out of disk space, so perhaps that is what triggered it. However, I do see calls to a _consume()
function in the traceback too, so…? Might need to get in there with a debugger to figure out more, but that's also tricky since I don't yet know a reliable reproducer.
I just did a test with Bodhi:
0) I added a log statement in Bodhi's publish function to log every message before it tries to send. 1) I ran a compose. 2) The compose did generate the SSLEOFError. 3) I verified that every message logged made it to datagrepper.
Thus, I am satisfied that this error is actually solved. I think it's just getting logged by pika which is why it shows up in my logs, but fedora-messaging does seem to respond to it favorably and makes sure the message is still sent. On my end, I will probably just stop logging pika so I don't see the tracebacks and call it a day.
We can close this if you want, unless there's something you'd like to use it to track.
I'd like to keep this open to track not leaving stale connections open in the publish code.
The reason I originally kept the connection around is making a new TLS connection is expensive and if you publish rapidly you can re-use the connection. The problem is most things don't publish rapidly and this leaves open connections that the broker eventually has to clean up along with nasty tracebacks.
We either need to find a way to heartbeat with the synchronous API (e.g. run a thread somewhere) or open and close a connection per publish.
In Fedora's staging environment, I have observed this problem in both OpenShift and bodhi-backend01.stg today:
In the case of Bodhi's OpenShift consumer, it ends up crashing the process, systemd restarts it, the message is tried again, and it was successful.
However, in Bodhi's composer, the composer itself caught the exception (it has a catch all so we can log and mark the compose as failed), which means that the message was ultimately ACK'd and the process did not halt and the compose was left in the failed state.
Should it be the responsibility of the application (i.e., Bodhi) to catch this kind of error, or should fedora-messaging catch and handle it for me? I'm not sure fedora-messaging could catch it in the case of Bodhi's composer…