element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0

Strange federation bug. Possibly in Synapse, matrix.org or both. #17035

Open shyrwall opened 5 months ago

shyrwall commented 5 months ago

Description

Hi

This will be a vague bug report, so I'm hoping someone will have an "aha" moment when reading it.

For the past week my homeserver has been unable to send outgoing federation messages to matrix.org. Right after Synapse starts, some messages go through, but after a few seconds sending stops and goes into a retry loop. During these retries matrix.org, or rather Cloudflare, returns an HTTP 520 error.

Upon further inspection I managed to narrow it down to an m.direct_to_device message (a very large JSON object) being sent by a single user. After deleting the event from device_federation_outbox, everything worked again.

My theory is that the object was too large, so matrix.org/Cloudflare returned an error and Synapse just kept retrying.

If this is correct, is it a bug in Synapse, which should somehow split this into multiple requests? Or is it a matrix.org bug, in that it has a low limit on request size?
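To make the size theory concrete, here is a minimal sketch of measuring how large a to-device EDU is once serialized. The field values, the padding payload and the 300,000-byte threshold are all made up for illustration; neither Synapse nor matrix.org is known to use this limit, and this is not how Synapse itself measures anything.

```python
import json


def edu_size_bytes(edu: dict) -> int:
    """Size of the EDU when serialized as compact UTF-8 JSON."""
    encoded = json.dumps(edu, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def is_oversized(edu: dict, limit_bytes: int) -> bool:
    """True if the serialized EDU exceeds the given (assumed) limit."""
    return edu_size_bytes(edu) > limit_bytes


# Hypothetical EDU: real content would be encrypted payloads per device,
# here replaced by padding to reach a size similar to the one reported.
example_edu = {
    "edu_type": "m.direct_to_device",
    "content": {
        "sender": "@alice:example.org",
        "type": "m.room.encrypted",
        "message_id": "hypothetical-id",
        "messages": {"@bob:matrix.org": {"*": {"padding": "x" * 700_000}}},
    },
}

if __name__ == "__main__":
    print(edu_size_bytes(example_edu), is_oversized(example_edu, 300_000))
```

Running a check like this against the JSON stored for a queued EDU would at least show whether the failing requests correlate with payload size.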

Attaching the deleted event.

Thank you.

bad_edu.log

Steps to reproduce

-

Homeserver

xmr.se

Synapse Version

Multiple versions tested, 1.99 and up. Currently 1.103.0.

Installation Method

pip (from PyPI)

Database

postgresql. Single, no, no

Workers

Multiple workers

Platform

Not relevant.

Configuration

No response

Relevant log output

2024-03-25 19:22:26,936 - synapse.http.matrixfederationclient - 755 - INFO - federation_transaction_transmission_loop-24 - {PUT-O-29} [matrix.org] Got response headers: 520
2024-03-25 19:22:26,936 - synapse.http.matrixfederationclient - 798 - INFO - federation_transaction_transmission_loop-24 - {PUT-O-29} [matrix.org] Request failed: PUT matrix-federation://matrix.org/_matrix/federation/v1/send/1711389457952: HttpResponseException('520: ')
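For anyone trying to spot the same pattern in their own logs, the following is a rough sketch that counts response status codes per destination. It assumes the log lines look like the two above (the regular expression is derived from them only) and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "[matrix.org] Got response headers: 520"
STATUS_RE = re.compile(
    r"\[(?P<dest>[^\]]+)\] Got response headers: (?P<status>\d{3})"
)


def count_statuses(lines):
    """Count (destination, status) pairs found in federation client log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[(match.group("dest"), int(match.group("status")))] += 1
    return counts


sample = [
    "2024-03-25 19:22:26,936 - synapse.http.matrixfederationclient - 755 - INFO - "
    "federation_transaction_transmission_loop-24 - {PUT-O-29} [matrix.org] "
    "Got response headers: 520",
    "2024-03-25 19:22:26,936 - synapse.http.matrixfederationclient - 798 - INFO - "
    "federation_transaction_transmission_loop-24 - {PUT-O-29} [matrix.org] "
    "Request failed: PUT matrix-federation://matrix.org/_matrix/federation/v1/"
    "send/1711389457952: HttpResponseException('520: ')",
]

if __name__ == "__main__":
    for (dest, status), n in count_statuses(sample).items():
        print(dest, status, n)
```

A steadily growing count of 520 for a single destination, with other destinations unaffected, would match the retry loop described here.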

Anything else that would be useful to know?

No response

S7evinK commented 1 month ago

Did this happen again? Also, in the bad_edu.log, is 704458 the size of the EDU? https://github.com/element-hq/synapse/pull/17371 was merged recently, which had a similar symptom to the issue mentioned here. It may be that this is resolved by now.

shyrwall commented 3 weeks ago

Sorry for the late reply. For some reason GitHub emails have been going to my spam folder. It has not happened again. Let's assume it was fixed.

shyrwall commented 1 week ago

Hi. I encountered the problem again in 1.113, and again with matrix.org. Unfortunately I couldn't capture logs, but I fixed it by wiping the oldest EDU, which was ~600 kB.

shyrwall commented 6 days ago

This may not be helpful information, but when it happened again I just checked the size of messages_json in device_federation_outbox and deleted all rows over 300 kB. There were about 10 rows between 300 and 600 kB. After the deletion, all queued messages to matrix.org completed instantly. I have attached one of the big EDUs.
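The size check described above can be sketched roughly as follows. The table and the messages_json column are the ones named in this comment; the destination and stream_id columns, the function name and the 300,000-character threshold are assumptions for illustration. The demo uses an in-memory SQLite table as a stand-in so the snippet runs on its own; against a real Synapse database the same function would be given a PostgreSQL cursor instead, and this version only lists rows, it does not delete anything.

```python
import sqlite3

# length() counts characters, which approximates bytes for mostly-ASCII JSON.
SIZE_QUERY = """
SELECT destination, stream_id, length(messages_json) AS size
FROM device_federation_outbox
ORDER BY size DESC
"""


def oversized_rows(cursor, limit):
    """Return (destination, stream_id, size) for rows larger than `limit`."""
    cursor.execute(SIZE_QUERY)
    return [row for row in cursor.fetchall() if row[2] > limit]


def demo():
    """Build a stand-in table with made-up rows and list the large ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE device_federation_outbox "
        "(destination TEXT, stream_id INTEGER, messages_json TEXT)"
    )
    conn.executemany(
        "INSERT INTO device_federation_outbox VALUES (?, ?, ?)",
        [
            ("matrix.org", 1, "x" * 600_000),
            ("matrix.org", 2, "x" * 1_000),
            ("example.org", 3, "x" * 350_000),
        ],
    )
    return oversized_rows(conn.cursor(), 300_000)


if __name__ == "__main__":
    for destination, stream_id, size in demo():
        print(destination, stream_id, size)
```

Listing first and deleting by hand afterwards keeps a record of what was removed, which would help if someone wants to inspect the oversized messages later.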

EDIT: Just realised that I may be confusing PDUs with EDUs :)

[Uploading big_edu-240830.log…]()