element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0
1.54k stars 189 forks source link

Simplified sliding sync returns negative values for some field where a positive one is expected #17737

Closed jmartinesp closed 1 month ago

jmartinesp commented 1 month ago

Description

We received a report from a user of Element X Android (@spaetz in here, if I'm not mistaken) where he couldn't get the app to work with SSS at all with one account in his homeserver but it worked fine with different one in the same HS, the cause seemed to be the client was cancelling each /sync request before the HS finished sending it (lots of 200! status codes for /sync requests).

After some debugging, we found these errors in the logs of the SDK:

2024-09-20T16:01:09.229399Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Deserialization(Json(Error("invalid value: integer `-67914`, expected u64", line: 1, column: 47547)))) | crates/matrix-sdk/src/http_client/mod.rs:228 | spans: sync_once > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9, V1_10, V1_11] config=RequestConfig { timeout: 60s } request_id="REQ-4" method=POST uri="https://matrix.sspaeth.de/_matrix/client/unstable/org.matrix.simplified_msc3575/sync" request_size="476 B" status=200 response_size="100.5 kiB"}
2024-09-20T16:01:09.229622Z ERROR matrix_sdk_ui::sync_service: Error while processing room list in sync service: SlidingSync failed: deserialization failed: invalid value: integer `-67914`, expected u64 at line 1 column 47547 | crates/matrix-sdk-ui/src/sync_service.rs:283

We can deduce some field has an invalid format, where some positive number was expected but we receive a negative one, and in the same logs we can find the same error happening at different indexes inside the /sync response, so it's a value that's repeated several times in the responses:

2024-09-20T16:01:09.772705Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Deserialization(Json(Error("invalid value: integer `-67914`, expected u64", line: 1, column: 44112)))) | crates/matrix-sdk/src/http_client/mod.rs:228 | spans: sync_once > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9, V1_10, V1_11] config=RequestConfig { timeout: 60s } request_id="REQ-11" method=POST uri="https://matrix.sspaeth.de/_matrix/client/unstable/org.matrix.simplified_msc3575/sync" request_size="476 B" status=200 response_size="100.5 kiB"}
2024-09-20T16:01:09.772976Z ERROR matrix_sdk_ui::sync_service: Error while processing room list in sync service: SlidingSync failed: deserialization failed: invalid value: integer `-67914`, expected u64 at line 1 column 44112 | crates/matrix-sdk-ui/src/sync_service.rs:283
...
2024-09-20T16:01:10.332448Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Deserialization(Json(Error("invalid value: integer `-67914`, expected u64", line: 1, column: 42678)))) | crates/matrix-sdk/src/http_client/mod.rs:228 | spans: sync_once > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9, V1_10, V1_11] config=RequestConfig { timeout: 60s } request_id="REQ-14" method=POST uri="https://matrix.sspaeth.de/_matrix/client/unstable/org.matrix.simplified_msc3575/sync" request_size="476 B" status=200 response_size="100.5 kiB"}
2024-09-20T16:01:10.332808Z ERROR matrix_sdk_ui::sync_service: Error while processing room list in sync service: SlidingSync failed: deserialization failed: invalid value: integer `-67914`, expected u64 at line 1 column 42678 | crates/matrix-sdk-ui/src/sync_service.rs:283

Sadly, we don't have the response contents to check which field is causing this issue.

Steps to reproduce

Homeserver

https://sspaeth.de/

Synapse Version

0.115.0

Installation Method

I don't know

Database

-

Workers

I don't know

Platform

Debian

Configuration

No response

Relevant log output

-

Anything else that would be useful to know?

No response

spaetz commented 1 month ago

Hi there, let me know if I can do anything to support debugging this. If you need access to the actual content being returned (or to the raw database), give me a shoutout. Happy to give you an access token or postgres access. Thanks in advance.

spaetz commented 1 month ago

Been randomly poking in some slidingsync tables:

select event_stream_ordering from sliding_sync_membership_snapshots WHERE event_stream_ordering<=0; event_stream_ordering -98673 -67914 -44846 -38983 -14132

Are there supposed to be negative values in here, for instance? (I have 1016 positive rows)

And just in case they are not supposed to be negative (and that is where I will stop poking in the dark). After truncating all slidingsync* tables, this table is being recreated with those same negative numbers by synapse 1.115, so it is not some historical garbage. (I know that the tables are currently repopiulated when empty, which will not work in future synapse versions).

spaetz commented 1 month ago

Err, the negative event_stream_ordering value of -67914 in my db corresponds to the wrong value as captured by the SDK log, so there seems to be an impedance mismatch on what is sent and what is expected.

Hywan commented 1 month ago

@spaetz Side note: thanks for helping and for your patience!

spaetz commented 1 month ago

You are welcome, it is in my own interest. BTW, great talk of yours today. You hiked all those peaks yourself? :-)

clokep commented 1 month ago

Negative stream ordering corresponds to backfilled events IIRC.

erikjohnston commented 1 month ago

This is odd. As clokep says, negative stream ordering happens when we backfill events. However, these stream orderings are for the membership events of local users, so how did they end up being backfilled?!

@spaetz did you by any chance delete the room? That might explain why local users membership events are being backfilled.

Can you also run:

SELECT membership, sender == user_id, event_stream_ordering FROM sliding_sync_membership_snapshots
WHERE event_stream_ordering<=0;
spaetz commented 1 month ago

SELECT membership, sender=user_id, event_stream_ordering FROM sliding_sync_membership_snapshots WHERE event_stream_ordering<=0;

 membership | ?column? | event_stream_ordering 
------------+----------+-----------------------
 leave      | t        |                -38983
 leave      | t        |                -14132
 leave      | t        |                -98673
 leave      | t        |                -44846
 join       | t        |                -67914

(5 Zeilen)

The last row is room #community-moderation-effort-bl:neko.dev (that I am actually still a member of). But I have powerlevel 0 (not allowed to write in the room).

I probably joined and left and rejoined that room in the history of the user, but I certainly have joined the room using this very synapse server.

erikjohnston commented 1 month ago

Thanks @spaetz, that was useful.

Hopefully fixed by: https://github.com/element-hq/synapse/pull/17748

spaetz commented 1 month ago

Just to confirm that 1.116rc1 has fixed the issue and SSS syncs now just fine with EXA!