element-hq / element-web

A glossy Matrix collaboration client for the web.
https://element.io
GNU Affero General Public License v3.0
11.15k stars 1.99k forks source link

Timeline corruption on develop - "Live timeline 0 is no longer live" #8593

Closed ara4n closed 5 years ago

ara4n commented 5 years ago

I've seen this ~3 times today, and Travis is getting bitten by it repeatedly.

Symptoms are stacktrace of:

Caught /sync error Error: live timeline 0 is no longer live - it has a neighbouring timeline
    at Room../matrix-js-sdk/lib/models/room.js.Room.addLiveEvents (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:28527:19)
    at SyncApi../matrix-js-sdk/lib/sync.js.SyncApi._processRoomEvents (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:35245:10)
    at _callee9$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34809:54)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
From previous event:
    at https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34858:46
From previous event:
    at SyncApi._callee10$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34657:74)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
From previous event:
    at SyncApi._processSyncResponse (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34946:22)
    at SyncApi._callee7$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34294:60)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
consoleObj.(anonymous function) @ rageshake.js:61
16:44:19.277 rageshake.js:61 

And a message you try to send gets stuck in localecho state at the bottom of the timeline (in practice it successfully sends).

ara4n commented 5 years ago

@turt2live can you rageshake?

ara4n commented 5 years ago

https://github.com/matrix-org/riot-web-rageshakes/issues/1217

turt2live commented 5 years ago

Past related issues (not as common as whatever happened in the last 24h):

lampholder commented 5 years ago

We're declaring this not a release blocker since we have seen very limit incidence of it. One to keep an eye on, though.

turt2live commented 5 years ago

Latest symptom from https://github.com/matrix-org/riot-web-rageshakes/issues/1238 is #riot-dev was only showing ~50% of the messages received over /sync, which lead to me thinking I was replying to someone that already had an answer :(

ara4n commented 5 years ago

@richvdh observes:

yeah :/. I suspect it's in code I wrote and happens when there's a fork in the DAG or something

lampholder commented 5 years ago

Rageshake and reload button?

aaronraimist commented 5 years ago

Just sent a rageshake on this. I assume I was getting the same issue (the console error is the same but this issue doesn't describe the user facing symptoms very well).

From the rageshake notes: Visible symptoms were that there were lots of old messages below the latest message in the room. Clicking the scroll to the bottom button actually scrolled you up because the latest message in the room was not at the bottom of the page but somewhere in the middle.

turt2live commented 5 years ago

That's consistent with the other known (but for some reason not documented) symptom. I assume it felt a bit like a rotating drum rather than a timeline?

Edit: rageshake definitely is the same issue.

aaronraimist commented 5 years ago

Yes something like that. Now after I sent the rageshake the entire app has become unresponsive. I can't click on anything, the green read receipt bar doesn't update, and animations for other users read receipts is frozen. Do you want another rageshake?

turt2live commented 5 years ago

I suspect that symptom is a different issue for sure (one of the "the app sets fire to my computer trying to start up" issues), and a rageshake might not reveal it. Would suggest opening a new issue and rageshaking on that just in case though.

lampholder commented 5 years ago

To add more detail to my very brief comment above - "Rageshake and reload button?" means "Can we mitigate this difficult-to-reproduce/investigate bug by catching the error and prompting the user to 'Rageshake and reload'.

Obviously this is horrible, but better than leaving users to discover the app is broken by its just being super broken and does at least present a way out.

turt2live commented 5 years ago

ftr, tracking the workaround dialog as https://github.com/vector-im/riot-web/issues/9260 to avoid closing this by accident.

turt2live commented 5 years ago

ftr we de-fired this pending more information from https://github.com/vector-im/riot-web/issues/9260. If people encounter this, please rageshake.

turt2live commented 5 years ago

so at least one case of this is possibly having a server come back to life after being down for a long time. https://github.com/matrix-org/riot-web-rageshakes/issues/1333 shows that at around 18:58:18 a server started processing ~100 bot commands, which Riot was trying to add to the timeline fairly quickly. It looks like the event timestamps are in the past, so possibly that messed up the backfilling? Shortly after the block of events gets processed, the timeline explodes.

The function which is doing the logging there sets neighbouring timelines, which is definitely a good indicator of being a source of problems for when it later explodes due to having such a neighbour.

https://github.com/matrix-org/riot-web-rageshakes/issues/1334 just shows that the rageshake in 1333 is still a problem after refreshing - nothing of interest.

https://github.com/matrix-org/riot-web-rageshakes/issues/1335 happens some time after debugging it - the interesting part is a "cannot reset neighbour" error followed by a timeline is no longer live problem later on.

I've mirrored the bundle.js causing issues to https://t2l.io/_matrix/media/v1/download/t2l.io/2991333c56d72a5d217b3c07d722ba985844a616 so that the line numbers line up.

turt2live commented 5 years ago

a bunch of timelines blew up all at once it seems... here's some rageshake analysis:

Something feels a bit wrong if we're doubly linking timelines, but failing to do so effectively. It appears as though we're creating a split brain with https://github.com/matrix-org/matrix-js-sdk/pull/873 being a thing, but it's a bit hard to tell. It's clear that we're getting into a wrong state though: we should never be trying to set a previous timeline as a neighbour, but somehow we are anyways and that causes an explosion.

For refs:

Where all the rageshakes are refusing to splice: https://github.com/matrix-org/matrix-js-sdk/blob/cd3c6809a94f6f346b6ba9760eb86df00136da03/src/models/event-timeline-set.js#L418-L421

Where the timelines are blowing up: https://github.com/matrix-org/matrix-js-sdk/blob/cd3c6809a94f6f346b6ba9760eb86df00136da03/src/models/event-timeline.js#L278-L279

turt2live commented 5 years ago

https://github.com/matrix-org/riot-web-rageshakes/issues/1352 broke differently: it blew up on https://github.com/matrix-org/matrix-js-sdk/blob/61aea05af0b45452ea8d4727a6f9138770559761/src/models/room.js#L1294-L1297

ara4n commented 5 years ago

Perhaps the idea of half-linking the double-linked timeline was wrong then - we should either doublelink it or not at all.

turt2live commented 5 years ago

https://github.com/matrix-org/riot-web-rageshakes/issues/1354 is our first non-develop, non-custom instance of this. It's a desktop build for 1.0.6, which means it also missed https://github.com/matrix-org/matrix-js-sdk/pull/873


Perhaps the idea of half-linking the double-linked timeline was wrong then - we should either doublelink it or not at all.

I'll open a PR for doing this.

turt2live commented 5 years ago

https://github.com/matrix-org/riot-web-rageshakes/issues/1356, https://github.com/matrix-org/riot-web-rageshakes/issues/1357, and https://github.com/matrix-org/riot-web-rageshakes/issues/1360 all look similar in that they try to do splicing and then fall over. They're also a mix of develop and release, and good indicators that this is happening a lot more often than originally thought.

turt2live commented 5 years ago

Daily rageshake review for explosions:

Overall the four potential causes (in terms of what the logs say) are:

  1. Something to do with splicing the live timeline into a position where it cannot operate (what https://github.com/matrix-org/matrix-js-sdk/pull/877 and prior try to solve)
  2. Probable side effects of https://github.com/matrix-org/matrix-js-sdk/pull/873 causing "cannot reset timeline - it has a neighbouring timeline" when handling a limited sync.
  3. New to this batch - possible rapid splicing causing problems.
  4. Also new to this batch - having pagination tokens on timelines, which may be fixed by https://github.com/matrix-org/matrix-js-sdk/pull/877
turt2live commented 5 years ago

A few more came in:

https://github.com/matrix-org/riot-web-rageshakes/issues/1371, https://github.com/matrix-org/riot-web-rageshakes/issues/1372, and https://github.com/matrix-org/riot-web-rageshakes/issues/1373 all look like a case of number 3. Timelines are being spliced when they are only a few seconds old.

turt2live commented 5 years ago

Another day, another batch of rageshakes. This is the last batch of rageshakes from /app I'll take a look at until we publish a release with some of the fixes - we have more than enough data points.

turt2live commented 5 years ago

https://github.com/matrix-org/riot-web-rageshakes/issues/1389 is an explosion on develop which appears to be complaining about pagination tokens. In theory, https://github.com/matrix-org/matrix-js-sdk/pull/885 fixes this.

For context, the error is thrown here: https://github.com/matrix-org/matrix-js-sdk/blob/b1b49413d0e8c0766bc9fd51ad9acd6da83a41d3/src/models/room.js#L1294-L1297

turt2live commented 5 years ago

Only one rageshake with the new patches post-release so far: https://github.com/matrix-org/riot-web-rageshakes/issues/1403

Looks like the first error happened in 0004, and the user finally got frustrated enough to send a bug report. Looks to be a case of splicing timelines which are very close to each other, which may be an indicator of a race somewhere:

Already have timeline for $redacted - joining timeline !redacted:2019-03-16T15:11:47.610Z to !redacted:2019-03-16T15:11:47.116Z

That's a 500ms difference between timelines.

turt2live commented 5 years ago

There hasn't been much activity in terms of rageshakes for this (even after we brought the rageshake server back online) - we are planning to presume this fixed without complaints saying otherwise.

Half-Shot commented 5 years ago

My Riot just blew up trying to (forward?)fill #irc:matrix.org, though I don't know if it is this bug. Will rageshake and if one of the Riot devs think it's not this bug, please shout.

turt2live commented 5 years ago

@Half-Shot your client is missing all of the patches which are supposed to fix this

Half-Shot commented 5 years ago

Ahhhhhh kk

turt2live commented 5 years ago

There haven't been any complaints and no further rageshakes for a long while, which means I'm confident enough to say this is fixed. If the issue persists for people, please open a new issue.

Leaving this open to track the removal of the prompt: https://github.com/matrix-org/matrix-react-sdk/pull/2939