Closed ara4n closed 5 years ago
@turt2live can you rageshake?
Past related issues (not as common as whatever happened in the last 24h):
We're declaring this not a release blocker since we have seen very limit incidence of it. One to keep an eye on, though.
Latest symptom from https://github.com/matrix-org/riot-web-rageshakes/issues/1238 is #riot-dev was only showing ~50% of the messages received over /sync, which lead to me thinking I was replying to someone that already had an answer :(
@richvdh observes:
yeah :/. I suspect it's in code I wrote and happens when there's a fork in the DAG or something
Rageshake and reload button?
Just sent a rageshake on this. I assume I was getting the same issue (the console error is the same but this issue doesn't describe the user facing symptoms very well).
From the rageshake notes: Visible symptoms were that there were lots of old messages below the latest message in the room. Clicking the scroll to the bottom button actually scrolled you up because the latest message in the room was not at the bottom of the page but somewhere in the middle.
That's consistent with the other known (but for some reason not documented) symptom. I assume it felt a bit like a rotating drum rather than a timeline?
Edit: rageshake definitely is the same issue.
Yes something like that. Now after I sent the rageshake the entire app has become unresponsive. I can't click on anything, the green read receipt bar doesn't update, and animations for other users read receipts is frozen. Do you want another rageshake?
I suspect that symptom is a different issue for sure (one of the "the app sets fire to my computer trying to start up" issues), and a rageshake might not reveal it. Would suggest opening a new issue and rageshaking on that just in case though.
To add more detail to my very brief comment above - "Rageshake and reload button?" means "Can we mitigate this difficult-to-reproduce/investigate bug by catching the error and prompting the user to 'Rageshake and reload'.
Obviously this is horrible, but better than leaving users to discover the app is broken by its just being super broken and does at least present a way out.
ftr, tracking the workaround dialog as https://github.com/vector-im/riot-web/issues/9260 to avoid closing this by accident.
ftr we de-fired this pending more information from https://github.com/vector-im/riot-web/issues/9260. If people encounter this, please rageshake.
so at least one case of this is possibly having a server come back to life after being down for a long time. https://github.com/matrix-org/riot-web-rageshakes/issues/1333 shows that at around 18:58:18 a server started processing ~100 bot commands, which Riot was trying to add to the timeline fairly quickly. It looks like the event timestamps are in the past, so possibly that messed up the backfilling? Shortly after the block of events gets processed, the timeline explodes.
The function which is doing the logging there sets neighbouring timelines, which is definitely a good indicator of being a source of problems for when it later explodes due to having such a neighbour.
https://github.com/matrix-org/riot-web-rageshakes/issues/1334 just shows that the rageshake in 1333 is still a problem after refreshing - nothing of interest.
https://github.com/matrix-org/riot-web-rageshakes/issues/1335 happens some time after debugging it - the interesting part is a "cannot reset neighbour" error followed by a timeline is no longer live problem later on.
I've mirrored the bundle.js causing issues to https://t2l.io/_matrix/media/v1/download/t2l.io/2991333c56d72a5d217b3c07d722ba985844a616 so that the line numbers line up.
a bunch of timelines blew up all at once it seems... here's some rageshake analysis:
9:33
to 8:36
(refused: live timeline), then moments later 8:36
to 9:33
(accepted), then 9:33
to 8:36
again (refused: live timeline). This then exploded when trying to add live events to the timeline, and an error immediately after where the explosion was caught.7:04
to 9:37
(accepted), then 9:37
to 7:04
(refused: live timeline). The timeline explosion happened about 200ms before this happened, however. It then continued to explode.8:11
to 9:38
(accepted) and then 9:38
to 8:11
(refused: live timeline). The explosion happened after the double splicing, unlike 1343.9:42
to 9:15
then 9:15
to 9:42
. The build is custom and older, so it did not pick up on the logging added in https://github.com/matrix-org/matrix-js-sdk/pull/873Something feels a bit wrong if we're doubly linking timelines, but failing to do so effectively. It appears as though we're creating a split brain with https://github.com/matrix-org/matrix-js-sdk/pull/873 being a thing, but it's a bit hard to tell. It's clear that we're getting into a wrong state though: we should never be trying to set a previous timeline as a neighbour, but somehow we are anyways and that causes an explosion.
For refs:
Where all the rageshakes are refusing to splice: https://github.com/matrix-org/matrix-js-sdk/blob/cd3c6809a94f6f346b6ba9760eb86df00136da03/src/models/event-timeline-set.js#L418-L421
Where the timelines are blowing up: https://github.com/matrix-org/matrix-js-sdk/blob/cd3c6809a94f6f346b6ba9760eb86df00136da03/src/models/event-timeline.js#L278-L279
Perhaps the idea of half-linking the double-linked timeline was wrong then - we should either doublelink it or not at all.
https://github.com/matrix-org/riot-web-rageshakes/issues/1354 is our first non-develop, non-custom instance of this. It's a desktop build for 1.0.6, which means it also missed https://github.com/matrix-org/matrix-js-sdk/pull/873
Perhaps the idea of half-linking the double-linked timeline was wrong then - we should either doublelink it or not at all.
I'll open a PR for doing this.
https://github.com/matrix-org/riot-web-rageshakes/issues/1356, https://github.com/matrix-org/riot-web-rageshakes/issues/1357, and https://github.com/matrix-org/riot-web-rageshakes/issues/1360 all look similar in that they try to do splicing and then fall over. They're also a mix of develop and release, and good indicators that this is happening a lot more often than originally thought.
Daily rageshake review for explosions:
Overall the four potential causes (in terms of what the logs say) are:
A few more came in:
https://github.com/matrix-org/riot-web-rageshakes/issues/1371, https://github.com/matrix-org/riot-web-rageshakes/issues/1372, and https://github.com/matrix-org/riot-web-rageshakes/issues/1373 all look like a case of number 3. Timelines are being spliced when they are only a few seconds old.
Another day, another batch of rageshakes. This is the last batch of rageshakes from /app I'll take a look at until we publish a release with some of the fixes - we have more than enough data points.
https://github.com/matrix-org/riot-web-rageshakes/issues/1389 is an explosion on develop which appears to be complaining about pagination tokens. In theory, https://github.com/matrix-org/matrix-js-sdk/pull/885 fixes this.
For context, the error is thrown here: https://github.com/matrix-org/matrix-js-sdk/blob/b1b49413d0e8c0766bc9fd51ad9acd6da83a41d3/src/models/room.js#L1294-L1297
Only one rageshake with the new patches post-release so far: https://github.com/matrix-org/riot-web-rageshakes/issues/1403
Looks like the first error happened in 0004, and the user finally got frustrated enough to send a bug report. Looks to be a case of splicing timelines which are very close to each other, which may be an indicator of a race somewhere:
Already have timeline for $redacted - joining timeline !redacted:2019-03-16T15:11:47.610Z to !redacted:2019-03-16T15:11:47.116Z
That's a 500ms difference between timelines.
There hasn't been much activity in terms of rageshakes for this (even after we brought the rageshake server back online) - we are planning to presume this fixed without complaints saying otherwise.
My Riot just blew up trying to (forward?)fill #irc:matrix.org, though I don't know if it is this bug. Will rageshake and if one of the Riot devs think it's not this bug, please shout.
@Half-Shot your client is missing all of the patches which are supposed to fix this
Ahhhhhh kk
There haven't been any complaints and no further rageshakes for a long while, which means I'm confident enough to say this is fixed. If the issue persists for people, please open a new issue.
Leaving this open to track the removal of the prompt: https://github.com/matrix-org/matrix-react-sdk/pull/2939
I've seen this ~3 times today, and Travis is getting bitten by it repeatedly.
Symptoms are stacktrace of:
And a message you try to send gets stuck in localecho state at the bottom of the timeline (in practice it successfully sends).