Open Johennes opened 1 year ago
Speaking of which, a change which merged yesterday morning from @t3chguy has made my experience much better today. If you try the latest nightly build or develop.element.io* you may see a decent improvement **.
*: nightly builds and develop are both bleeding-edge and not guaranteed to work at all! If you'd prefer to wait, this change will appear in the release on 1st August.
**: but I still see some more minor problems, so we are definitely not at the end of the road yet.
This change also sparked another spec change proposal: https://github.com/matrix-org/matrix-spec-proposals/pull/4037
I appreciate all your hard work to get this resolved, thank you so much! I do understand things take time ("change will appear in the release on 1st August") but frankly still wish it was solved sooner in production (my entire startup team using element is complaining to me daily about it.)
I appreciate all your hard work to get this resolved, thank you so much! I do wonder if the indicated time-frame ("change will appear in the release on 1st August") is the optimal course of action with such critical bugs, that are clearly damaging to the project brand and core user experience.
Even though my own experience with stuck notifications is terrible, let's not rush the devs to make it two weeks earlier (we've been experiencing this for how long?). We know what happens with releasing stuff too early.
To aid debugging, here are some scripts we can use to gather information when debugging this:
Find my receipts in the Client for a room: https://gist.github.com/andybalaam/a20f44e24688378a659a6ed7f22f5944
Find my receipts in the Sync Accumulator for a room: https://gist.github.com/andybalaam/8c1f89434a0625bbe84553bac66d4520
Find my receipts in the Sync Accumulator for a thread in a room: https://gist.github.com/andybalaam/1548f9a5f1cd782c733d627e64dc4e12
Find the whole timeline of events in a room: https://gist.github.com/andybalaam/4daa10fffd8454b5de5fe78d11ba038a
To run these you will need to edit the scripts to insert the correct room ID and thread ID.
I appreciate all your hard work to get this resolved, thank you so much! I do understand things take time ("change will appear in the release on 1st August") but frankly still wish it was solved sooner in production (my entire startup team using element is complaining to me daily about it.)
Thanks! This delay reflects our normal release process. The fix happened after our recent release candidate (11th July), so it won't appear in our next release (18th July) but it will go into the one after that (1st August). This process gives us time to test changes before we release them, which is critical to reduce the chances of releasing a nasty bug.
In the meantime you are welcome to try it out by using nightly builds or the develop.element.io URL.
Late-arriving update (might please @samim23 ;-)
We discussed this a little more, and we are quite confident in the fix that went in recently - we think it has made the situation better, and we think it has low risk, so we backported it to our next release candidate, and if all goes to plan it will be included in the release on Tuesday 18th July.
Just wanted to say thanks to the team (in particular @andybalaam) for communicating like a champ and keeping us in the loop as of late. Can't speak for others, but it makes a massive difference to me and is rebuilding my faith in the project.
I've been testing develop.element.io and I can say that it has resolved a large proportion of the "stuck room notifications" issues. Well done!
Running Element 1.11.36 & Synapse 1.88.0 for 3 days now and so far all the "stuck" notification issues seem to be gone, at least for me. I still get occasional unread badges in rooms even though there are no new messages, but they are no longer stuck or climbing into the hundreds.
Running Element 1.11.36 & Synapse 1.88.0 for 3 days now and so far all the "stuck" notification issues seem to be gone, at least for me. I still get occasional unread badges in rooms even though there are no new messages, but they are no longer stuck or climbing into the hundreds.
I second this, the situation has much improve! There are occasional weird badges for sure, but that is manageable. Thanks to all the devs for the hard work on this!
Update for today:
Just for fun, if you're interested in why this is a difficult problem to fix, you might be interested in our proposed set of test cases that we intend to write to make sure it works as expected. At the time of writing, we have identified 146 cases we need to test, and we are not at all convinced we've listed them all. Hopefully this also encourages you that when we have all those test cases, the behaviour of the app in this area will be considerably more stable.
The topmost task list in the issue description has been updated with our latest plan of attack. Following on from the impactful fixes we've made in the past two weeks, we believe there are more code-level problems we can solve and, thus, are slightly deprioritizing pushing through on the MSCs.
It's definitely gotten a lot less common, but just caught it another time, this time it's properly stuck (not even reacting and removing the reaction again helps): #25846
I have a single notification from someone who @
tagged me. The notification shows up in the browser tab icon as well as in the Notifications menu. I've viewed the message but the notification is still there. Is this a stuck notification, or is there some way to clear it from the Notifications menu?
Edit: Turns out I just had to right-click the room (under Rooms in the left side-panel) and click "Mark as read". Leaving this here in case it's helpful for someone else.
We are continuing work in this area; the release gone out today carries yet more fixes for stuck notifications and yet more are awaiting review for "zombie notifications" (a phenomenon where notifications you have cleared come back on reloading the app). We're now focusing on creating a swathe of tests to aid in a major refactor to the threads data model which we believe will help with the stuck notifications project but also help fix some threads bugs.
We are continuing our work on this issue. We are investigating some open issues, and trying to close bugs that we think are already fixed. We have discovered some problems with our handling of unthreaded receipts (which are created when you choose "Mark Room As Unread") and we're working out how to tackle them.
We are continuing work on our comprehensive test suite which will give us confidence that we have fixed most of the problems when it is passing.
We are working on MSC4033 because we think some of these issues are difficult or impossible to solve without it.
My issues I had are also resolved with this release, thanks a lot for all your effort!
We are seeing things improve for quite a few users, but we know some problems remain, specifically in cases where message timestamps disagree with the order of messages as the server sees it, and when dismissing old threads by using Mark Room as Read.
We are continuing to work on these problems. We are writing tests that cover all the problems we have seen. Some of these currently fail, which helps us keep track of what situations still don't work correctly.
The next step after writing the tests will be to work on a change to the spec: MSC4033. This will remove the existing ambiguities about what order events are in, making the job of deciding which events are read much simpler. We will prototype this change, and if it solves the problems as we expect (the tests will help to know), we will continue to push for it to be included in the Matrix spec and implemented in homeservers.
If anyone reading continues to see problems, and can reproduce them consistently, bugs that describe clearly how to trigger a problem are still appreciated. For those who are able, what would be even better would be a new test within the test file that fails because it triggers the problem.
Thank you again for your patience. I think we have a good plan for how to squash these problems and never see them come back!
Work continues this week on improving and fleshing out the comprehensive test suite. Next on the list after that is working on MSC4033.
Here's a glimpse on where we're at with completing the test suite:
I'm impressed with this thorough engineering approach you've taken to solving the problem.
We are continuing with the work of filling out the test suite, although I've mostly been fighting vagaries of the CI setup this week :-). At some point soonish we will probably double the size of the suite by allowing tests to run in both encrypted and unencrypted rooms, since we expect different behaviour in quite a few cases (because the server can't examine encrypted messages to give an accurate unread count).
Progress on writing test cases: the number of implemented cases rose from 46 to 70, with the total expected number of cases now at 155.
I do expect the number of cases to double at some point when we allow tests to be run either encrypted or unencrypted mode.
A breakthrough this week: one of the tests uncovered a remaining stuck unread case: "Reading an unread thread after a redaction of the latest message makes it read" so when we make that test pass we know we will be removing real stuck unreads that people experience.
77 test cases implemented out of 155, after a slow week with security releases and day-to-day maintenance duties interfering. I'm hoping to make faster progress this week.
106 tests implemented out of 156. We are moving our focus on to figuring out why some of these tests sometimes fail randomly - I suspect some of them are actually real problems in the app. Once the tests are stable, we will work on fixing some of the problems we have already found, before continuing with the test suite.
I hope this will be fixed soon. Since this is the thread where all (new) issues are linked to, I will add some observations. I don't know, if all these are covered by test cases yet.
Notifications are stuck if:
Sometimes notifications reappear in old threads.
"Mark room read" does not solve the problem - sometimes not even for the moment, but if so, after reloading the notification return.
Recently it's not only the "bold" state that appears, but also "red" with an apparently random number, often quite large.
I cannot tell currently, whether only encrypted rooms are affected, but may be.
(Synapse: 1.92.3, Element 1.11.42 via Docker, as well as (Synapse: 1.93.0, Element 1.11.44 via Docker)
I noticed that when "Mark as read" doesn't work, reacting to a message clears the notification. So when this happens to me (and it still does happen very often) I click on an emoji reaction and remove it right away.
113 tests implemented out of 157:
I have some maintenance tasks I need to work on this week, and after that I will start on looking at fixing some of those failures, which should (eventually) result in some of the remaining problems people are seeing being fixed.
This week we have been mostly fighting flaky tests. Many of the flakes are caused by unreliable behaviour by Element when messages are edited or redacted, so may well be linked to real stuck unread cases.
Our next step is to pick some test flakes or failures and fix them, hopefully giving real improvements for those of use experiencing stuck unreads.
It's interesting that the number of failing tests wen up in October, despite no new tests being added. Perhaps sign of a regression?
I have indeed started seeing the issue of stuck notifications way more frequently in the past few days.
It's interesting that the number of failing tests wen up in October, despite no new tests being added. Perhaps sign of a regression?
Hmm. Maybe that chart is misleading - those failing tests are manually marked as failing. The reason the number went up is either:
So any correlation between the numbers rising and the number of actual issues is almost certainly coincidental I'm afraid.
I have indeed started seeing the issue of stuck notifications way more frequently in the past few days.
That is bad news, and probably indicates that something changed that is not currently covered by the tests. Any hints on what kinds of things are happening to cause these would be appreciated (e.g. did someone edit or delete a message, was it in a thread, etc.)
Unfortunately it's hard to pinpoint, I'm seeing it both in a private conversation without any threads, modifications or anything else I can think of and in two rooms (both with threads, modifications, etc.).
Unfortunately it's hard to pinpoint, I'm seeing it both in a private conversation without any threads, modifications or anything else I can think of and in two rooms (both with threads, modifications, etc.).
OK, thank you. Let us know if you notice any patterns.
I am not certain if it's totally new or not, but I have definitely noticed more stuck notifications around deleted messages in the past few days.
Investigating under #26363, I have discovered that redactions can cause messages to jump to another thread, but Element Web does not implement that behaviour. The spec proposal to fix this is: https://github.com/matrix-org/matrix-spec-proposals/pull/3389
@Johennes given my problem with updating this issue's description, please could you add this MSC at the top? Thanks!
Have added it
Another data point for you: I've been using Nheko as a client for ~1 week or so now. Up until today I had zero stuck notifications at all with it. This morning I had one stuck notification which was quickly resolved by adding an emoji reaction to the latest message in the channel.
(If this sort of message is not useful please let me know and I'll stop posting here.)
Another data point for you: I've been using Nheko as a client for ~1 week or so now. Up until today I had zero stuck notifications at all with it. This morning I had one stuck notification which was quickly resolved by adding an emoji reaction to the latest message in the channel.
(If this sort of message is not useful please let me know and I'll stop posting here.)
Interesting... My understanding is that Nheko displays all messages in the main timeline (with annotations to indicate threads), and only uses unthreaded receipts. I have heard of some stuck notifications in rooms without any threads, but they are certainly much rarer. My guess would be that the order of messages was somehow inconsistent between the client and server, so the server did not accept a "later" receipt, which it considered to be "earlier".
Update on this issue: the graph looks mostly unchanged this week, but we have started work on fixing some of the issues made clear by failing tests. This PR: matrix-js-sdk#3798 fixes several redaction tests, and will be merged after the release candidate branch has been made, to allow us maximum time to test in practice whether it makes things better or worse.
Another data point for you: The issue with stuck notification was significantly better about 1 month ago. The past 2 weeks about, things feel like they have gotten worse again. Today even the trick of saying on a channel "mark all as read" stopped working for me, the unread notification simply stay there.
Fun sidenote: One of our channels had a confetti emoji (which triggers this big onscreen animation) somewhere in the distant past. For some reason, this frequently gets flagged as unread and re-triggered, leading to a somewhat confusing confetti party.
Following this thread, i am aware how complex and far-reaching this set of related bugs is, and i salute all the developers ongoing massive efforts to get this resolve - i'm rooting for you all!
I just ran into https://github.com/vector-im/element-web/issues/25479, but when I select Mark Unread on the room (not after restarting). It's not the same ticket because that one's fixed, I think. So is this there an existing ticket for this, or should I create a new one?
I just ran into #25479, but when I select Mark Unread on the room (not after restarting). It's not the same ticket because that one's fixed, I think. So is this there an existing ticket for this, or should I create a new one?
Sounds like a new ticket. Do you mean "Mark as read"?
Update for this week - the redaction-related fixes are merged and released. I am working on fixes that affect editing messages.
Sounds like a new ticket. Do you mean "Mark as read"?
Yes indeed I do; sorry. New ticket is https://github.com/vector-im/element-web/issues/26475 .
Update for this week - the redaction-related fixes are merged and released. I am working on fixes that affect editing messages.
Correction - the redaction-related fixes are merged and included in today's release candidate, which will be released next week.
Update: I have found some incorrect assumptions in the code that stores and retrieves receipts inside matrix-js-sdk, so I am working on a change that will restructure this. I think it can explain some of the incorrect answers about what is read that we have been seeing.
I confess I've delayed this week's update until I could include this graph:
... which shows some of the effects of a lot of work that has been going on behind the scenes. The little drop in the red and green lines on the right-hand side represent a reduction in the number of known stuck notification bugs in the development version of the code. Most of the drop so far comes from a PR ignoring invalid receipts.
We fully expect this to reduce the number of real-world problems people are experiencing. Note: reduce the number, not eliminate (yet).
We also had a slight false-start this week, where we had to revert changes to move deleted messages to the main timeline because they caused a bug we fixed very late in the release cycle. However, we now have a proper fix for this problem, and we expect to "unrevert" the revert just after the next release candidate. (The timing here allows us time to test this change, which is disruptive, on the bleeding-edge deployment at develop.element.io for as long as possible before releasing.)
Meanwhile, we have found some limitations in the way receipts are stored in matrix-js-sdk, which explain some of the other problems we are seeing, especially with "Mark room as read" not working as expected. We are re-writing the receipt-handling code to reflect our new understanding, which won't be super-fast, but we are hopeful that it will make a big difference.
I know this process is painfully slow when viewed from the outside, but rest assured we are working hard on it, and we are working in an organised and systematic way, using the tests as the source of truth. This means that when we take one step forward, we are less likely to take two steps back and cause new bugs with our fixes. For an explanation of how complex some of the logic we are working through here is, check out my recent Matrix Live video.
We are experiencing multiple issues in the area of "stuck" notifications and unread markers. Many are related to the way receipts interact with threaded messages.
Symptoms
We are seeing different symptoms of the problem:
Spec-level causes
Message ordering
Fundamentally, in order to interpret the meaning of a receipt that says "I have read everything up to here", we need to know what order messages are in. This is not clear in the spec, and we propose to make it clear and explicit in MSC4033.
In the meantime, Element Web uses a combination of "sync" order (the implicit order of events arriving via a /sync request) and "timestamp" order (using the
ts
property within events).Some of the existing bugs are probably caused by this inconsistency, but it is not clear yet how many: we believe there are also bugs in the implementation that cause additional problems, and this theoretical inconsistency is only the cause of a few problems.
Which thread the root belongs to
The spec has what we consider a bug when it talks about which thread the root message belongs to, which has been reflected in client code, making it inconsistent with the server implementation (at least on the Synapse server). We have a proposal to fix this bug in MSC4037.
Identifying which thread any message is in
It is sometimes difficult for clients to identify which thread an event belongs to, meaning that a receipt pointing to it is sometimes ignored. We have begun drafting MSC4023 to address this.
Other
Previously, we believed that MSC3981 (recursive relations) would solve some of the problems, but since that MSC does not solve the event-ordering problem (because the events from the /relations API are returned in "topological" order) we no longer believe it is important, except as a performance optimisation. Code-level causes
Code-level causes
We have found and fixed several bugs in the Element Web code that were caused by an incomplete understanding of the meaning of threaded and unthreaded read receipts. We anticipate that some more exist.
(We believe that the primary reason why we're not seeing the same problems on mobile is that the apps persist events they've received whereas Element Web has to re-fetch from scratch after every launch. As a result, any issue in the unread state logic, strikes again and again. The apps also use a single timeline whereas Element Web maintains one timeline per thread in addition to the main timeline in every room.)
High-level plan of actions
We believe a lot of progress can still be made without spec changes. So we're slightly deprioritising work on the MSCs.
New issue inbox
The following is a holding area for newly reported issues that require review. Once reviewed, issues should either be moved to one of the other task lists below or, if not applicable, removed from this epic.
Tasks not blocked by spec work
Tasks that are related to or dependent on spec work
We've written the following MSCs to try and address the root causes in a reliable and performant way:
/event
to fetch the parent has been implemented. This is functionally correct but has a noticeable performance impact.Issues that are related but out of scope
Time sheeting
WEB: Stuck notifications