Open Johennes opened 1 year ago
Descoping https://github.com/vector-im/element-web/issues/24373 as this doesn't have the same level of severity as the other issues.
As mentioned in the issue description, the remaining bugs will require a spec change. We're currently in the process of drafting the MSC but it'll, unfortunately, be more involved than just a simple client-side bug fix.
I don't think that is true. If you look at the comments in vector-im/element-web#24595, you will find some from rooms where threads were never used and IME clearing cache often fixes these issues.
We're focusing on the most prominent problems first which are in the interplay of threads and other relations. There may be further issues beyond that.
I had the same issue in some rooms. Because I don't participate that much in these rooms, I tried writing a test message, and it seems that the stuck notifications are gone.
@8227846265 the threads-related stuck notification issues are blocking on https://github.com/matrix-org/matrix-spec-proposals/pull/3981.
We've completed initial implementations for it last week and are now starting to prepare for testing it. Unfortunately, it'll still take some more time to get this landed due to the nature of the spec process.
@Johennes Thanks for this explanation. If there is an environment where a regular user could help test it, I'd love to hear about it.
Thanks @leonardehrenfried. We will most likely be enabling it on beta.matrix.org so as not to impact regular users, but you can sign up for a separate account there if you want to help test.
@Johennes, thanks for the explanation indeed! How soon could it be expected to land in self-hosted setups?
I'd expect the initial PRs (https://github.com/matrix-org/synapse/pull/15315 & https://github.com/matrix-org/matrix-js-sdk/pull/3248) to land this week.
For everybody subscribed to this ticket: the backend and frontend PRs have been merged now so I expect a rollout quite soon.
(I'm not an Element/Matrix employee, just an interested user.)
@8227846265 we have one more fix in the making that doesn't depend on the MSC (https://github.com/vector-im/element-web/issues/25196) and should land soon. If that one doesn't help your case, we'd be interested to hear more details about your particular error scenario.
@neilisfragile sent logs in https://github.com/matrix-org/element-web-rageshakes/issues/21575
@kittykat sent logs in https://github.com/matrix-org/element-web-rageshakes/issues/21573
If it helps figure out what introduced it, I'm not seeing the issue that I've got on app.element.io
Gack, github tasklist overwrote the whole issue with an old version. I tried to fix it - sorry if I did it wrong
I see that things are being addressed, but at the same time, it takes time to fix things. Is there perhaps any workaround? Like, I don't know, restarting the server with some cache cleanup, etc.?
It indeed makes use of the app quite complicated...
You can try #24392 (comment)
@AlBundy33 , thank you!
Do you mean?
I had the same issue in some rooms. Because I don't participate that much in these rooms, I tried writing a test message, and it seems that the stuck notifications are gone.
If so, is the test message some special service command? If not, I believe I'm participating in the channels, and I can see the threads with messages that are presumably not read, but they are kind of 'unread forever'...
yes - I've tried different things.
Then I wrote a message into the affected rooms (I think I did some of the steps above before). After that the rooms were fixed, even after a restart of the client.
Can someone give an update on what the expected state of affairs is on this? Should we expect things to be fixed? Or is there still ongoing work? Empirically, my element remains absolutely cluttered with stuck notifications.
Since this is a meta-issue on a problem that's been ongoing for many months, it would be nice to get some updates on this every once in a while.
Hi @Valodim, thanks for the message - I can see lots of people are still affected by this. You can follow our progress somewhat by looking at the tasks closed in the task lists above, but the situation is:
Rest assured that we are still aware this is an ongoing problem, and we continue to work on it. The reason it's been so long is that our early work uncovered tons of new issues that were masked by the fact that we had a bug incorrectly marking threads as read when they were not.
I'd also like to say again how sorry we are for the pain this is causing to lots of people. We are all heavy users of Element Web, and have been feeling the pain too. We are determined to fix this problem properly, not patch over it with kinda-working hacks.
Thanks for your response Andy. I understand it's a tough issue and we all very much appreciate your work on this :+1:
You can follow our progress somewhat by looking at the tasks closed in the task lists above
I understand where you're coming from, but this is actually pretty difficult from an outside perspective. There are lots of issues related to this problem, some of them very general, some of them with large discussions, some with just technical jargon focused on fixing the technical issue at hand, and so on. Extracting a broader overview on the state of the problem from that data is not easy to do.
I regularly find myself in a situation where I'm acting as an ambassador of sorts for the ecosystem (as I'm sure many other admins of their local communities do). This particular issue has been a real hotspot on that front, which is why it would be tremendously helpful, in order to give confident answers to users, if there was an update on this high-level issue once in a while.
Hopefully this doesn't come off as too pushy. We all care about Matrix and want it to succeed :v:
Extracting a broader overview on the state of the problem from that data is not easy to do.
Absolutely. I can't wait for the day when we can say "it's fixed!" Until then it's a bit of a whack-a-mole situation: when we fix something we try to be sure that we are definitely improving the behaviour, and we try to be super-sure we're not breaking any other fixes, but we don't usually know which symptoms will go away after a specific fix. That makes it tricky to give proper updates, but we'll try to do better in future by posting updates here.
I regularly find myself in a situation where I'm acting as an ambassador of sorts for the ecosystem (as I'm sure many other admins of their local communities do). This particular issue has been a real hotspot on that front, which is why it would be tremendously helpful, in order to give confident answers to users, if there was an update on this high-level issue once in a while.
Thank you for your work! I am personally really aware of how painful this has been, and that can't have made it easy for you as an advocate.
Hopefully this doesn't come off as too pushy. We all care about Matrix and want it to succeed :v:
Not at all. Thank you for the feedback.
Hopefully this doesn't come off as too pushy. We all care about Matrix and want it to succeed :v:
Same here, and as most people who have been closely following the issues related to threads these last few months (since they have been out of beta and sometimes even before), I want to continue advocating for Matrix/Element and support all the good people working on it.
Considering that these issues are various, complex, unpredictable and should take a lot of time and effort until threads work "as expected" (which is in itself quite a challenge, given how user expectations can be subjective!), wouldn't it be possible to roll them back (disable Threads by default and/or make them a Labs feature again)? I'm thinking it would also release at least some of the pressure to take the time needed to work on this with proper conditions.
Thanks everyone for bearing with us. We are aware of the problems and, in fact, we do suffer ourselves as we're all using Element for work. We're working on resolving this situation with the resources currently available to us. I acknowledge that we haven't always been good at communicating status externally. We will try to improve this going forward. I've updated the issue description above with a summary of the currently known problems and a high-level action plan for our next steps.
our early work uncovered tons of new issues that were masked by the fact that we had a bug incorrectly marking threads as read when they were not.
I guess this explains why threads have become particularly noisy recently 😅. I appreciate that it’s important to get everything working as it should, but there are a few buttons/behaviours that I imagined would be there that would help immensely, and would probably be easier (quicker) to implement: in order from “expected feature” to “bandaid fix”:
Personally I’ve gone from checking Matrix whenever there is a notification to checking it whenever I get bored. While I imagine this looks good on my usage stats, it’s murdering my productivity in the same way Reddit and Twitter do 😁. I don’t like complaining, but would very much like to see this fixed so Matrix can be restored to its rightful status as an awesome and productive chat app :)
@ExplodingWaffle we can't currently reliably mark threads as read due to the technical issues explained in the issue description. While I empathize with some of your thoughts, this is also not the right place to discuss thread-related product changes. We're purely concerned with making notifications not stick after you've read them here.
We've also had the problem with persistent or recurring notifications for a while. What I noticed is that it only affects edited messages with mentions. Maybe that helps.
Can we at least ask for a button "mark every single message as read across all spaces"?
Even restarting Element doesn't help anymore, and I can't tell which channels have new messages and which don't.
Can we at least ask for a button "mark every single message as read across all spaces"?
Already in Settings > Notifications
Already in Settings > Notifications
Hmm, not for me (using 1.11.35, electron version)
@mutantcornholio then the app isn't seeing that sending manual read receipts would fix the issue.
I updated the summary at the top of this issue to reflect our current understanding. This week I will be pushing https://github.com/matrix-org/matrix-spec-proposals/pull/4033 and trying to debug some of the known issues to identify their causes.
@mutantcornholio then the app isn't seeing that sending manual read receipts would fix the issue.
Maybe it should also remove every black circle, just from UI?
Otherwise, it's more like "resend read receipts", not "mark all as read"...
They'd just come back when you restart the app
The thing is, I'm constantly trying to restart the app in order to get stuck notifications to disappear (helps in some cases).
Currently, the button does not mark some messages as read at all. If it marked all of the messages as read, but some of them came back after a restart, I'd consider that "less broken".
Contributions welcome, but I doubt there'll be any bandwidth from the team to work on anything other than the final solution
@mutantcornholio then the app isn't seeing that sending manual read receipts would fix the issue.
Maybe it should also remove every black circle, just from UI? Otherwise, it's more like "resend read receipts", not "mark all as read"...
At the risk of piling on, this highlights one of the big disconnects for me. As a user viewing my local UI, it doesn't matter to me whether or not read receipts have been sent, received, or whatever. Notifications are an entirely local experience. When I have seen the messages, I consider them read regardless of any network requests. I think this will always be a frustrating experience (and prone to bugs like this) when the UI that tells me whether or not I have read a message is tied to state or network requests that are entirely unrelated to that.
@bhearsum I hear your pain, and I am experiencing it too, but I think we shouldn't give up on read states working properly. It's a brilliant feature that if you read a message on your PC it is also marked as read on your phone, so we're working hard to fix the bugs so that this stuff works properly.
Sending all the devs working on resolving this issue lots of good luck and fortitude! We run our startup's chat infra on Matrix/Element and this bug has almost led to a team mutiny. Hope it gets resolved soon!
Sending all the devs working on resolving this issue lots of good luck and fortitude! We run our startup's chat infra on Matrix/Element and this bug has almost led to a team mutiny. Hope it gets resolved soon!
We ended up switching... People were missing important messages and things got lost. We ended up keeping our server as an archive, but considering how long it took for active investigation to even start, the decision was made to switch away from Element (the only very active and full-featured Matrix client).
It only makes sense for me in the security-related sphere, but given that there is no single reliably working client (FluffyChat is a good candidate, but lacks active bug fixing), it's a problem as well.
Maybe this is a crazy idea, but given the gravity of the situation here (folks are switching away), how about reverting the fix for the issue that exposed all of this? It seemed to work well enough before.
Keep it in a branch, and then get all of the follow-up issues working in there before merging. What do you all think?
Maybe this is a crazy idea, but given the gravity of the situation here (folks are switching away), how about reverting the fix for the issue that exposed all of this? It seemed to work well enough before.
At this point we'd need to revert 20-30+ PRs over 2 projects, the likelihood of missing one and making the situation even worse is high.
To add to that, this isn't a trivial regression where a commit introduced a bug and can be backed out again. We started seeing a class of stuck notification problems after threads moved out of labs. Going back to the working state before would require not just reverting the various incremental fixes made so far but also re-labsing threads (since we have made the fixes for a reason).
To add to that, this isn't a trivial regression where a commit introduced a bug and can be backed out again. We started seeing a class of stuck notification problems after threads moved out of labs. Going back to the working state before would require not just reverting the various incremental fixes made so far but also re-labsing threads (since we have made the fixes for a reason).
To be perfectly honest, threads didn’t work very well in that respect when in labs either. Probably better than now, but it could just be a cumulative thing.
Yes, point taken very much that threads probably moved out of beta prematurely.
The only thing I don't understand is: Why can't the apps temporarily store the message-IDs that have been read locally (or at least the ID of the last read message in a room if that's too much data) until this is resolved? This would alleviate the problem a lot. It doesn't happen on mobile for me, BTW. And I don't have threads enabled either.
Folks, this is a complicated issue spanning multiple months of work and many PRs already, it's escalated as far as can be. Piling more pressure on top or "how hard can it be, did you try simply doing X?" is unlikely to move things along faster at this point.
Let's please try and keep this issue focused on progress reports from the team to make it easy for everyone to follow along.
We are experiencing multiple issues in the area of "stuck" notifications and unread markers. Many are related to the way receipts interact with threaded messages.
Symptoms
We are seeing different symptoms of the problem:
Spec-level causes
Message ordering
Fundamentally, in order to interpret the meaning of a receipt that says "I have read everything up to here", we need to know what order messages are in. This is not clear in the spec, and we propose to make it clear and explicit in MSC4033.
In the meantime, Element Web uses a combination of "sync" order (the implicit order of events arriving via a /sync request) and "timestamp" order (using the ts property within events). Some of the existing bugs are probably caused by this inconsistency, but it is not clear yet how many: we believe there are also bugs in the implementation that cause additional problems, and this theoretical inconsistency is only the cause of a few problems.
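As a rough illustration of why the ordering matters, the sketch below evaluates the same receipt under both orderings. The TimelineEvent shape, the syncIndex field, and the unreadAfter helper are hypothetical names invented for this example; they are not Element Web's or matrix-js-sdk's actual types or logic.

```typescript
// Hypothetical minimal event shape; real Matrix events carry many more fields.
interface TimelineEvent {
    eventId: string;
    ts: number;        // timestamp carried by the event, in ms
    syncIndex: number; // assumed: position in which the event arrived via /sync
}

type Order = (a: TimelineEvent, b: TimelineEvent) => number;

const bySyncOrder: Order = (a, b) => a.syncIndex - b.syncIndex;
const byTimestamp: Order = (a, b) => a.ts - b.ts;

// Under a given ordering, an event counts as unread if it sorts strictly
// after the event that the receipt points to.
function unreadAfter(events: TimelineEvent[], receiptEventId: string, order: Order): string[] {
    const receipt = events.find((e) => e.eventId === receiptEventId);
    if (receipt === undefined) {
        // The receipt points at an event we do not know: everything looks unread.
        return events.map((e) => e.eventId);
    }
    return events.filter((e) => order(e, receipt) > 0).map((e) => e.eventId);
}

const events: TimelineEvent[] = [
    { eventId: "$a", ts: 1000, syncIndex: 0 },
    { eventId: "$b", ts: 3000, syncIndex: 1 }, // arrives second but carries the latest timestamp
    { eventId: "$c", ts: 2000, syncIndex: 2 },
];

// The user has read up to "$c", the last event to arrive.
const unreadSync = unreadAfter(events, "$c", bySyncOrder);
const unreadTs = unreadAfter(events, "$c", byTimestamp);
console.log(unreadSync, unreadTs);
```

In this toy case the receipt on "$c" clears everything under sync order, while under timestamp order "$b" still counts as unread, which is the kind of disagreement that could surface as a stuck notification.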
Which thread the root belongs to
The spec has what we consider a bug when it talks about which thread the root message belongs to, which has been reflected in client code, making it inconsistent with the server implementation (at least on the Synapse server). We have a proposal to fix this bug in MSC4037.
Identifying which thread any message is in
It is sometimes difficult for clients to identify which thread an event belongs to, meaning that a receipt pointing to it is sometimes ignored. We have begun drafting MSC4023 to address this.
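To make the difficulty concrete, here is a minimal sketch of resolving an event's thread from its m.relates_to content. The simplified SketchEvent shape, the threadIdOf helper, and the depth limit are assumptions made for illustration; treating the thread root as part of the main timeline is also an assumption here, since that question is itself what MSC4037 addresses.

```typescript
// Hypothetical, simplified shapes based on the `m.relates_to` content field.
interface RelatesTo {
    rel_type?: string;
    event_id?: string;
}
interface SketchEvent {
    event_id: string;
    content: { "m.relates_to"?: RelatesTo };
}

const MAIN_TIMELINE = "main";

// Returns the thread ID for an event, or undefined when the answer depends on
// a parent event that is not available locally.
function threadIdOf(
    event: SketchEvent,
    known: Map<string, SketchEvent>,
    depth: number = 0,
): string | undefined {
    const rel = event.content["m.relates_to"];
    if (rel === undefined || rel.rel_type === undefined || rel.event_id === undefined) {
        // No relation: assumed to live in the main timeline (this includes thread roots).
        return MAIN_TIMELINE;
    }
    if (rel.rel_type === "m.thread") {
        // Threaded message: the relation points directly at the thread root.
        return rel.event_id;
    }
    // Other relations (e.g. reactions, edits) inherit the thread of their parent.
    const parent = known.get(rel.event_id);
    if (parent === undefined || depth > 8) {
        return undefined; // parent unknown locally: the thread cannot be determined
    }
    return threadIdOf(parent, known, depth + 1);
}

const root: SketchEvent = { event_id: "$root", content: {} };
const reply: SketchEvent = {
    event_id: "$reply",
    content: { "m.relates_to": { rel_type: "m.thread", event_id: "$root" } },
};
const reaction: SketchEvent = {
    event_id: "$react",
    content: { "m.relates_to": { rel_type: "m.annotation", event_id: "$reply" } },
};

const known = new Map<string, SketchEvent>([
    [root.event_id, root],
    [reply.event_id, reply],
]);

console.log(threadIdOf(reaction, known));     // resolved through the parent
console.log(threadIdOf(reaction, new Map())); // parent missing
```

When the parent is missing, the helper returns undefined, and a receipt pointing at such an event cannot be assigned to a thread without first fetching the parent.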
Other
Previously, we believed that MSC3981 (recursive relations) would solve some of the problems, but since that MSC does not solve the event-ordering problem (because the events from the /relations API are returned in "topological" order) we no longer believe it is important, except as a performance optimisation.
Code-level causes
We have found and fixed several bugs in the Element Web code that were caused by an incomplete understanding of the meaning of threaded and unthreaded read receipts. We anticipate that some more exist.
(We believe that the primary reason why we're not seeing the same problems on mobile is that the apps persist events they've received, whereas Element Web has to re-fetch from scratch after every launch. As a result, any issue in the unread state logic strikes again and again. The apps also use a single timeline, whereas Element Web maintains one timeline per thread in addition to the main timeline in every room.)
High-level plan of actions
We believe a lot of progress can still be made without spec changes. So we're slightly deprioritising work on the MSCs.
New issue inbox
The following is a holding area for newly reported issues that require review. Once reviewed, issues should either be moved to one of the other task lists below or, if not applicable, removed from this epic.
Tasks not blocked by spec work
Tasks that are related to or dependent on spec work
We've written the following MSCs to try and address the root causes in a reliable and performant way:
/event to fetch the parent has been implemented. This is functionally correct but has a noticeable performance impact.
Issues that are related but out of scope
Time sheeting
WEB: Stuck notifications