Old events can't be loaded after joining a room; reconnecting fixes

MagicRB commented 1 year ago

ement version: 9.0-pre emacs version: some master revision, recent pantalaimon: in use, custom fork but same behavior on the upstream version

If i attempt to scroll beyond the available messages, I get this error, which also causes minor Emacs freezes.

error in process sentinel: Ement API error: "Loading 30 earlier messages failed", #s(plz-error nil #s(plz-response 1.1 400 ((access-control-allow-headers . "Origin, X-Requested-With, Content-Type, Accept, Authorization") (access-control-allow-methods . "GET, POST, PUT, DELETE, OPTIONS") (access-control-allow-origin . "*") (content-type . "application/json") (content-length . "61") (date . "Mon, 15 May 2023 10:09:07 GMT") (server . "Python/3.10 aiohttp/3.8.3")) "{\"errcode\":\"M_UNKNOWN\",\"error\":\"'from' parameter is invalid\"}") nil) [2 times]

alphapapa commented 1 year ago

I just tested it, and it works for me.

The error message you're getting from the server:

"{\"errcode\":\"M_UNKNOWN\",\"error\":\"'from' parameter is invalid\"}"

suggests that it's specific to a certain room. So does it happen on every room you're in?

MagicRB commented 1 year ago

It doesn't happen all the time but also not specific to a room, it happens to all rooms sometimes.

alphapapa commented 1 year ago

I just tried it on four rooms that I'm in, and they all work fine.

Can you reproduce the problem if you try it in public, unencrypted rooms without connecting through Pantalaimon?

What homeserver do you use? Does it happen on a matrix.org account?

MagicRB commented 1 year ago

Ill test everything tomorrow, for the stuff i can now, I use synapse, almost latest but it has been happening for a long time.

alphapapa commented 1 year ago

So you're running your own homeserver? If so, I'd suggest checking some details of the rooms that are exhibiting this problem. For example, are they an old room version? Do they have any abnormalities, any corrupt data in their state? If you make a new room, does the problem still happen (after adding enough new events to the timeline to allow scrolling back, of course)? What if you join a room hosted on a different homeserver, one that is decently sized and busy, and try to scroll back to old events (e.g. #debian:matrix.org, or #emacs:libera.chat)?

AFAIK Ement correctly handles the from/to tokens returned by the server with event chunks. This error would suggest that the server is missing history for that room, or something like that. But I haven't run a homeserver before, and I don't know much about the server-to-server protocol.

alphapapa commented 1 year ago

From talking with another user just now, it seems that the issue is that, after joining a room, the from token is not set correctly. But disconnecting and reconnecting solves the problem, allowing old events to be loaded, which suggests that the token is received in the initial sync.

phil-s commented 1 year ago

I just hit this testing against the staging instance of our local Matrix server, which is running version 1.85.2.

{"errcode":"M_UNKNOWN", "error":"'from' parameter is invalid"}

The logged request from ement was passing from=nil:

GET /_matrix/client/r0/rooms/%21CGHZxpnaATGInFnfJm%3A<domain>/messages?from=nil&dir=b&limit=300&filter=%7B%22lazy_load_members%22%3Atrue%7D

My test cases are very limited (there were only a couple of rooms on this server, both with short histories), but my impression is this only happens when I've already reached the beginning of the room history and there are no more messages to be fetched, as I was able to fetch what history existed before encountering this error.

Thus far it seems to be happening consistently for me on this server, including when I tried creating a new room and then using page-up to scroll back in that. I've not yet tried disconnecting and reconnecting; I figure if it's in a testable state now, I shouldn't mess with it.

I can't confirm that the server version is relevant, but I don't recall ever seeing this happen with our production instance (which is currently running a much older v1.47.1 of the server software). In the rooms I've tested there in the same state (all history already fetched) just now, I just get these messages with no error:

Loading 300 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 0 earlier events.

As opposed to:

Loading 300 earlier messages...
error in process sentinel: Ement API error: "Loading 300 earlier messages failed", #s(plz-error nil #s(plz-response 2 400 ...

alphapapa commented 1 year ago

@phil-s Ok, I guess we need to ensure that the prev-batch token is non-nil before sending a request: https://github.com/alphapapa/ement.el/blob/909abd44427ba1d3b9e83eda422c520dbe55880b/ement-room.el#L1368-L1397

phil-s commented 1 year ago

Aha, https://spec.matrix.org/latest/client-server-api/#get_matrixclientv3roomsroomidmessages says from became optional in version 1.3. I would guess that related changes meant that the token values that from would be sourced from were also not guaranteed to be provided.

The token to start returning events from. This token can be obtained from a prev_batch or next_batch token returned by the /sync endpoint, or from an end token returned by a previous request to this endpoint.

This endpoint can also accept a value returned as a start token by a previous request to this endpoint, though servers are not required to support this. Clients should not rely on the behaviour.

If it is not provided, the homeserver shall return a list of messages from the first or last (per the value of the dir parameter) visible event in the room history for the requesting user.

Changed in v1.3:

Previously, this field was required and paginating from the first or last visible event in the room history wasn’t supported.

phil-s commented 1 year ago

To me that sounds like ement.el could (in the absence of prev-batch) check to see if it had a start value and if so use that value; otherwise... it's not clear to me whether the right thing is to issue the request without a from parameter, or to not issue a request at all? (The former sounds like it wouldn't obtain any events the client didn't already have, so would be pointless? Or did I misinterpret that?)

phil-s commented 1 year ago

Oh; no I'm missing that "paginating from the first visible event" in this case is moving backwards in history to earlier events. So making the request without from seems like the right thing.

phil-s commented 1 year ago

I've tested this simple change, and I believe it's working as desired in my case.

The first change avoids errors in notification/mention buffers.

modified   ement-room.el
@@ -1376,16 +1376,17 @@ see."
                        (list (read-number "Number of messages: "))
                        (number current-prefix-arg))
                      :buffer (current-buffer)))
-  (unless ement-room-retro-loading
+  (unless (or ement-room-retro-loading (not room))
     (pcase-let* (((cl-struct ement-room id prev-batch) room)
                  (endpoint (format "rooms/%s/messages" (url-hexify-string id))))
       ;; We use a timeout of 30, because sometimes the server can take a while to
       ;; respond, especially if loading, e.g. hundreds or thousands of events.
       (ement-api session endpoint :timeout 30
-        :params (list (list "from" prev-batch)
-                      (list "dir" "b")
-                      (list "limit" (number-to-string number))
-                      (list "filter" (json-encode ement-room-messages-filter)))
+        :params (nconc (and prev-batch
+                            (list (list "from" prev-batch)))
+                       (list (list "dir" "b")
+                             (list "limit" (number-to-string number))
+                             (list "filter" (json-encode ement-room-messages-filter))))
         :then then
         :else (lambda (plz-error)
                 (when buffer

phil-s commented 1 year ago

Ah, there's still an issue here. Once at the beginning of the room history, repeated calls to ement-room-retro are alternating between fetching zero messages and fetching the first page again. The end result is that the room buffer is showing me N duplicates of each event.

Clearly the room event data is also stored somewhere outside of the room buffer, as killing the buffer and re-visiting the room gets me all the same duplicates.

(Is there way to purge the data for the current buffer's room?)

alphapapa commented 1 year ago

Purging events shouldn't be necessary. I think we don't fully understand the problem here and the intended solution.

phil-s commented 1 year ago

Purging events shouldn't be necessary.

Perhaps, but it would seem useful for debugging. I'd certainly like to be able to do this and then view the room again as if it was the first time I'd done it this session.

ement-forget-room isn't what I'm after, but I do see the following in its code, which seems like it might be the thing?

(setf (ement-session-rooms session)
      (cl-remove room (ement-session-rooms session)))

alphapapa commented 1 year ago

I'm afraid it's not quite that simple. If you removed the room struct from the session's rooms slot, you'd also need to replay the room's creation and metadata events in order for it to be properly recreated.

You can "view the room again as if it was the first time I'd done it this session" by simply killing the room's buffer and using ement-view-room again. But a room buffer is only a view of the room's data; we don't create a buffer for a room unless the user chooses to view it.

Maybe you could try something like this:

Make a new room struct with the same room ID. Assign it to a variable that's not part of Ement.
Use the appropriate Ement commands and functions on that room struct to retrieve timeline events, adjusting the from/prev-batch token as needed.

Although that may cause errors unless you avoid processing the events normally, which would try to add them to the appropriate structs in the session. Basically, I guess you just want to issue API calls, which you could do by just calling ement-api with the session struct and the appropriate URL and JSON body.

But again, I don't think we (or at least, I) fully understand the problem here. Looking again at my comment here: https://github.com/alphapapa/ement.el/issues/148#issuecomment-1561358994, I'm guessing that the problem is that we aren't recording the token correctly when we receive the event indicating that the user joined the room, but that the same token is correctly recorded when received in an initial sync request.

Another possibility is that this is related to rooms' history visibility settings, that we aren't recognizing when a room doesn't allow the user to view history from before the user joined.

Anyway, it would be best if we could define a reproducible test case (e.g. involving creating a test room, maybe on a second account, and then joining on the main account and reproducing the problem). That would probably make it easy to understand and solve.

phil-s commented 1 year ago

You can "view the room again as if it was the first time I'd done it this session" by simply killing the room's buffer and using ement-view-room again.

That wasn't my experience when I was seeing duplicate events. The Matrix server didn't have duplicates of the events, and the first time I visited the room there were no duplicates. But Emacs had ended up with dups, and killing the buffer and re-visiting the room from the room list created a new room buffer with all the duplicates, rather than recreating what I'd seen when initially visiting the room.

(After disconnecting and reconnecting, things went back to normal.)

alphapapa commented 1 year ago

That wasn't my experience when I was seeing duplicate events. The Matrix server didn't have duplicates of the events, and the first time I visited the room there were no duplicates. But Emacs had ended up with dups, and killing the buffer and re-visiting the room from the room list created a new room buffer with all the duplicates, rather than recreating what I'd seen when initially visiting the room.

Ok, I must have misunderstood what you meant. We really do need a set of steps to reproduce the problem. :)

phil-s commented 1 year ago

I'm planning to spend some time on ement on Friday, so I'll take another look at this then and hopefully other things as well. I don't know how much I'll get done, but I'm mentioning it in case you have some hot tips on debugging.

One thing is that some absolutely huge data structures are thrown around, so I'm curious as to whether you use any particular techniques for managing these.

Another thing I was wondering about was general Matrix event logging. I guess ement-event-hook and the ement-defevent definitions are the things to use. Do you have any handy systems for logging events when debugging?

Any tips you may have on these points or anything else are welcomed :)

alphapapa commented 1 year ago

I'm planning to spend some time on ement on Friday, so I'll take another look at this then and hopefully other things as well. I don't know how much I'll get done, but I'm mentioning it in case you have some hot tips on debugging.

One thing is that some absolutely huge data structures are thrown around, so I'm curious as to whether you use any particular techniques for managing these.

This can indeed be a problem. Incidentally, it used to be much worse, before I changed the data structure to not have circular references (rooms pointed to their events which pointed back to their rooms). I had to unfreeze or kill -SIGUSR2 a number of times. :)

Having made that change, with recent Emacs versions I don't often have this problem anymore. But if it does happen, it can be mitigated by setting print-level and print-length to small values. (Of course, those should remain nil by default, otherwise various things in Emacs that print values and assume that those variables are nil may not behave as expected, which can manifest in unexpected and unpleasant ways. But it's usually no problem to set them to nil while working on such a problem. Just probably a good idea to set them back to nil before exiting Emacs or doing anything that might persist some data.)

Another tip might be to use the inspector package to look at internal data structures when needed. AFAIK it can handle large and recursive structures without performance problems.

Other than that, I sometimes use M-: and eval various variables with or without various accessors. Occasionally I might write a few lines of code to call cl-find or something to pick out a particular thing or count how many of a thing there are, etc.

Another thing I was wondering about was general Matrix event logging. I guess ement-event-hook and the ement-defevent definitions are the things to use. Do you have any handy systems for logging events when debugging?

You could certainly use those to do that, but I haven't needed to yet.

Any tips you may have on these points or anything else are welcomed :)

You might take a look at the ement-debug macro. It allows a way to insert useful debugging statements without writing boilerplate and have it be compiled out unless enabled at compile time (and then only activates if enabled at runtime also). So you can feel free to sprinkle it around and not worry about performance. I might submit it upstream someday; when I asked, it was suggested that I do so, but my hestitation is the little hack it uses to get the containing function's name at compile time (which seems to work fine, but it is a hack).

If you have any other questions, please feel free to ask. I appreciate your help.

phil-s commented 1 year ago

That all sounds really helpful, thank you! I'm not familiar with inspector and for some reason hadn't even noticed ement-debug. I did play with backtrace-line-length a bit, but the default value isn't bad (nil is definitely bad :)

alphapapa commented 1 year ago

I just encountered this problem myself. As mentioned, this change does indeed fix the error:

diff --git a/ement-room.el b/ement-room.el
index a7eb288..a8444f7 100644
--- a/ement-room.el
+++ b/ement-room.el
@@ -1382,10 +1382,12 @@ (cl-defun ement-room-retro
       ;; We use a timeout of 30, because sometimes the server can take a while to
       ;; respond, especially if loading, e.g. hundreds or thousands of events.
       (ement-api session endpoint :timeout 30
-        :params (list (list "from" prev-batch)
-                      (list "dir" "b")
-                      (list "limit" (number-to-string number))
-                      (list "filter" (json-encode ement-room-messages-filter)))
+        :params (remq nil
+                      (list (when prev-batch
+                              (list "from" prev-batch))
+                            (list "dir" "b")
+                            (list "limit" (number-to-string number))
+                            (list "filter" (json-encode ement-room-messages-filter))))
         :then then
         :else (lambda (plz-error)
                 (when buffer

But as Phil said, some of the retrieved messages end up being duplicates. Shouldn't be too hard to dedupe them...

alphapapa commented 2 months ago

For future reference, and FWIW, I went down the rabbit hole and arrived at this: https://github.com/matrix-org/matrix-js-sdk/blob/687d08dc9d9c618c628ef59031fee2f8f81554cd/src/models/event-timeline-set.ts#L418 That shows how the Matrix JS SDK handles room timelines and retrieving old events and adding the ones it receives to the timelines it maintains for a room. I was hoping there was some trick that I just didn't know about, since I can't find any special behavior for getting the first from token when joining a room. But apparently there's no trick, they just expect to get some duplicate events, and they discard them. So I guess that's what we need to do: deduplicate events received when loading old events. And that's easy enough for us, since we maintain a hash table of event IDs.

alphapapa commented 2 months ago

I think this branch will fix the problem: https://github.com/alphapapa/ement.el/compare/wip/148-retro-after-join If anyone would like to test it, please do.

phil-s commented 1 month ago

I've applied this patch and done some initial testing. I'm not seeing the "'from' parameter is invalid" error any more. It's not quite working right yet, though. I now get the following error, and the room buffer is not updated.

Loading 100 earlier messages...
Ement: Processing earlier events... 
Error running timer `plz--respond': (wrong-type-argument ement-event nil)

However the retro messages do in fact make it into the session -- if I kill the buffer and revisit the room, I see them.

The error only happens the first time I try. Subsequent use of the command has no obvious effect.

I've not tried to debug this, but I suspect the vector argument to ement-room--process-events here is the notable portion of the backtrace:

Debugger entered--Lisp error: (wrong-type-argument ement-event nil)
  signal(wrong-type-argument (ement-event nil))
  (or (progn (and (memq (type-of event) cl-struct-ement-event-tags) t)) (signal 'wrong-type-argument (list 'ement-event event)))
  [...]
  (setq handler (alist-get (progn (or (progn (and (memq (type-of event) cl-struct-ement-event-tags) t)) (signal 'wrong-type-argument (list 'ement-event event))) (aref event 5)) ement-room-event-fns nil nil #'equal))
  [...]
  ement-room--process-events([nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ...])
  ement-room-retro-callback(...)

alphapapa commented 1 month ago

Hi Phil,

Apologies, then; I wrote and reviewed the code carefully, but apparently not carefully enough. (Maybe it was too late at night to be coding, haha.) I'll let you know when I have another update.

phil-s commented 1 month ago

               finally do
               (setf (ement-room-timeline room) (append (ement-room-timeline room)
                                                        (seq-remove #'null chunk))))

Looks like up there we're putting a nil-less copy of chunk into the timeline, but down below we're using the one with the nil values:

      (when buffer
        ;; Insert events into the room's buffer.
        (with-current-buffer buffer
          (save-window-excursion
            ;; NOTE: See note in `ement--update-room-buffers'.
            (when-let ((buffer-window (get-buffer-window buffer)))
              (select-window buffer-window))
            ;; FIXME: Use retro-loading in event handlers, or in --handle-events, anyway.
            (ement-room--process-events chunk)

(Maybe it was too late at night to be coding, haha.)

I'm familiar with that phenomenon :/

phil-s commented 1 month ago

I'll test with this:

modified   ement-room.el
@@ -2520,7 +2520,8 @@ ement-room-retro-callback
                ;; likely very few compared to the number of timeline events, which is
                ;; what the user is interested in (e.g. when loading 1000 earlier
                ;; messages in #emacs:matrix.org, only 31 state events were received).
-               (progress-max-value (* 3 num-events)))
+               (progress-max-value (* 3 num-events))
+               (chunk-unseen nil))
     ;; NOTE: Put the newly retrieved events at the end of the slots, because they should be
     ;; older events.  But reverse them first, because we're using "dir=b", which the
     ;; spec says causes the events to be returned in reverse-chronological order, and we
@@ -2554,8 +2555,9 @@ ement-room-retro-callback
                     (ement--put-event event nil session))
                (ement-progress-update)
                finally do
-               (setf (ement-room-timeline room) (append (ement-room-timeline room)
-                                                        (seq-remove #'null chunk))))
+               (setf chunk-unseen (seq-remove #'null chunk)
+                     (ement-room-timeline room) (append (ement-room-timeline room)
+                                                        chunk-unseen)))
       (when buffer
         ;; Insert events into the room's buffer.
         (with-current-buffer buffer
@@ -2564,7 +2566,7 @@ ement-room-retro-callback
             (when-let ((buffer-window (get-buffer-window buffer)))
               (select-window buffer-window))
             ;; FIXME: Use retro-loading in event handlers, or in --handle-events, anyway.
-            (ement-room--process-events chunk)
+            (ement-room--process-events chunk-unseen)
             (when set-prev-batch
               ;; This feels a little hacky, but maybe not too bad.
               (setf (ement-room-prev-batch room) end))

alphapapa commented 1 month ago

Thanks, Phil, good eye.

phil-s commented 1 month ago

I've been running with these changes (including the diff above) for a while now, and I've just done some more purposeful testing, and I believe this is behaving correctly.

I haven't seen an error when trying to fetch more history, and when I try to do so when already at the beginning of the history I don't get any duplicates.

I do get some inconsistent messages which I don't understand, and it would probably be nice if Ement could say something like "No prior history available", but my impression is that this could be merged as-is to improve things.

This example was from repeatedly calling ement-room-scroll-down-command at the start of the buffer when all history was already present:

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 100 earlier events.

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 100 earlier events.

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 6 earlier events.

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 0 earlier events.

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 100 earlier events.

alphapapa commented 3 weeks ago

I pushed a couple more fixes to the branch: https://github.com/alphapapa/ement.el/compare/wip/148-retro-after-join Including one for the behavior noted in https://github.com/alphapapa/ement.el/issues/148#issuecomment-2308660732 Now the room's token is not cleared if the response doesn't include one, so it doesn't go through that looping pattern.

Please let me know if this branch still seems to work correctly. If it does, I'll merge it.

phil-s commented 3 weeks ago

Please let me know if this branch still seems to work correctly. If it does, I'll merge it.

It works well so far as I can see -- and once I've obtained all the history I now consistently see:

Loading 100 earlier messages...
Ement: Processing earlier events... 
Ement: Loaded 0 earlier events.

So that seems good.

alphapapa commented 3 weeks ago

@phil-s Thanks for your help testing.

@MagicRB Thanks for reporting this, and for your patience.

MagicRB commented 3 weeks ago

I wish I could have helped more, i ended up switching off ement.el due to a relience on E2EE and pantalaimon being just bad... but i do plan to come back in the near future

alphapapa / ement.el

Old events can't be loaded after joining a room; reconnecting fixes #148