ProtonMail / proton-bridge

Proton Mail Bridge application
GNU General Public License v3.0
1.14k stars 155 forks source link

Message UIDs are not stable / possible data loss (deletion of wrong messages) #220

Closed richardweinberger closed 1 year ago

richardweinberger commented 2 years ago

My colleges and myself facing wired issues with ProtonMail Bridge and various mail clients (getmail, Kmail, Apple Mail). Symptoms include:

All our mailboxes are large (50k to 100k mails) . This is maybe why we're facing the issue so easily.

Finally I found the proof that UIDs as presented by ProtonMail Bridge are not stable: As of now the bridge uses UID 51950 for the most recent message in my inbox.

a SELECT INBOX
* FLAGS (\Seen \SEEN \Flagged \FLAGGED \Deleted \DELETED \Draft \DRAFT $Junk Junk NonJunk)
* OK [PERMANENTFLAGS (\Seen \SEEN \Flagged \FLAGGED \Deleted \DELETED \Draft \DRAFT $Junk Junk NonJunk)] Flags permitted.
* 51950 EXISTS
* 0 RECENT
* OK [UIDNEXT 51952] Predicted next UID
* OK [UIDVALIDITY 4] UIDs valid
a OK [READ-WRITE] SELECT completed
a FETCH 51950 (FULL)
* 51950 FETCH (FLAGS (\Seen NonJunk) INTERNALDATE "29-Sep-2021 21:00:02 +0200" RFC822.SIZE 7129 ENVELOPE ("Wed, 29 Sep 2021 21:00:02 +0200" "Cron XXXX" (("(Cron Daemon)" NIL "root" "CENSORED")) (("(Cron Daemon)" NIL "root" "CENSORED")) NIL ((NIL NIL "root" "CENSORED")) NIL NIL NIL "<20210929191942.2F34CEBAD@CENSORED>") BODY ("text" "plain" ("charset" "utf-8") NIL NIL "quoted-printable" 4580 185))
a OK FETCH completed

As you can see UID 51950 is a mail from my cron daemon with date 29.09.2021 21:00. UIDVALIDITY is 4.

When I look at the getmail status file from my backup of 27.09.2021, I see the following line: 4/51950^@1632752979

The line decodes as "getmail downloaded mail with UID 51950 and UIDVALIDITY 4 on UNIX time 1632752979 (27.09.2021 - 16:29:39)"

This is the reason why I don't get the above mail delivered into my Maildir! getmail thinks it has already downloaded that message and skips it.

Expected Behavior

UID/UIDVALIDITY pairs should be stable such that mail clients have a chance to detect what mails are new.

Version Information

ProtonMail Bridge br-1.8.9 getmail 6.18.4

vmivanov commented 1 year ago

A slight aside, disabling the Bridge cache does not resolve the issue - I've been running Bridge with disabled cache for many months, since before coming across this report. Certainly not a Bridge cache issue.

Somebody in the HN thread above had made this suggestion, so I thought I'd share my experience - it won't work.

nicowilliams commented 1 year ago

@vmivanov I assume the bridge needs to be stable (see my earlier comments), so that disabling its cache could cause the problem.

vmivanov commented 1 year ago

@nicowilliams, unfortunately that's not the case. The cache is enabled by default and this issue was/is present so I disabled it as a result. Many clients also maintain a local cache, or at least a message index - my rationale was that disabling the Bridge cache could potentially exclude one component of the equation since the mail client could be caching a copy of the data from Bridge's cache.

What I also wanted to try is disabling Thunderbird's cache and/or index so that every time a folder is queried it would force a full IMAP sync with the server, which could - in theory - work around this problem, albeit rather inconveniently, by always fetching a fresh copy of the message metadata. Unfortunately, I've not found a way force such behaviour even through about:config.

As for your comment on RFC compliance, that may well be the case. But IMAP has been about for a long time, I doubt every single widely adopted implementation would be non-compliant. It's not impossible, but all of the examples that @richardweinberger gave are extremely mature projects and something as weird as this would have become very obvious very soon (as with Bridge).

ploum commented 1 year ago

I find it strange that this bug first occured to me this Sunday then today with 3 others emails while I had been using the bridge without major problems since 2018. Also, this bug made the headlines on HN the same week I was hit and my setup is basically not usable (relying on protonmail-bridge to work offline on my email and making me consider a panic migration to another provider)

So my question is : did I hit some invisible threshold which, by pure luck, was close to the random HN buzz or did something happened this week that made the bug a lot more probable than it was before? (prompting users to look for it and post it on HN)

(For me the bug appears as mail being duplicated in mutt and taking place of others. I see AAA instead of ABC but in my mail list but archiving the second A in mutt will, in reality, archive B on the server)

ljanyst commented 1 year ago

The original post of @richardweinberger is not a proof of UIDs being unstable. The EXISTS response returns number of messages in the mailbox. Proton bumps UIDs by 1 for every new message. Therefore, the most recent message in the inbox had the UID of 51951 (UIDNEXT-1).

Have you guys stress tested bbolt. It may not be as thread-safe as it advertises.

richardweinberger commented 1 year ago

@ljanyst But https://github.com/ProtonMail/proton-bridge/issues/220#issuecomment-972226749 is IMHO a proof.

ljanyst commented 1 year ago

I am not saying what you say is false in general. In fact, I experience the issue myself. Just trying to get to the root of it.

ljanyst commented 1 year ago

Shouldn't bucket.NextSequence() fail in a View transaction as it modifies the bucket?

ljanyst commented 1 year ago

In fact, bbolt may schedule multiple read-only transactions at the same time and you may get a race.

dmtucker commented 1 year ago

did I hit some invisible threshold which, by pure luck, was close to the random HN buzz or did something happened this week that made the bug a lot more probable than it was before?

IDK what changed, but this problem definitely got worse for me recently. I had hit it before, but I hit it more frequently now.

ljanyst commented 1 year ago

That's true. It used to happen a lot for me with Thunderbird, but almost never with AppleMail. Since three or four days, I see it in AppleMail often too. I run the bridge version from a couple of months ago. So either something changed in AppleMail after update to Ventura or something changed to API IDs of the messages.

bergjs commented 1 year ago

I recently could not find an email that I was sure I received and now I know why. Logging into the webmail, I could find it, but not in Apple Mail. This makes the Protonmail Bridge practically unusable, which is a huge problem that should be addressed immediately.

ljanyst commented 1 year ago

After some more thought: How does bolt behave in case of unclean bridge shutdowns? Are the transactions peristed on transaction commit or do they stay transient in memory and get committed later? In particular, is the sequence of a bucket preserved? Bolt had a similar bug in the past: https://github.com/boltdb/bolt/issues/296

dsommers commented 1 year ago

IDK what changed, but this problem definitely got worse for me recently. I had hit it before, but I hit it more frequently now.

I am seeing these issues far more often now too, and it started after upgrading to Proton Mail Bridge v2.3.0. On my "IMAP backup server", which uses Bridge + offlinesync, I had to downgrade to v2.1.3 (the prior release I ran there), as it got just completely unreliable, where offlinesync would hang and lots of "cache deletion errors" in the Bridge.

On my desktop (with Bridge + Thunderbird) yesterday, it worked fine. Sending mails does takes a bit longer than normally. On Tuesday, I had to run the older v2.1.3 release, as it was completely unstable and unreliable again. It seems to vary a lot with the v2.3.0 release. Unless there are some server side changes happening on top of everything.

ploum commented 1 year ago

What is that I was using bridge 2.3.0 since its release without notable problem. Problems started to happen last week without I changed anything (I was on holiday on a weak/intermittent connection so I know for a fact that I didn’t changed anything on my setup). As the comments point the problem becoming more prevalent last week, at roughly the same time, this same to indicate that something changed on proton servers themselves (which might have exposed a previously rare bug in the bridge). But it looks like something changed on proton servers.

christophetd commented 1 year ago

+1 similar issue with Thunderbird. Some messages being marked as read, but still (correctly) unread on the webmail.

Amunak commented 1 year ago

All of this is even harder for IMAP<->whatever bridges, since if you want stable UIDs you'll need the bridge to keep state. But users and bridge devs want bridges to be stateless. Basically, a proxy that serves up IMAP can't be entirely stateless w/o help from the backend. If the backend is IMAP and has stable {UID, UIDVALIDITY}, or is not IMAP but has that feature anyways in order to help protocol transition proxies, then those can be stateless. But a general-purpose IMAP proxy/bridge really can't assume {UID, UIDVALIDITY} stability in the backend -- at best it can require it.

I mean Proton already requires you to use their crappy software; what they should've done a year ago is a dirty but simple fix: assign all messages a unique ascending ID as they arrive, pass it with other metadata onto Bridge, and let bridge translate those IDs to IMAP language (and back). No state (in Bridge) needed, and they should be able to push a fix like that in a week or two.

The fact that it's taken years and still isn't fixed is mind-boggling.

nicowilliams commented 1 year ago

@Amunak yes of course, they could and should have solved it much as you described.

richardweinberger commented 1 year ago

@Amunak yes of course, they could and should have solved it much as you described.

Did you look into the code?

ljanyst commented 1 year ago

@Amunak That would have indeed matched their labeling architecture better. @richardweinberger I did. I even forked and refactored it for my own purposes (https://github.com/ljanyst/peroxide). I don't see any obvious errors in UID handling. I could probably figure it out, but won't spend a week of my time on it for free. I will probably start looking for alternatives at this point.

richardweinberger commented 1 year ago

@ljanyst Thanks for confirming, this covers also my analysis from 2021.

bartbutler commented 1 year ago

The main reason we didn't do a server side UID mapping in favor of a client-side UID mapping is that client-side remapping of UIDs is very useful when generating local changes and offline mode.

ljanyst commented 1 year ago

Could bolt have a bug where removing the last couple of entries from the bucket could in some cases cause the bucket sequence number to rewind? After a fast read through the description of their architecture, I could totally see that kind of bug happening.

ploum commented 1 year ago

Is there any mitigation possible right now for users?

The problem is now so important than any single action in my local mailbox is applied to a random mail of that mailbox. It’s no more a glitch with some emails, it’s totally a random mess.

Understanding that a fix will take days if not week to reach me (through an update of the bridge), what can I do right now to make my email at least usable?

Amunak commented 1 year ago

@ploum You can temporarily fix it by "repairing" the affected folders in the account, or, (worst case) by removing and re-adding the Bridge IMAP account in your client.

Basically you need to force your client to re-download all IMAP mail for the account.

AwlsomeAlex commented 1 year ago

It would be helpful to know which configurations are broken... or something of a statement other than this issue. I'd like to use Bridge 2.4.3 with Thunderbird 107b3 on my M1 Mac, but will refrain from doing so at the risk of losing emails...

brandonnodnarb commented 1 year ago

Just came to chirp and say I am having this issue on my home desktop. Thunderbird 102.2.2 on Pop OS, kernel 6.0.3. I noticed this issue about 4 days ago. I was using ProtonMail Bridge as a flatpak, but as the icon was inexplicably no longer in the tray, I removed everything and re-installed using the DEB from the ProtonMail Bridge web page. Sadly, both issues persist.

I also recall a while back (perhaps 6 months) a similar enough issue happening on my iPhone, although I remember the duplicate email issue being more apparent, there were email missing; so I'm lumping it in with this issue. At the time, dumping the local cache (a few times IIRC) and re-syncing seemed to fix it there. I only mention this in this thread to illustrate it does seem to be a persistent issue, potentially across platforms.

ljanyst commented 1 year ago

So here's what I think is happening:

  1. imapUpdates creates a single element channel for forwarding IMAP updates
  2. sendIMAPUpdate creates a goroutine that blocks until the channel is free to receive updates
  3. When deleting a message txDeleteMessage computes a sequence number from UID and notifies the client about the deleted message using the sequence number.
  4. When deleting multiple messages, multiple deletion notifications are sent to the client using sequence numbers, therefore multiple goroutines blocked on the channel are created.
  5. The order of delivery of deletion notifications depends on the scheduling of the goroutines.
  6. There's a non-trivial probability that the client will get sequence-based deletion notifications out of order.
ljanyst commented 1 year ago

Reproducer:

  1. Add time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond) before select https://github.com/ProtonMail/proton-bridge/blob/master/internal/imap/updates.go#L204
  2. Then:
func TestRace(t *testing.T) {
    u := newIMAPUpdates()

    for i := uint32(0); i < 100; i++ {
        u.DeleteMessage("foo", "bar", i)
    }

    for i := uint32(0); i < 100; i++ {
        upd := <-u.ch
        require.Equal(t, i, upd.(*goIMAPBackend.ExpungeUpdate).SeqNum)
    }
}
kortschak commented 1 year ago

@ljanyst Which commit are you building against? The current master's newIMAPUpdates takes an *imapBackend as a param. https://github.com/ProtonMail/proton-bridge/blob/9eb4703d7af19988f52722fc808f8df722240eab/internal/imap/updates.go#L49

ljanyst commented 1 year ago

I was testing against my fork (https://github.com/ljanyst/peroxide) which is a couple of commits behind. It does not matter though, just do u := newIMAPUpdates(nil). You also don't need to put delay in the goroutine. The test triggers the race even without it.

kortschak commented 1 year ago

@ljanyst Confirmed. Thanks

ljanyst commented 1 year ago

In case you want to fix it yourself, just create a bigger channel and get rid of the goroutine.

kortschak commented 1 year ago

Yeah, the issue here is that a select is very specifically not a way to make a queue.

ljanyst commented 1 year ago

This seems to work reasonably well also with the initial sync: https://github.com/ljanyst/peroxide/commit/89164ae00632083a61a44ac51df70435e49e9901

andrzejsza commented 1 year ago

Thanks for this @ljanyst. Yes, this is one potential source of UID instability in the current version of bridge. Other potential sources are internal to the go-imap library itself.

These issues have been fixed by design in the upcoming v3 release using our new IMAP library (gluon). When an IMAP client connects to gluon and selects a mailbox, it sees a unique snapshot of the mailbox, protected from the activity of other connected clients. When gluon wants to send an IMAP response to a client, it computes the correct response to send according what the client currently sees in its snapshot. Furthermore, we guarantee the correctness of the order of responses we are sending to clients.

ljanyst commented 1 year ago

Be it as it may, after a couple of days of pretty intensive testing, at least the problem of missing messages seems fixed by this.

MCKLtech commented 1 year ago

@andrzejsza , what is the ETA on V3? Are we talking a few weeks or into 2023? If it's into 2023, that's not acceptable, given there is a viable patch from @ljanyst. At the very least the options should be communicated to customers given at least one of the causes has been confirmed. The silence from Proton on a paid product is deafening and unnerving.

andrzejsza commented 1 year ago

We are aiming for beta release in December, and are already deep into alpha testing. The solution proposed above definitely makes sense in V2 architecture - we can clearly see how it guarantees the right order being sent. We are not too keen to just merge and expose it to everyone though, as it has unknown potential of introducing other issues - for instance the arbitrary buffer size of 1k and it's impact on the range of clients/memory usage etc. We would definitely look into this more and spend any required time on testing, if this would not be handled in the upcoming V3 release, which is a few weeks away (beta). But since it is, we'd rather not diverge team's focus right now, especially taking into account our relatively slow release process (we're working on that also).

We would really appreciate your comments on the new library: https://github.com/ProtonMail/gluon and we would like to open-source the V3 even before beta to get your feedback on the new architecture. We will be posting update here and in other relevant channels.

dsommers commented 1 year ago

Please devs, please push far more frequent to the git repo so we can keep track of where you're headed and test along side it. Try to do the development far more transparently.

karlemilnikka commented 1 year ago

I just realized that this bug had more consequences than I initially thought. Since the Bridge has been so unreliable for years, I’ve mostly used it to keep offline copies of my emails. I’ve also copied some emails from Apple Mail to a local folder in Finder. Yesterday, I started browsing through the archived emails I Finder, and they were definitely not the emails I once dragged from Mail to Finder.

Luckily, I haven’t lost any data. It’s just an annoyance that many emails aren’t the ones I once decided to keep in my Finder folder structure.

kortschak commented 1 year ago

I think I have discovered an interesting corollary of this issue that explains an odd behaviour I've noted over the past couple of years. Deletions may not be reflected on the backing servers. This means that for some cases items are deleted from the folder they are stored in, but not in the All Mail folder - these can then not be deleted except via the web client with prejudice. Over that past couple of years this has resulted in accumulation of thousands or orphaned deleted mails.

dsommers commented 1 year ago

I can confirm @kortschak's finding. Delete a mail from from the "Trash" folder via IMAP ... and it is still present in "All Mail" on the web mail. It is however not found under "Trash" in the webmail. The deleted mail in "All Mails" has the "trash bin" icon attached to it.

andrzejsza commented 1 year ago

indeed this is know - it's down to Bridge not supporting empty label. it was deliberate at the time to protect from potential data loss, but with V3 we will definitely revisit the behaviour (already reported here as https://github.com/ProtonMail/proton-bridge/issues/165). not related to the main topic.

kortschak commented 1 year ago

@andrzejsza Are support aware of this design intention? I have been repeatedly told by support that this observation is not possible.

andrzejsza commented 1 year ago

they are now... it's definitely not a standard behaviour, hence the confusion (also internal). thanks for introducing clarity here.

andrzejsza commented 1 year ago

just fyi: we've open-sourced v3 code today, you can find it here if you want an early preview. I will update you here once we have first beta ready.

kklem0 commented 1 year ago

just fyi: we've open-sourced v3 code today, you can find it here if you want an early preview. I will update you here once we have first beta ready.

I'm ready to try the preview :) I still see changes like changing internal ID format though, which is a bit funny when it should almost be ready, but it does seem to be in good shape.

andrzejsza commented 1 year ago

Feel free to test the recent alpha - available here as 3.0.4: https://github.com/ProtonMail/proton-bridge/releases Use with caution - we have quite a few open issues, mainly relating to Outlook and drafts behaviour. As for as message UIDs are concerned - no known issues so far. For any new bugs - please open a separate issue.

kklem0 commented 1 year ago

I'm now using 3.0.5, but I only got half of my emails in.

A lot of these in the log, they keep repeating so I stopped the bridge.

time="Dec  5 18:46:40.178" level=error msg="Get \"https://mail-api.proton.me/mail/v4/messages/[removed]\": context canceled, Attempt 1"
time="Dec  5 18:46:40.178" level=error msg="Get \"https://mail-api.proton.me/mail/v4/messages/[removed]\": context canceled, Attempt 1"
time="Dec  5 18:46:40.178" level=error msg="Get \"https://mail-api.proton.me/mail/v4/messages/[removed]\": context canceled, Attempt 1"
time="Dec  5 18:38:17.048" level=warning msg="Failed to sync user" error="failed to sync messages: failed to create IMAP update for message [removed]: failed to build message body and structure: mime: expected slash after first token" userID="[removed]"
time="Dec  5 18:38:17.048" level=error msg="Failed to sync, will retry later" error="failed to sync: failed to sync messages: failed to create IMAP update for message [removed]: failed to build message body and structure: mime: expected slash after first token" userID="[removed]"