ProtonMail / proton-bridge

Proton Mail Bridge application
GNU General Public License v3.0
1.1k stars 145 forks source link

Message UIDs are not stable / possible data loss (deletion of wrong messages) #220

Closed richardweinberger closed 1 year ago

richardweinberger commented 2 years ago

My colleges and myself facing wired issues with ProtonMail Bridge and various mail clients (getmail, Kmail, Apple Mail). Symptoms include:

All our mailboxes are large (50k to 100k mails) . This is maybe why we're facing the issue so easily.

Finally I found the proof that UIDs as presented by ProtonMail Bridge are not stable: As of now the bridge uses UID 51950 for the most recent message in my inbox.

a SELECT INBOX
* FLAGS (\Seen \SEEN \Flagged \FLAGGED \Deleted \DELETED \Draft \DRAFT $Junk Junk NonJunk)
* OK [PERMANENTFLAGS (\Seen \SEEN \Flagged \FLAGGED \Deleted \DELETED \Draft \DRAFT $Junk Junk NonJunk)] Flags permitted.
* 51950 EXISTS
* 0 RECENT
* OK [UIDNEXT 51952] Predicted next UID
* OK [UIDVALIDITY 4] UIDs valid
a OK [READ-WRITE] SELECT completed
a FETCH 51950 (FULL)
* 51950 FETCH (FLAGS (\Seen NonJunk) INTERNALDATE "29-Sep-2021 21:00:02 +0200" RFC822.SIZE 7129 ENVELOPE ("Wed, 29 Sep 2021 21:00:02 +0200" "Cron XXXX" (("(Cron Daemon)" NIL "root" "CENSORED")) (("(Cron Daemon)" NIL "root" "CENSORED")) NIL ((NIL NIL "root" "CENSORED")) NIL NIL NIL "<20210929191942.2F34CEBAD@CENSORED>") BODY ("text" "plain" ("charset" "utf-8") NIL NIL "quoted-printable" 4580 185))
a OK FETCH completed

As you can see UID 51950 is a mail from my cron daemon with date 29.09.2021 21:00. UIDVALIDITY is 4.

When I look at the getmail status file from my backup of 27.09.2021, I see the following line: 4/51950^@1632752979

The line decodes as "getmail downloaded mail with UID 51950 and UIDVALIDITY 4 on UNIX time 1632752979 (27.09.2021 - 16:29:39)"

This is the reason why I don't get the above mail delivered into my Maildir! getmail thinks it has already downloaded that message and skips it.

Expected Behavior

UID/UIDVALIDITY pairs should be stable such that mail clients have a chance to detect what mails are new.

Version Information

ProtonMail Bridge br-1.8.9 getmail 6.18.4

richardweinberger commented 2 years ago

One more insight form our side. Three of my colleges faced massive problems with their mail clients yesterday around 15:00 CEST.

For my point of view the instability of UIDs has something do to with the Protonmail backend. Maybe yesterday the backend changed something and caused the bridge to change UIDs. How do you generate UIDs?

iokill commented 2 years ago

I can confirm this issue w.r.t Apple Mail:

I've got a mail from 2021-01-26 here. When I mark it unread via the ProtonMail WebApp, Apple Mail logs this:

READ Sep 30 10:49:52.366 [kCFStreamSocketSecurityLevelTLSv1_2] -- host:127.0.0.1 -- port:1143 -- socket:0x600006a4f7e0 -- thread:0x60000da98fc0
* 5846 FETCH (FLAGS (NonJunk) UID 5846)

Going back to Apple Mail, it marked a completely different mail unread (same mailbox obviously, but dated 2020-11-20 and different subject)!

The reverse check confirms that UIDs are out of sync: When I mark the mail from 2021-01-26 (the same one I marked unread in the ProtonMail WebApp) as unread in Apple Mail, it logs this:

WROTE Sep 30 10:59:01.888 [kCFStreamSocketSecurityLevelTLSv1_2] -- host:127.0.0.1 -- port:1143 -- socket:0x600006a4f7e0 -- thread:0x60001138a6c0
277.79 UID STORE 6976 -FLAGS.SILENT (\Seen)

READ Sep 30 10:59:06.073 [kCFStreamSocketSecurityLevelTLSv1_2] -- host:127.0.0.1 -- port:1143 -- socket:0x600006a4f7e0 -- thread:0x60001138a6c0
277.79 OK UID STORE completed

Note the completely different UID!

What is more, I've got mails in the WebApp that are missing from the same mailbox in Apple Mail. Marking one of these as unread, marks some other mail as unread in Apple Mail.

The only way to fix this for now, is to tell Apple Mail to resync the whole mailbox via "Mailbox" -> "Rebuild". After this its okay, but I'm not sure for how long...

[Edit]: I'm using ProtonBridge br-1.8.9 and Apple Mail 14.0 (3654.120.0.1.13)

ericLemanissier commented 2 years ago

+1 for the emails disappearing from Outlook, but still visible in webmail yesterday afternoon CET. We have to archive them and unarchive them in order to make them visible again

bartbutler commented 2 years ago

Strange...literally nothing happened with either bridge or backend yesterday around that time.

richardweinberger commented 2 years ago

Strange...literally nothing happened with either bridge or backend yesterday around that time.

Well, what scares me more is that the bridge does not ensure stable UIDs. Which is a plain RFC violation.

bartbutler commented 2 years ago

It definitely should. We'll have to look into this.

cuthix commented 2 years ago

Hi @richardweinberger , maybe I don't understand something, you are saying but in your example you are not using UID but sequential ID.

a FETCH 51950 (FULL)

can you please try

a UID FETCH 51950 (FULL)

to confirm this is gives you correct result i.e. from 27.09.2021?

Of coarse this doesn't explains other symptoms you are reporting like

I don't get the above mail delivered into my Maildir

but it would help us to investigate.

Also if you can contact me on jcuth@proton.ch I would like to see more IMAP logs. Thanks

cuthix commented 2 years ago

Hi @iokill, thanks for investigating this, I would like to ask for more help. Can you please share more IMAP comunication? (e.g. you can sent me email to jcuth@proton.ch)

I noticed that threads of your logs are different: thread:0x60000da98fc0 vs thread:0x60001138a6c0 not sure if each command has different thread or each connection has different thread. If later it can be that you are different folder?

richardweinberger commented 2 years ago

Hello @cuthix ,

Hi @richardweinberger , maybe I don't understand something, you are saying but in your example you are not using UID but sequential ID.

I think you are right. My IMAP knowledge is a but rusty, I have mixed these ids in my report.

a FETCH 51950 (FULL)

can you please try

a UID FETCH 51950 (FULL)

to confirm this is gives you correct result i.e. from 27.09.2021?

No, this returns a mail from 29-Sep-2021 20:09:56 +0200. So not the mail from 27.09.2021.

Of coarse this doesn't explains other symptoms you are reporting like

I don't get the above mail delivered into my Maildir

I use getmail to fetch mails from IMAP into my Maildir. getmail has a simple text file where it stores the UID of each mail it has ever downloaded. Since 51950 was the UID of an mail at 27.09. it does not download new mail that has now UID 51950. Same applies to many other mails. As soon the bridge uses new UIDs getmail will download these messages.

but it would help us to investigate.

Also if you can contact me on jcuth@proton.ch I would like to see more IMAP logs. Thanks

Sure!

iokill commented 2 years ago

Hi @cuthix,

Hi @iokill, thanks for investigating this, I would like to ask for more help. Can you please share more IMAP comunication? (e.g. you can sent me email to jcuth@proton.ch)

Sure, will do!

I noticed that threads of your logs are different: thread:0x60000da98fc0 vs thread:0x60001138a6c0 not sure if each command has different thread or each connection has different thread. If later it can be that you are different folder?

It's definitely the same folder/mailbox. I assume the threads are different, because I did some things in Apple Mail inbetween these two tests. Notice there are ~10min between the two log segments. But I don't know the internals of Apple Mail, so might just be a thread pool that's used here...

richardweinberger commented 2 years ago

Did the debug data I've sent by mail help?

andrzejsza commented 2 years ago

@richardweinberger somewhat, thanks for all your help so far. we were a bit distracted by what getmail was doing, but the issue is still under investigation. sorry, can't give you and eta at the moment - we are revisiting big parts of our backend architecture right now and need to understand this issue fully. for internal reference - GODT-1373

richardweinberger commented 2 years ago

@andrzejsza Thanks for the update. Please note that I'm not the only one that was affected. @iokill is using Apple mail and @ericLemanissier Outlook.

richardweinberger commented 2 years ago

It's been another three weeks. Could you please give us a short update on your findings so far? Thanks!

richardweinberger commented 2 years ago

So, the problem happened again. Luckily I have patched[0] getmail to detect the problem early.

All of a sudden UID 66 is a different message. Before the event the message behind UID 66 had a length of 10886 bytes, now UID 66 leads to a message with 22467 bytes.

I noticed also that proton-bridge dropped the log file, just like last time I have logs long before the event and right after restarting proton-bridge. But no logs files while the event happened. Can it be that an automatic update of the bridge caused a database migration? Automatic updates are enabled on my side.

[0] https://github.com/richardweinberger/getmail6/commit/3abcceffa253a70e428368ec9e276b0b49058b37

JulienPalard commented 2 years ago

Can this be related to https://github.com/ProtonMail/proton-bridge/issues/204 that just happened to me again?

richardweinberger commented 2 years ago

Can this be related to #204 that just happened to me again?

Absolutely. The problem you describe is what we see too.

richardweinberger commented 2 years ago

So another week has passed without zero response here and via ProtonMail Support. Due to your lack of willingness to take this issues seriously I’ll have to cancel my business subscription and host my mail somewhere else. At the end of the day I need a stable and working mail solution for my company.

Don’t get me wrong, bugs do happen, also nasty user visible bugs. That’s okay. But letting paying customers in the dark without showing reasonable interest in getting a problem sorted out is not okay.

davidhamann commented 2 years ago

Sadly, I just have a "+1" to add to this. I've been dealing with these same issues for quite some time now (generally using latest Apple Mail and latest Bridge). Compared to the initial issue report my mailbox is relatively small with only around 5k emails. While I haven't tracked my problems in detail, I can say that I had to fully resync my mailbox multiple times in the last couple of months to resolve those issues.

andrzejsza commented 2 years ago

firstly - appologies for the delay, the comms here has not been good on our side. we've spent some time on creating more tests, namely:

the reason why we're not putting top prio on this at the moment is that we're doing a significant rewrite of Bridge's backed, including everyting that inovlves client communication. we have a strong level of certainty that this would resolve not only this, but also plenty of other issues. with the rewrite in progress, we don't have enough bandwidth to be investigating this further at this very moment. sorry again to be causing dissapointment. keeping the issue open until a reboust new solution is in place.

vmivanov commented 2 years ago

If the current Bridge isn't getting enough attention for this is there at least an ETA for the rewrite? Or is the new rewrite once again going to come out in "Proton time", i.e. anywhere between a few months to few years after announcement?

I've already lost messages because of #204, which appears to be a direct byproduct of this bug, thinking I had deleted the right one based on the subject. Luckily nothing critical but I do not think this should be acceptable for any mail provider, paid or not. I use IMAP far more extensively than the web interface and now I'm hesitant to delete any mail via IMAP. My confidence in Bridge is very eroded. How are customers supposed to take the new rewrite seriously if sufficient "bandwidth" isn't redirected to make this a top priority for investigation and resolution?

andrzejsza commented 2 years ago

the rewrite is currently taking place (started September last year) and we will have an independent library replacing go-imap in the upcoming weeks. then we will have to integrate it within Bridge, refactoring other parts of the code base, which is a task that due to its complexity is hard to estimate. we're hopeful to have a beta release early in Q2. this project is a top priority for us at this moment. @vmivanov #204 is clearly a Thunderbird issue. not to say that Bridge couldn't overcome it, but it's down to the fact that tbird UID fetches in a very peculiar way. if you are experiencing this or any other issues do contact our support - we would at least look into the logs and confirm that the problems you are facing are within the scope of said refactor.

richardweinberger commented 2 years ago

@andrzejsza Can you please elaborate a little more. Is go-imap broken? I don't fully understand how you are going do address this issue. To be clear, in the current state proton-bridge is dangerous to use. It will cause trouble. In best case it confuses just the mail client enough to disturb mail syncing. In worst case it eats your data because a mail client issues destructive command on the wrong message id.

cuthix commented 2 years ago

Is go-imap broken?

Long story short, not. Because UID is not part of go-imap. It must be developed by backend i.e. on bridge side.

Can we fix it on bridge side with go-imap?

Not that we are aware of.. If there would be some simple patch which we would be capable of better monitoring the IMAP session in bridge we would do it. But in this case it is much better to build IMAP library which takes care about all aspects of RFC3501 by design (e.g. the generation and handling of ID)

richardweinberger commented 2 years ago

Since a new/fixed proton-bridge will be available earliest in Q2, users should get warned about this issue, IMHO. I fear only very few users debug the problem far enough to find this discussion here on github, most will be fobbed off by Proton Support.

Tyler-2 commented 2 years ago

I opened an issue for the Evolution email client and had been troubleshooting that on and off.... I never imagined that Bridge would fail such a basic task as maintaining consistent message IDs.... So yeah, this seems like a deal breaker for Bridge. Paying member since 2016, Plus since 2018.

https://gitlab.gnome.org/GNOME/evolution/-/issues/1582

richardweinberger commented 2 years ago

https://protonmail.com/blog/new-protonmail-bridge/ talks about "Major stability and performance improvements", does it contain a fix for this issue?

andrzejsza commented 2 years ago

no, there is no 'fix' as the issue is still not clear. we know of very different behaviour of tbird, Evolution or Apple Mail. as mentioned above:

Tyler-2 commented 2 years ago

@andrzejsza Please check out the findings of @mcrha over here: https://gitlab.gnome.org/GNOME/evolution-data-server/-/issues/373#note_1371012

There's also some other potentially useful captures of broken Proton Bridge transactions.

I'm not particularly IMAP savvy or a developer by trade or hobby, but I can definitely reproduce this easily - as a silver lining I get to see very old emails sometimes when I click on recent emails, and sometimes they're good memories. :-P

richardweinberger commented 2 years ago

FWIW, this is my getmail config:

[retriever]
type = SimpleIMAPRetriever
server = 127.0.0.1
port = 1143
username = me
password = prinzessin
mailboxes = ("INBOX",)
use_peek = false

[destination]
type = Maildir
path = /some/Maildir

[options]
verbose = 1
read_all = false
delivered_to = false
received = false

I use getmail to fetch mails every few minutes. After days/weeks getmail's internal state will become inconsistent because the IMAP server scrambled message ids. With an inconsistent state getmail will no longer fetch new mail because it thinks it has already downloaded them. Please note that my mailbox is large, Inbox contains about 60k mails. Parallel to that I use ProtonMail via Web and Phone. Maybe this plays a role too...

mcrha commented 2 years ago

Please check out the findings of mcrha over here: https://gitlab.gnome.org/GNOME/evolution-data-server/-/issues/373#note_1371012

The problem found in the above comment is not about changed message UIDs. The ProtonMailBridge sometimes returns garbage instead of message headers, which breaks software. More in the referenced link.

I'd rather open a new bug for it, because it has no direct relation with the changed UIDs. Unless it's fixed in the latest version, because this had been tested on the one before the major changes release.

JulienPalard commented 2 years ago

I would happily read news from Protonmail about this issue.

richardweinberger commented 2 years ago

I don't care anymore. My company has moved away from ProtonMail in the meanwhile.

ericLemanissier commented 2 years ago

please don't close the issue. We still care !

richardweinberger commented 2 years ago

No worries. :-) But since ProtonMail folks haven't even confirmed this issue as bug, you see how much they care...

bartbutler commented 2 years ago

We care, but as we said, we can't reproduce it but we think it may be fixed with the bug refactor we are doing for Bridge 3.x.

andrzejsza commented 2 years ago

of course we care and sorry to hear you've left us Richard. as mentioned, the significant refactor is taking place right now - we hope to have something ready for testing in not a distant future. happy to include you guys in the earliest alpha, as we would really like to make sure this is working moving forward.

vmivanov commented 2 years ago

That's good to hear but I'm a bit surprised by the above responses. This is a critical component of the key service that PM provides. At this point in time Proton Bridge cannot be reliably used and puts data at risk.

Virtually all places I have worked at if such a critical bug were to be discovered we'd be diverting any and all development resources to fix it asap - whether that's identifying and patching or expediting the next release. "We can't reproduce it" wouldn't be an acceptable answer given that it clearly affects people. I'm equally surprised that there aren't more reports of this bug occurring.

Given Proton's track record of release cycles Bridge 3.x might not even see the light of day within the next 1-2y and that would be an optimistic estimate. Does that mean that paid subscribers will have to deal with software that puts their communication data at risk for that period?

andrzejsza commented 2 years ago

it is a critical component. that's why we've committed most of our internal resources into the rewrite mentioned above. yes, the issue has not been reported more outside of this channel but that doesn't mean much - it's still critical. to tone down the discussion a bit - this might have a potential of putting data at risk, but it does not as far as the above reports go. and sure - our track record of release cycle has no been great. it is getting better though and we'll be opensourcing our recent work as a separate library shortly.

vmivanov commented 2 years ago

I respectfully disagree that it doesn't seem to put data at risk. As I've mentioned above, I've lost messages due to email metadata being incorrectly mapped to an email and unintentionally deleting the wrong email.

I previously attributed it #204, though I'm not so sure now. While #204 may have been attributed here as being a "thunderbird bug" I've never encountered this (in the past or present) with any other non-Proton IMAP accounts that I have added. This is not limited to a single machine/instance of Thunderbird, and makes me think it's got something to do with IMAP over Bridge.

This bug seems to be the only Bridge-related one that could cause this due to the perceived instability of the message UIDs where the Thunderbird cache state at some random point starts to diverge from that of Bridge. It might be a coincidence, but rebuilding the index cache every time is not a viable solution and neither is deleting only via the web or mobile app.

Tyler-2 commented 2 years ago

I respectfully disagree that it doesn't seem to put data at risk. As I've mentioned above, I've lost messages due to email metadata being incorrectly mapped to an email and unintentionally deleting the wrong email.

Same. And being unable to read an older email I can see the headers for, without going into webmail. And, my god, what happens if you run a filter....

the Thunderbird cache state at some random point starts to diverge from that of Bridge

Just to point out / remind this happens in Evolution too, in case someone says "Oh so it's a Thunderbird problem".

andrzejsza commented 2 years ago

yes, this is definitely not only a client problem, but a combination of it with go-imap. that's why, knowing of a lot of other go-imap related issues, which a causing a range of difficulties since Bridge's inception, we made a choice of creating our own IMAP solution, instead of debugging the existing one. I hope you'll guys will bear with us and be satisfied with it when it arrives. we really cannot but higher priority on this, as it is our top priority. we'll update as soon as we have something to show.

jttttttttt commented 2 years ago

This is an issue for me too, unfortunately.

@andrzejsza Can you give a rough timeframe for the rewrite?

m4rkw commented 2 years ago

This is shocking. A bug that can cause data loss in a key component of an email platform, reported in November last year, and still not fixed. I've been relying on proton-bridge for mail backups but now reading this it seems there's a good chance it could not actually be doing what I thought it was doing.

Imagine if someone reported a critical bug like this to Microsoft and they said "very sorry, we'll just rewrite the entire thing and then it'll probably be fixed, get back to you in a year or so k". I'm now seriously questioning whether I should be using this service at all if this is how much they care about customers losing data.

andrzejsza commented 2 years ago

please keep this place for technical discussion, as is intended. we have no confirmed examples of data loss and are convinced that our current approach is the right one to solve the reported issue of unstable UIDs. we are always keen to investigate each individual case, for which we need logs and contact with our support team. we are making a significant progress with the new library - https://github.com/ProtonMail/gluon - and will start integration within Bridge shortly.

m4rkw commented 2 years ago

we have no confirmed examples of data loss

But you accept that message UIDs are not stable? Anyone using this for backups has a high likelihood of losing data. As has already been described by others, deleting messages via the bridge is not safe. It's pretty obvious that if the primary key of a data storage system cannot be relied on its going to put data at risk. The fact that this has not been investigated properly and fixed since it was reported is bad enough but to continue to offer the bridge and make no mention of this issue anywhere on your website is deeply troubling. I'm quite upset that I've been relying on this system for backups for quite some time with no knowledge of this issue until now. I think the least you could do is warn users not to use the bridge until this issue is fixed. Now that I know this is an issue I can take action to limit the damage but I'm very annoyed that it took someone else linking to this issue for me to even know it was a problem.

Surely you can accept that the possibility of losing email is about as severe a problem as there can be with an email system?

m4rkw commented 2 years ago

Ok I just ran an integrity check of my email backups that use the bridge. I run offlineimap to incrementally sync via the bridge and snapshot the local maildir once a day. To verify integrity I started syncing the entire account to a separate maildir and then wrote a simple python script to compare file sizes for each UID between the incrementally-synced maildir and the new one. I have something like 100k messages, it’s not finished yet but it’s pretty obviously screwed.

$ ./diff.py
checked 36841 messages, 6/36841 ok (99.983714% bad)

I will do deeper analysis later to determine how many messages have actually been excluded from the backups because of this issue but since the UIDs are apparently all scrambled I’m going to guess the answer is “several”.

The apparent lack of care over this issue and the complete negligent failure to notify users of proton-bridge that it’s unsafe has left me feeling very disillusioned about the service in general and I’ll probably start looking for alternatives.

bartbutler commented 2 years ago

There are a number of clients out there (Outlook comes to mind, at least some versions) which, rather than do the rational thing on move which would be copy then delete, do delete then copy such that if you have any sort of error you can end up in weird situations like a message only in All Mail because the delete worked and the copy didn't.

bartbutler commented 2 years ago

This is out of my expertise in terms of the exact mapping between IMAP commands and how that gets translated into Proton API calls but yes, it's plausible that that is the result of those commands. Fundamentally a lot of this comes down to attempting to handle the fact that in Proton one message can live in multiple folders/labels and IMAP has no such concept, messages in different IMAP folders are different messages. So a lot of time was spent trying to figure out how to ensure that people can't easily unintentionally delete messages when they want to remove it just from a particular folder. Anyway, there are almost certainly things we can do to improve it, some of which is already underway.

bartbutler commented 2 years ago

Pretty sure it has nothing to do with All Mail :). Yes, we are hiring. And in particular, have added resources to the Bridge team recently.