ProtonMail / proton-bridge

Proton Mail Bridge application
GNU General Public License v3.0

The new bridge is much, much slower for incremental mail fetching. #354

Open macdanny opened 1 year ago

macdanny commented 1 year ago

I recently received an update to the bridge (it must have happened automatically; I didn't take any action to upgrade).

I have just under 800 folders in ProtonMail, and I use mbsync and mu4e to retrieve and read my mail. I found out about the upgrade because mu4e stopped working because the bridge certificate had changed. I copied it and manually ran mbsync -a. I saw that every folder bumped the uidvalidity so I cancelled that and ran a different script I use to erase all my existing mail and my mu4e index and download everything from scratch.

The script that downloaded everything from scratch normally takes around 30 minutes, this time it took 20 minutes. Huzzah! That's a huge improvement, because the 30 minutes includes time backing up my old mail and rebuilding the mu index.

However, after the initial download, when I fetch my mail, mbsync takes much, much longer with the new bridge.

mbsync has a nice progress meter and it shows which folder it's on and the number of messages that are being processed for each one.

With the old bridge, for mailboxes that had no changes (most of them), the number denoting which folder it's processing spun around so fast it was basically a blur. It was going through ... I don't know ... 10 or 20 folders per second, maybe more. Only when it hit a folder where there was a change (a new message, change to existing message (flags probably), or deleted message) would the folder count stall for a bit while it figured out what it had to do. Then it would process the messages and proceed.

With the new bridge, mbsync stalls on every single folder. At best it's processing 2 or 3 ... maybe 4, max ... folders per second. This is when I just freshly downloaded all of my mail and there are zero changes in any folder. No new or deleted messages, no changes to any existing message.

Expected Behavior

It shouldn't take 5 times longer to scan my folders for changes than it did under the old bridge.

Current Behavior

See above.

Possible Solution

I don't know, I realize this is all very much in flux after reading a few of the recent issues submitted and I'm sure you all have a lot on your plates. I just hope it's not going to be like this forever because it's pretty awful for someone in my position.

Steps to Reproduce

See description.

Version Information

3.0.19

Context (Environment)

See description

Detailed Description

See description

Possible Implementation

Maintain a dirty bit on the server so folders with no changes can be skipped instantly?
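For illustration, one way the dirty-bit idea could work is sketched below in SQLite (via Python's `sqlite3`). This is purely hypothetical: the table names echo the gluon schema discussed later in this thread, but the `dirty` column and the trigger are illustrative assumptions, not Bridge code.

```python
# Hypothetical sketch of a per-mailbox dirty bit in SQLite.
# The `dirty` column and trigger are illustrative assumptions, not Bridge code.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE mailboxes_v2 (
        mailbox_id TEXT PRIMARY KEY,
        dirty      INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE messages_v2 (
        message_id TEXT PRIMARY KEY,
        mailbox_id TEXT NOT NULL REFERENCES mailboxes_v2
    );
    -- Mark the parent mailbox dirty whenever a message is added.
    -- (A full version would also cover deletes and flag changes.)
    CREATE TRIGGER mark_dirty AFTER INSERT ON messages_v2
    BEGIN
        UPDATE mailboxes_v2 SET dirty = 1
        WHERE mailbox_id = NEW.mailbox_id;
    END;

    INSERT INTO mailboxes_v2 (mailbox_id) VALUES ('inbox'), ('archive');
    INSERT INTO messages_v2 VALUES ('m1', 'inbox');
    """
)

# Only mailboxes with changes need a full check; the rest skip instantly.
dirty = [r[0] for r in con.execute(
    "SELECT mailbox_id FROM mailboxes_v2 WHERE dirty = 1"
)]
print(dirty)
```

With this, a folder whose dirty bit is clear could answer a change check without touching the message tables at all.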

LBeernaertProton commented 1 year ago

Hey @macdanny, lots of things changed between v2 and v3. I would love to find out what is causing these delays, but we are a bit busy with some other higher-priority items at the moment. It would be great if you could provide the following information so I can investigate this once my schedule clears up a bit.

macdanny commented 1 year ago

I have an average of 80 messages per folder, just under 800 folders.

It's invoking two commands per folder.

SELECT "{folder name}"
UID FETCH 1:{n} (UID FLAGS)

Where {n} is the number of messages in the folder.

I'm not sure I understand the last question, but it doesn't matter whether the bridge has been running for minutes or hours; the performance does not change. Since it only takes around 20 minutes to grab every single message off the server, I guess the account must be fully synced.

LBeernaertProton commented 1 year ago

Thank you for the investigation, I will try to find some time to check your scenario with some benchmarks.

The last question is to understand the performance behavior:

What's the hardware the bridge is currently running on? (CPU, RAM, Disk Type)?

macdanny commented 1 year ago

I am running the bridge now with --cli, it works but ... I don't see a progress bar in here either. It drew an ASCII art bridge and gave me a prompt. I tried logging in at the prompt, it said I am already logged in. I'm running a sync now and the bridge cli isn't giving me any other info.

My system has dual Xeon E5-2660 CPUs, 64 GiB of RAM, and two Samsung 850 EVO 500 GB SSDs running in RAID 0, but I use Qubes OS which divides my desktop into multiple Xen domUs. Bridge is running in its own domU. The Bridge domU only has 2 cores and 4 GiB of RAM. I could give it more cores, I have plenty. I don't think RAM is an issue judging by htop (the whole domU is using less than 500 MiB, even considering cache etc. there is almost 2 GiB completely unused). The SSDs are common to the entire machine so the Bridge domU is competing with all of them for disk IO but when I sync my mail there's nothing else really going on on the computer. There have been no changes to any of this recently and Bridge used to run a lot faster.

LBeernaertProton commented 1 year ago

@macdanny Thanks for all the feedback, it really helps to narrow down the issue. I'll look into this as soon as I have bandwidth to do so.

jknockaert commented 1 year ago

I also noticed high battery use by Bridge, which I think is caused by the same inefficient database implementation behind this issue. So I'm posting my observation here for now and will report back when it gets fixed.

jknockaert commented 1 year ago

I guess this is actually a Gluon issue and it is being fixed in GODT-2510?

LBeernaertProton commented 1 year ago

Yes. I'm currently in the process of addressing this, but as always I can't promise any ETAs on completion state.

jknockaert commented 1 year ago

I guess this should be fixed in 3.4.0, right? Will the existing local database be fixed after in-place upgrading to 3.4.0, or should we do a full reset?

LBeernaertProton commented 1 year ago

@jknockaert Yes, these changes are present in v3.4. The database will auto-migrate; on failure it will automatically resync your state. No user interaction required.

macdanny commented 1 year ago

I just received the upgrade to Bridge version 3.4.2. After reading the comments that this would be fixed in 3.4, I expected a significant performance improvement, but performance for incremental mail fetching doesn't appear to have changed at all.

Since the UIDVALIDITY changed I ran my script to wipe out my Mail directory and download everything from scratch. That operation might be marginally faster under this version, but it is still taking 142 seconds to just check for changes in each folder after the initial download. I have exactly 800 folders now, ~ 66000 messages between them.

jknockaert commented 1 year ago

Same observation here. The new version may be faster but it is not exactly fast. And with every point update the UIDVALIDITY changes and everything needs to be resynced on every computer. That's 25GB of data transfer for a point update. And 50 GB of disk IO, as the mail client will write another local copy of everything.

LBeernaertProton commented 1 year ago

Unfortunately, in v3.4 the UIDVALIDITY had to change due to the database migration. We'll keep investigating on our end to see what else we can improve for this scenario.

jknockaert commented 8 months ago

OK, so I worked a bit on this one and I think I diagnosed the issue. I added another index on the local gluon database in sqlite3:

create index test_index3 on message_flags_v2 (message_id);

after which the Bridge becomes much more efficient (at least in my setup). I guess the index is used in a left join on a couple of large tables (I have 400k messages).

So where should this have been accounted for in the gluon code? My best guess is that the primary key for message_flags_v2 should have been defined as (message_id, value) rather than (value, message_id) as it is now, which probably happened here: https://github.com/ProtonMail/gluon/blob/dev/internal/db_impl/sqlite3/v1/migration.go#L295

In a similar fashion it may be more efficient to optimize the primary keys in mailbox_attrs_v2, mailbox_flags_v2, and mailbox_perm_flags_v2 by switching the column order (but these tables are much smaller, so that probably wouldn't deliver much of a performance improvement).
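The effect of this index can be reproduced in isolation. The sketch below (Python's stdlib `sqlite3`, with a simplified `message_flags_v2` shape mirroring the schema described in this thread, not gluon's actual code) uses EXPLAIN QUERY PLAN to show why a lookup by message_id alone cannot seek through a primary key declared as (value, message_id), and why the extra index turns the scan into a search.

```python
# Sketch: why PRIMARY KEY (value, message_id) cannot serve lookups by
# message_id alone, and why CREATE INDEX ... (message_id) fixes it.
# Simplified table shape; not gluon's actual migration code.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE message_flags_v2 (
           value      TEXT NOT NULL,
           message_id TEXT NOT NULL,
           PRIMARY KEY (value, message_id)
       )"""
)

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT value FROM message_flags_v2 WHERE message_id = 'abc'"

# Before: the PK index leads with `value`, so SQLite must SCAN it in full.
before = plan(query)

con.execute("CREATE INDEX test_index3 ON message_flags_v2 (message_id)")

# After: a direct SEARCH using the new index on message_id.
after = plan(query)

print(before)
print(after)
```

The same reasoning applies to the mailbox_* tables: a composite index only helps queries that constrain its leading column.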

LBeernaertProton commented 8 months ago

@jknockaert nice catch. We'll try to get this incorporated as soon as possible.

macdanny commented 8 months ago

I added the same index. It takes 53 seconds now to scan my 809 mailboxes for changes, down from 120 without the index.

It still doesn't seem as fast as before I got the version that caused me to open this issue, but it definitely makes a big difference.

@jknockaert Thank you for figuring this out.

jknockaert commented 8 months ago

@macdanny With 809 mailboxes you may be able to speed things up a bit more by adding indexes on mailbox_attrs_v2, mailbox_flags_v2, mailbox_perm_flags_v2:

create index test_index0 on mailbox_attrs_v2 (mailbox_id);
create index test_index1 on mailbox_flags_v2 (mailbox_id);
create index test_index2 on mailbox_perm_flags_v2 (mailbox_id);

macdanny commented 8 months ago

Thanks. I tried that, and I also tried re-ordering the columns in the primary keys, the other remedy you suggested. I couldn't get a further performance improvement that I could be confident was not noise.

This is the script I ran to re-order the primary key columns.

create table mailbox_attrs_v2_dg_tmp
(
    value      text not null,
    mailbox_id uuid not null
        constraint perm_attrs_flags_mailbox_id
            references mailboxes_v2
            on delete cascade,
    primary key (mailbox_id, value)
);

insert into mailbox_attrs_v2_dg_tmp(value, mailbox_id)
select value, mailbox_id
from mailbox_attrs_v2
order by mailbox_id, value
;

drop table mailbox_attrs_v2;

alter table mailbox_attrs_v2_dg_tmp
    rename to mailbox_attrs_v2;

create table mailbox_flags_v2_dg_tmp
(
    value      text not null,
    mailbox_id uuid not null
        constraint mailbox_flags_mailbox_id
            references mailboxes_v2
            on delete cascade,
    primary key (mailbox_id, value)
);

insert into mailbox_flags_v2_dg_tmp(value, mailbox_id)
select value, mailbox_id
from mailbox_flags_v2
order by mailbox_id, value
;

drop table mailbox_flags_v2;

alter table mailbox_flags_v2_dg_tmp
    rename to mailbox_flags_v2;

create table mailbox_perm_flags_v2_dg_tmp
(
    value      text not null,
    mailbox_id uuid not null
        constraint perm_mailbox_flags_mailbox_id
            references mailboxes_v2
            on delete cascade,
    primary key (mailbox_id, value)
);

insert into mailbox_perm_flags_v2_dg_tmp(value, mailbox_id)
select value, mailbox_id
from mailbox_perm_flags_v2
order by mailbox_id, value
;

drop table mailbox_perm_flags_v2;

alter table mailbox_perm_flags_v2_dg_tmp
    rename to mailbox_perm_flags_v2;

create table message_flags_v2_dg_tmp
(
    value      text not null,
    message_id uuid not null
        constraint message_flags_message_id
            references messages_v2
            on delete cascade,
    primary key (message_id, value)
);

insert into message_flags_v2_dg_tmp(value, message_id)
select value, message_id
from message_flags_v2
order by message_id, value
;

drop table message_flags_v2;

alter table message_flags_v2_dg_tmp
    rename to message_flags_v2;

vacuum;
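
That no further improvement showed up is consistent with the two remedies being redundant: once the primary key leads with the id column, its implicit index already serves the same lookups as the extra index. A small check of that claim (Python's stdlib `sqlite3`, with a simplified `mailbox_attrs_v2` shape, constraints omitted):

```python
# Sketch: with PRIMARY KEY (mailbox_id, value), a lookup by mailbox_id can
# seek through the PK's own implicit index, with no extra index needed.
# Simplified table shape; foreign-key constraint omitted.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE mailbox_attrs_v2 (
           value      TEXT NOT NULL,
           mailbox_id TEXT NOT NULL,
           PRIMARY KEY (mailbox_id, value)
       )"""
)

rows = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT value FROM mailbox_attrs_v2 WHERE mailbox_id = 'x'"
)
detail = " ".join(r[3] for r in rows)

# The WHERE clause matches the PK's leading column, so this is a SEARCH
# (an index seek), not a full SCAN.
print(detail)
```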

LBeernaertProton commented 8 months ago

An MR has been prepared to add the missing indices.

We should be able to ship this in one of the next releases, once the team returns from holidays.

Finally, I would recommend restoring the original db state when that change is deployed, to avoid running into issues with the database migration.

LBeernaertProton commented 7 months ago

The index changes have been deployed with the v3.9.0 release.

To anyone who has made the migrations above, it is advised to roll back to the previous db state to avoid potential migration conflicts. If you run into a migration conflict, you will need to reconfigure Bridge for that account.

jknockaert commented 7 months ago

@LBeernaertProton Thank you for the update. I plan on resetting the bridge in any case upon upgrade to v3.9.0.