optimize re-download on UIDVALIDITY changes

chris001 commented 9 years ago

@nicolas33

I found some critical analysis and review of the offlineimap algorithm and runtime performance, including charts and state diagrams. I thought you would like to see this feedback to evaluate which parts of the algorithm to improve or lightly refactor, and which need upgrading, a partial rewrite, or a full rewrite: http://blog.ezyang.com/2012/08/how-offlineimap-works/ http://blog.ezyang.com/2012/08/offlineimap-sucks/
It would also be very good to compare the algorithm with the other three popular IMAP sync applications, and to borrow methods from them wherever they do some task more accurately or more efficiently.

isync: http://isync.sourceforge.net/

mailsync: http://mailsync.sourceforge.net/

imapsync: http://imapsync.lamiral.info/

nicolas33 commented 9 years ago

Yes, @ezyang 's posts are well known.

The algorithm is not the reason why I intend a deep refactoring. The algo in OfflineIMAP is good. It's more that the implementation has not aged well and relies on libraries that are not Python 3 compatible.

isync, mailsync and imapsync are known tools from the community.

imapsync does not have a sync algo and doesn't support 2-way sync, so it's out of context.

isync and mailsync are good tools, though they both have far fewer advanced features than OfflineIMAP.

Thanks for the input, though. I appreciate it. :-)

chris001 commented 9 years ago

One algo that needs improving is the algo that determines message unique identity.

OfflineIMAP currently uses only the IMAP UID, which is not a dependable indicator.

The IMAP server can reset the UIDs at any time.

The best way to improve the quality of the results would be to implement a message unique identification algorithm that combines several optional pieces of data: the UID and, from the headers, Message-ID, References, and In-Reply-To. Store this data in a row in the SQLite database table.

That should be enough of an improvement to uniquely identify messages without having to download the entire message. This would reduce the amount of bandwidth consumed and speed up the sync.
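
As a rough illustration of the idea, here is a minimal sketch of such a row. The table name, column names, and the `record_identity` helper are hypothetical and are not OfflineIMAP's actual status-database layout; it only assumes the raw headers of a message have already been fetched.

```python
import sqlite3
from email.parser import HeaderParser

# Hypothetical schema: names are illustrative only, not OfflineIMAP's
# real status database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message_identity (
    folder       TEXT NOT NULL,
    uidvalidity  INTEGER NOT NULL,
    uid          INTEGER NOT NULL,
    message_id   TEXT,
    in_reply_to  TEXT,
    refs         TEXT,
    PRIMARY KEY (folder, uidvalidity, uid)
)
"""

def record_identity(db, folder, uidvalidity, uid, raw_headers):
    """Store the UID together with the identifying headers of one message."""
    headers = HeaderParser().parsestr(raw_headers)
    db.execute(
        "INSERT OR REPLACE INTO message_identity VALUES (?, ?, ?, ?, ?, ?)",
        (folder, uidvalidity, uid,
         headers.get("Message-ID"),     # None when the header is absent
         headers.get("In-Reply-To"),
         headers.get("References")),
    )

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
record_identity(db, "INBOX", 1, 42,
                "Message-ID: <a@example.org>\nIn-Reply-To: <b@example.org>\n\n")
```

Only the headers are parsed here, so nothing in this sketch requires downloading a message body.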

nicolas33 commented 9 years ago

On Thu, Apr 16, 2015 at 10:50:00AM -0700, Chris Coleman wrote:

> One algo that needs improving is the algo that determines message unique identity.
>
> Currently using only the IMAP UID, which is not a dependable indicator.

Actually, I haven't had any UID problem for years.

> The IMAP server can reset the UIDs at any time.

That's why IMAP provides UIDVALIDITY.

> The best way to improve the quality of the results would be to implement a message unique identification algorithm that combines several optional pieces of data: the UID and, from the headers, Message-ID, References, and In-Reply-To. Store this data in a row in the SQLite database table.
>
> That should be enough of an improvement to uniquely identify messages without having to download the entire message. This would reduce the amount of bandwidth consumed and speed up the sync.

This would increase both the bandwidth usage and the duration of a sync. That's why I disagree with you here. UIDs are just fine most of the time.

Nicolas Sebrecht

nicolas33 commented 9 years ago

On Fri, Apr 17, 2015 at 12:41:35AM +0200, Nicolas Sebrecht wrote:

> The best way to improve the quality of the results would be to implement a message unique identification algorithm that combines several optional pieces of data: the UID and, from the headers, Message-ID, References, and In-Reply-To. Store this data in a row in the SQLite database table.
>
> That should be enough of an improvement to uniquely identify messages without having to download the entire message. This would reduce the amount of bandwidth consumed and speed up the sync.

I think I'm getting why you think it could improve speed. I'm re-opening because it could be worth running some basic tests with time measurements to compare.

Something very simple could do it with raw IMAP requests:

SEARCH
SEARCH
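
A minimal sketch of how such a comparison might be timed with Python's `imaplib`, assuming the two requests are a search by UID and a search by Message-ID header (the exact criteria are an assumption). `time_search` and `compare` are hypothetical helper names; `conn` is expected to be a logged-in `imaplib` connection with a mailbox selected.

```python
import time

def time_search(conn, *criteria):
    """Run one UID SEARCH and return (elapsed_seconds, matching_uids)."""
    start = time.perf_counter()
    typ, data = conn.uid("SEARCH", None, *criteria)
    elapsed = time.perf_counter() - start
    if typ != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    return elapsed, data[0].split()

def compare(conn, uid, message_id):
    """Time a lookup by UID against a lookup by Message-ID header."""
    by_uid, _ = time_search(conn, "UID", str(uid))
    # The Message-ID is sent as a quoted string.
    by_msgid, _ = time_search(conn, "HEADER", "Message-ID", '"%s"' % message_id)
    return by_uid, by_msgid

# Usage sketch (host, credentials and Message-ID are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.example.org")
#   conn.login("user", "password")
#   conn.select("INBOX", readonly=True)
#   print(compare(conn, 1, "<some-id@example.org>"))
```

Both timings include the network round trip, so repeating each measurement several times would give more meaningful numbers.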

Would you mind setting up such speed tests?

Nicolas Sebrecht

chris001 commented 9 years ago

OK, I set up a speed test between my local machine and a remote VPS running dovecot (800 km away).

Time to read the first UID: 2400 ms (but the server was under load from other web services running on it).

Time to read subsequent UIDs: instantaneous. This is natural, because the file containing the table that relates each UID to its message filename is not enormous (1 KB to 50 MB) and is buffered in memory by the operating system.

Time to look up a message by UID: typically 5-10 ms plus round-trip network time.

Time to look up a message by Message-ID: a bit slower, 10-25 ms plus round-trip network time, depending on whether the information is already in the disk cache.

nicolas33 commented 9 years ago

Nice. With a Message-ID check on runs where UIDVALIDITY has changed, for a mailbox with, say, 1000 mails, this would mean:

SEARCH the new UIDs (no time overhead against NO Message-ID check)
SEARCH the Message-ID for each UID (+10-25 seconds IMAP side time overhead)
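
The 10-25 seconds figure is simply the measured 10-25 ms per Message-ID lookup multiplied by 1000 mails, ignoring network round trips. A tiny sketch of that arithmetic (the function name is made up for illustration):

```python
def message_id_overhead_seconds(n_mails, per_search_ms):
    """Server-side overhead of one Message-ID SEARCH per mail, in seconds."""
    return n_mails * per_search_ms / 1000.0

low = message_id_overhead_seconds(1000, 10)
high = message_id_overhead_seconds(1000, 25)
print(low, high)  # 10.0 25.0
```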

Of course, Message-ID alone is not enough to get a fully valid match against what we have locally: there can be more than one mail with the same Message-ID header, and mails with no Message-ID at all. Those would need a full re-download.
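
A minimal sketch of that matching rule, assuming we already hold a mapping of old UIDs to Message-IDs locally and have fetched the same mapping for the new UIDs. `plan_resync` and its data shapes are hypothetical, not existing OfflineIMAP code.

```python
from collections import Counter

def plan_resync(local, remote):
    """Decide which mails can be re-linked and which must be re-downloaded.

    local:  {old_uid: message_id or None}
    remote: {new_uid: message_id or None}
    Returns (remap, redownload): remap maps old UID -> new UID for
    unambiguous Message-ID matches; redownload lists the new UIDs that
    need a full fetch.
    """
    local_counts = Counter(m for m in local.values() if m)
    remote_counts = Counter(m for m in remote.values() if m)
    # Only Message-IDs that occur exactly once locally are safe to match.
    local_by_id = {m: uid for uid, m in local.items()
                   if m and local_counts[m] == 1}
    remap, redownload = {}, []
    for new_uid, mid in sorted(remote.items()):
        if mid and remote_counts[mid] == 1 and mid in local_by_id:
            remap[local_by_id[mid]] = new_uid
        else:
            # Duplicate, missing, or unknown Message-ID: fall back.
            redownload.append(new_uid)
    return remap, redownload

local = {1: "<a@x>", 2: "<b@x>", 3: "<b@x>", 4: None}
remote = {101: "<a@x>", 102: "<b@x>", 103: "<b@x>", 104: None}
remap, redownload = plan_resync(local, remote)
print(remap, redownload)  # {1: 101} [102, 103, 104]
```

In this example only the mail with a unique Message-ID is re-linked; the two duplicates and the mail without a Message-ID fall back to a full re-download.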

If we get, say, a 90% match rate, we save 900 re-downloads. This sounds quite good. There would be some local time overhead to run all the checks against the local maildir, but it would still be worthwhile.

This is a good optimization for UIDVALIDITY changes.

OfflineIMAP / offlineimap

optimize re-download on UIDVALIDITY changes #190