cyrusimap / cyrus-imapd

Cyrus IMAP is an email, contacts and calendar server
http://cyrusimap.org

squat search_engine not used #2598

Closed MichaelMenge closed 3 years ago

MichaelMenge commented 5 years ago

TL;DR Searches in cyrus 3.0.8 are slower compared to 2.4.20 and there is no indication that cyrus.squat files are used.

I reported this issue on the info-cyrus mailing list on 17 Sep 2018 with the subject "squatter not used after upgrade to cyrus 3.0.8" and reported further test results on 25 Oct 2018.

We use cyrus-imapd 3.0.8 in a murder setup. (I don't think that "murder" is relevant in this case. All tests were carried out on the backend.)

We have configured "search_engine: squat" in the backend imapd.conf, "search_fuzzy_always:" is NOT enabled (I tested it, but it didn't improve the performance and showed wrong results)

"squatter -i" is run in the EVENT section of backend cyrus.conf and the logfiles show that squatter is indexing the mailboxes.

Searching for HEADERS or TEXT results in slow searches compared to cyrus-imapd 2.4.20. Running strace on the imapd process shows that the cyrus.squat files are not opened/mmapped and that no attempt is made to access them (no fstat either), while these files were accessed in 2.4.20.
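The strace check described above can be automated. The following is a minimal sketch, not part of the original report: it filters strace output for file-related syscalls that mention cyrus.squat. The syscall list is an assumption, and fstat/mmap lines only carry a path when strace is run with path decoding for file descriptors (for example `strace -y`). The sample lines and paths in the demo are made up for illustration.

```python
import re

# Syscalls that would show the index file being touched. The list is an
# assumption based on the calls named in the report (open, mmap, fstat).
SYSCALLS = ("open", "openat", "stat", "fstat", "lstat", "mmap")
PATTERN = re.compile(r"\b(%s)\(" % "|".join(SYSCALLS))


def squat_accesses(strace_lines, filename="cyrus.squat"):
    """Return the strace lines whose syscall mentions the given file name."""
    hits = []
    for line in strace_lines:
        if filename in line and PATTERN.search(line):
            hits.append(line.rstrip("\n"))
    return hits


# Made-up sample lines; a real run would read the attached strace files.
sample = [
    'openat(AT_FDCWD, "/tmp/spool/user/test/cyrus.index", O_RDWR) = 12',
    'openat(AT_FDCWD, "/tmp/spool/user/test/cyrus.squat", O_RDONLY) = 13',
    'fstat(13</tmp/spool/user/test/cyrus.squat>, {st_mode=S_IFREG|0600}) = 0',
]
for hit in squat_accesses(sample):
    print(hit)
```

An empty result for a 3.0.8 trace and a non-empty result for a 2.4.20 trace would match the behaviour reported here.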

Timing Tests on a Mailbox with 4321 Mails (~ 541 MB) from Strace (read "SE SELECT INBOX" to write "* BYE LOGOUT received")

SE SELECT INBOX
TS SEARCH TEXT "Test1"
or
HS SEARCH HEADER X-comment Unirundmail
L LOGOUT
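As an alternative to reading timings out of strace, the same two searches can be timed from the client side. This is a rough sketch using Python's standard imaplib. The host name, user and password in `main()` are placeholders, not values from this report, and client-side timing includes network latency that the strace figures do not.

```python
import imaplib
import time


def time_search(conn, *criteria):
    """Run one IMAP SEARCH on an already selected mailbox and time it.

    `conn` is expected to behave like imaplib.IMAP4 (only .search is used).
    Returns (elapsed_seconds, number_of_matching_messages).
    """
    start = time.perf_counter()
    status, data = conn.search(None, *criteria)
    elapsed = time.perf_counter() - start
    if status != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    matches = data[0].split() if data and data[0] else []
    return elapsed, len(matches)


def main():
    # Placeholder connection details; adjust before running against a server.
    conn = imaplib.IMAP4_SSL("imap.example.org")
    conn.login("testuser", "secret")
    conn.select("INBOX", readonly=True)
    searches = (("TEXT", '"Test1"'), ("HEADER", "X-comment", "Unirundmail"))
    for criteria in searches:
        elapsed, count = time_search(conn, *criteria)
        print(" ".join(criteria), "->", count, "matches in %.6fs" % elapsed)
    conn.logout()
```

Calling `main()` against a 2.4.20 and a 3.0.8 backend with the same mailbox should reproduce the kind of difference shown in the timing table.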

| Cyrus Version | TEXT | HEADER |
|---|---|---|
| 2.4.20 | 0.025194s | 0.016047s |
| 3.0.8 | 10.253365s | 1.453782s |
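To put the table in perspective, the slowdown factor is simply the 3.0.8 time divided by the 2.4.20 time. A short sketch of that arithmetic, using only the four figures reported above:

```python
# Timings in seconds, copied from the table above.
timings = {
    "TEXT": {"2.4.20": 0.025194, "3.0.8": 10.253365},
    "HEADER": {"2.4.20": 0.016047, "3.0.8": 1.453782},
}


def slowdown(old_seconds, new_seconds):
    """How many times slower the new measurement is than the old one."""
    return new_seconds / old_seconds


for kind, t in timings.items():
    factor = slowdown(t["2.4.20"], t["3.0.8"])
    print("%s search: about %.0f times slower in 3.0.8" % (kind, factor))
```

For this one mailbox that works out to a TEXT search on the order of 400 times slower and a HEADER search on the order of 90 times slower.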

search_text_3.0.8.strace.txt search_text_3.0.8.imtest.txt search_text_2.4.20.strace.txt search_text_2.4.20.imtest.txt search_header_3.0.8.strace.txt search_header_3.0.8.imtest.txt search_header_2.4.20.strace.txt search_header_2.4.20.imtest.txt

MichaelMenge commented 5 years ago

I tried "git bisect" to find the commit that did break the search, but unfortunately there are many commits I could not compile and had to skip.

I did fix some minor compile errors, but in many cases it felt like different branches with incomplete implementations were interlaced and there was no easy fix for me to apply.

Attached is the git bisect log so that someone (maybe with access to the unmerged branches) can continue.

git-bisect-squat-search-2598.txt

MichaelMenge commented 5 years ago

A mailing list mail from Sebastian Hagedorn in the "Xapian searches of the body of an email" thread pointed me in the direction that the conversations db might also be required for the squat search engine.

Enabling the conversationsdb in imapd.conf showed that the squat file was accessed, but in my first attempts not all messages were found.

I am not sure if squatter should work without conversationdb, but as conversationdb is disabled by default there should be at least a HINT in the imapd.conf manpage and the Upgrading documentation.

I will now test with different conversations_expire_days and rebuilding the conversationdb and squatter index to see if it will fix the problem with the messages that were not found.

Update: The squatter index is only used in header searches, but the text search is still slow. Rebuilding the conversationdb with the userid (and not with user/userid like I did the first time) did solve the problem with the messages that were not found.

elliefm commented 5 years ago

Was this fixed by the change to add GUID to conversationsdb even when CID is missing? Or was that change for something else?

MichaelMenge commented 5 years ago

@elliefm in cyrus 3.0.8 enabling and building the conversationsdb results in squatter being used again for header search but still not for body search. I don't know at which point GUID/CID was added/missing. Maybe the git-bisect log can help you with this question.

MichaelMenge commented 5 years ago

I did some more testing as discussed with Robert @rsto

Below are the strace and imtest outputs from text and body searches with conversation db (enabled/disabled) and search_fuzzy_always (enabled/disabled). As the search for "Test1" didn't find any mails in some of the test cases, I also included the search for "Test".

What I did find is that with search_fuzzy_always enabled far fewer mails were found. I would have expected the opposite, as fuzzy search should also include inexact matches. Regarding the question of squatter usage, the squatter files were accessed if search_fuzzy_always was enabled.

The discrepancy in the search results is concerning. I don't know if search_fuzzy_always should/can work with the squat search engine, or if the original search returned the correct results.

search_text_fuzzy_always_conversation-2.strace.3.0.8.txt search_text_fuzzy_always_conversation-2.imtest.3.0.8.txt search_text_fuzzy_always_conversation.strace.3.0.8.txt search_text_fuzzy_always_conversation.imtest.3.0.8.txt search_text_fuzzy_always-2.strace.3.0.8.txt search_text_fuzzy_always-2.imtest.3.0.8.txt search_text_fuzzy_always.strace.3.0.8.txt search_text_fuzzy_always.imtest.3.0.8.txt search_text_conversation-2.strace.3.0.8.txt search_text_conversation-2.imtest.3.0.8.txt search_text_conversation.strace.3.0.8.txt search_text_conversation.imtest.3.0.8.txt search_text-2.strace.3.0.8.txt search_text-2.imtest.3.0.8.txt search_body_fuzzy_always_conversation-2.strace.3.0.8.txt search_body_fuzzy_always_conversation-2.imtest.3.0.8.txt search_body_fuzzy_always_conversation.strace.3.0.8.txt search_body_fuzzy_always_conversation.imtest.3.0.8.txt search_body_fuzzy_always-2.strace.3.0.8.txt search_body_fuzzy_always-2.imtest.3.0.8.txt search_body_fuzzy_always.strace.3.0.8.txt search_body_fuzzy_always.imtest.3.0.8.txt search_body_conversation-2.strace.3.0.8.txt search_body_conversation-2.imtest.3.0.8.txt search_body_conversation.strace.3.0.8.txt search_body_conversation.imtest.3.0.8.txt search_body.strace.3.0.8.txt search_body.imtest.3.0.8.txt search_body-2.strace.3.0.8.txt search_body-2.imtest.3.0.8.txt

MichaelMenge commented 4 years ago

Hi. Any progress on this? Is there anything I can do to help any further?

pendragonsound commented 4 years ago

I also ran into this unexpectedly after upgrading 2.5.7 to 3.0.13. It was a nasty shock because we have a number of power users (not me) who regularly search multiple mailbox folders with 100K to 2M messages in each. What used to take a blink of the eye now gets timed out after minutes of wheel spinning. As mentioned before, there is nothing in the release notes or upgrade advice that identifies this problem, and it's not easy/safe to back out of this kind of upgrade.

Before going live I had re-implemented the squatter -s option, whose disappearance was killing our daily squat updates in off-line testing (I was the original author forever ago). But that of course does nothing to solve body searches with squat. Because I had kept squat for searching and I wasn't looking for this kind of problem, it slipped by in my off-line testing, as it's hard in a small server world to fully duplicate a full operational environment. It didn't take but a few days for user complaints to arrive after flipping the switch.

A couple of days ago I started tracing search requests through the new search API, and that wasn't a pleasant experience. After compiling a list of bugs and holes over a four hour period, it became clear that I don't have enough time to fix this, although it's possible this is simply because of fundamental misunderstandings on my part. My overall assessment is this is a work in progress, and on the more difficult side to complete.

I had also originally wondered whether it would be possible to shortstop squat requests through the new API to the original squat code, as the latter is essentially unchanged in 3.0.13. That would be a hack, but at least have some elegance. But while this appeared easier than simply fixing/redesigning the new API, given the limited amount of time I have, it looked to me to be too risky for the trickery required.

So instead yesterday I sat down to backstitch a parallel universe into 3.0.13. Anything that used to work with the old code (for the most part the IMAP search, sort, and thread commands), I wired the old API directly to the squat code. Any newer features I left with the new API. This was somewhat a nuisance because there are a lot of changed and incompatible structures, many with the same name, and there are new features, particularly caching and character set operations, where I had to update the old code to work within the 3.0.13 context. It's a lot of changes, but when I fired it up early this morning, it worked. In addition to making searches scream again, it also makes message sorting and threading for folder displays run like they did before. I can't emphasize enough how drastic this regression was.

I wasn't expecting this to work so quickly, and I'm running catch-up on high priorities that got shoved aside, so I'm not sure when I'll have time to package/post my changes for sites in a similar pickle. I would really hope that this gets fixed in the main codeline so I don't have to resort to hacking this every time we do a system upgrade.

rsto commented 4 years ago

I'm sorry you both ran into trouble with the new search code. This issue is still in our backlog. @pendragonsound Could you please send me your patch? It doesn't have to be cleaned up or ready as a merge request - a scaffold would be enough for me. If you don't want to share it in public, please send it to rsto@fastmailteam.com. Thanks.

pendragonsound commented 4 years ago

@rsto When I can find the time I'll be happy to assemble a patch, however it is only a workaround and a kludge at best. When Xapian was wedged into Cyrus, a lot of the functionality of the Squat path was removed under the guise of a new API. For Squat this forces the vast majority of search cases to fall back to the default, which is simply a brute force linear search through all raw messages in a mailbox. Not surprisingly this is extremely slow for large mailboxes. All I did in my hack/patch was adapt the old API (Squat only) to work in the new Cyrus, as it did before in 2.5.*. The side effect is this locks out Xapian, because there is no place to patch it into the old API. Thus the patch will be of zero use to you in fixing the new API. It is only of value to Squat users who need to maintain a workable system, while the issues with the new API are addressed.
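To illustrate why falling back to a linear scan hurts so much on large mailboxes, here is a toy comparison between brute-force scanning and a lookup through a substring index. It is only a sketch of the general idea and does not reproduce the real squat file format or the Cyrus search API. The 4-character substring length and all function names are assumptions chosen for the example.

```python
from collections import defaultdict

N = 4  # substring length used by this toy index (an assumption)


def linear_search(messages, needle):
    """Brute force: scan the full text of every message."""
    return sorted(uid for uid, text in messages.items() if needle in text)


def build_index(messages):
    """Map every N-character substring to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        for i in range(len(text) - N + 1):
            index[text[i:i + N]].add(uid)
    return index


def indexed_search(messages, index, needle):
    """Use the index to shortlist candidates, then verify only those."""
    if len(needle) < N:
        return linear_search(messages, needle)  # too short for the index
    grams = [needle[i:i + N] for i in range(len(needle) - N + 1)]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(uid for uid in candidates if needle in messages[uid])


# Made-up messages keyed by UID.
messages = {
    1: "Subject: Test1 results",
    2: "nothing to see here",
    3: "another Test1 mail",
}
index = build_index(messages)
print(indexed_search(messages, index, "Test1"))
```

Both functions return the same UIDs, but the indexed variant only reads the messages that the index shortlists, whereas the linear variant reads every message on every search.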

This begs the question of why I simply didn't convert to Xapian, because of its favored status with Fastmail. Early in our upgrade to 3.0.13 I set up a sandbox with a cloned copy of a large mailbox running 3.0.13/Xapian and compared it to 2.5.7/Squat (our operational system at the time). Xapian was far more powerful in terms of search complexity/specificity, but was unbearably slow compared to Squat for simple searches in large mailboxes. Given that a vocal portion of our users searches multiple large mailboxes simultaneously, Xapian was not a workable solution. But this was before I discovered that Squat is also unusable out of the box in 3.0.13.

I can't say whether the new search API was optimized to make Xapian work well at the expense of Squat, or whether the new API is riddled with holes/bugs such that both Xapian and Squat performances are penalized. When I was tracing search requests through 3.0.13 in order to try to fix it, I kept stumbling into code that appeared to be more a stub than an implementation. It wasn't broken because the plodding linear search would always kick in, but it wasn't really viable. I didn't evaluate whether this only affected Squat, because I wasn't tracing through the Xapian side. Regardless I concluded I wasn't in the mood to redesign the new API or finish its implementation, so I didn't bother documenting even a partial list of internal code problems - sorry.

MichaelMenge commented 4 years ago

Hi,

Any progress on this? @pendragonsound I am also interested in the patch.

pendragonsound commented 4 years ago

This is my patch against cyrus-imapd-3.0.13 to restore squat search functionality that was removed some time ago from the mainline code, presumably to support Xapian search with the new search API. As described in past comments, our user profile spans a number of power users with gigantic mailboxes who often perform global searches against their entire corpus. We asked a handful to try an experimental build of cyrus-imapd with Xapian - the feedback was universally negative when compared to squat, even when we pointed out the increased feature set that Xapian offers. Without this patch current cyrus-imapd squat searches theoretically still work, but take orders of magnitude more wall clock time for anything but pathologically simple searches. This is because the moronic new search API is hardwired to only support Xapian. Squat searches are reduced to linear scans of raw messages and make little or no use of the squat databases. The net result is email clients time out on their searches and users lose any ability to search medium to large mailboxes.

The correct solution to this problem would be to develop a well-abstracted search API that is not tied specifically to Xapian. I don't think this would take me more than a week or two, but I would have to rip out the new API and start over from scratch. Given that it is unlikely the powers controlling the baseline could stomach this and I have better uses for my time, I instead decided to spend a few hours hacking in the old API into a then-current cyrus-imapd baseline, somewhat alongside the new API. In the process I ignored any damage to the new API that would prevent my hack from supporting Xapian. Thus this patch is only useful to sites running squat, and is probably not a viable starting point to ultimately sort out the search API debacle.

We have been running this patch unaltered since early May 2020. I've noticed a few cases where our webmail software (Horde) aborted during a search, but other than a quick look I haven't had time to drill down and diagnose those. No users have complained, however. We did subsequently contact all the power users we know and they gave us a full thumbs-up. Nevertheless, because I only spent a few hours developing the patch, I would hardly be surprised to find bugs in it. Use it entirely at your own risk. I am in no way committing to provide any support for it.

squat-search-patch.txt

shodanshok commented 3 years ago

So the current status is that squat search is broken for body/text, while one needs to enable conversation_db to let squat work on headers, right? Excluding Xapian, do we have any workaround (short of applying a custom-made patch) to restore fast text search with squat?

Thanks.

pendragonsound commented 3 years ago

shodanshok - you don't have many choices unless you want to fix or rewrite the new search API to work with squat. There are several potential implementation approaches for that, none of them trivial. It's just a pity the Fastmail people have so taken over Cyrus development that they are only focused on adapting it exclusively for their own purposes. Bad luck for someone with strong reasons to run a non-Fastmail clone. My biggest concern is the developers are preoccupied with whatever new shiny objects appear, and have next to zero understanding/appreciation of what already existed in Cyrus. Thus rather than investing any time trying to comprehend existing code, they simply delete it and write something totally naive. I'm seeing this far too often with the current crops of software developers.

Forget conversation_db. That is a red herring with squat. You can either (1) switch to Xapian and lose the key advantages of squat, (2) continue suffering with slow squat linear searches, (3) apply my awful patch, or (4) wait for the Fastmail folks to fix the squat search path. I doubt (4) is a realistic option because the developers have been killing squat with a thousand cuts over the past few years, and it's pretty apparent there is no meaningful ability to go back and repair the search API in an intelligent fashion. My expectation is squat will simply be deprecated and removed after the developers achieve their self-fulfilling prophecy.

If I were younger and had more time, I would have long ago forked a Cyrus branch where I could rip out and redo any Fastmail self-serving code. I still may do that if things get worse, which is where the code appears to be heading.

shodanshok commented 3 years ago

Well, this is bad news. On RHEL/CentOS, cyrus-imapd seems to be compiled without xapian support, so I am basically out of luck regarding fast full-text search (note: the sphinx backend seems to be almost undocumented, and it is significantly more complex to use than squat or xapian).

I really hope squat support will be fixed in the coming releases.

Regards.

gbulfon commented 2 years ago

@pendragonsound I would like to share our experiences with cyrus imap and squatter, but I don't have any way to contact you via GitHub. I ran into almost the same trouble as you, and I'm in the middle of the upgrade to 3.4. For example, in our tests, 3.4 threading is much slower than it is on 2.5.17, and squatter searches will also include in the results any recent email not yet squattered, without it being filtered by the query. Maybe you have addressed these already?

pendragonsound commented 2 years ago

@gbulfon A year ago I backported rsto's patch from 2021/01/05 into 3.0.13 and re-implemented my squatter -s (skip unchanged mailboxes) patch from decades ago that had been ripped out by the fastmail folks. We're not seeing any problems with threading although I'm not sure how many of our people use that on large mailboxes. I can confirm we see the same 'recent email' filtering problem that you reported. However with the squatter -s patch we simply run squatter frequently as it only has to process the mailboxes changed since the previous update. This is ugly but until we decide how to proceed with cyrus, it works well enough. We'll probably stay on 3.0.13 for a long time as I don't want to deal with another cyrus-fastmail upgrade debacle. The less appealing option is to write file format converters to return to 2.5.7. That version worked much better for us than 3.0.13. I wish we had never upgraded. It would have been easier to selectively backport the few useful patches into 2.5.7 than follow the path we did.
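The "skip unchanged mailboxes" idea can be sketched as a simple timestamp comparison. This is a hypothetical illustration, not the actual squatter -s implementation: it assumes each mailbox directory holds a cyrus.index and a cyrus.squat file and treats a cyrus.index that is newer than cyrus.squat as a sign that the mailbox changed. The function name and the `slack_seconds` parameter are invented for the example.

```python
import os


def needs_reindex(mailbox_dir, slack_seconds=0):
    """Return True if the mailbox looks changed since its last squat run.

    Assumes (hypothetically) that the mailbox directory holds both
    cyrus.index and cyrus.squat, and that a cyrus.index newer than
    cyrus.squat means messages were added or removed.
    """
    index_path = os.path.join(mailbox_dir, "cyrus.index")
    squat_path = os.path.join(mailbox_dir, "cyrus.squat")
    if not os.path.exists(squat_path):
        return True  # never indexed, so it must be processed
    index_mtime = os.path.getmtime(index_path)
    squat_mtime = os.path.getmtime(squat_path)
    return index_mtime > squat_mtime + slack_seconds
```

With a check like this, a frequent squatter run only pays the indexing cost for mailboxes that received or lost mail since the previous run, which is the behaviour described above.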

gbulfon commented 2 years ago

@pendragonsound the main reason we're willing to move over to 3.4 is that squatter files are way smaller than in 2.5 (2GB may turn into 150MB!). Squattering big folders takes much less time. Also, 2.5 has some sort of bug where it may core dump in big folders because of a wrong reference pointer, but I never got any answer on this problem. Did you ever face it? Also, I really hope we may have real incremental squattering someday. It makes no sense that a big folder that has already been indexed needs to be reindexed from scratch upon receiving just one new message. Don't you think?

pendragonsound commented 2 years ago

@gbulfon I may have missed something, but I don't recall noticing any significant change in the size of our squat files between 2.5.X and 3.X. While we have a number of users with 20M+ messages in some of their mailboxes, we have an optimized file system that requires little time to perform a squatter update. We never had any problems on 2.5.X with large mailboxes causing squatter to abort. In fact we never had any problems with 2.5.X at all. Our experience on 3.X has been slightly less pleasant.

I just noticed the squatter -s option has been restored; we had submitted that hack back in 2002 and it works well enough for us. Several years ago I wrote code to index on the fly, as messages were added/deleted from mailboxes. After getting that to work well enough for benchmarking, I found it wasn't going to provide a significant enough performance improvement (at least with our configuration) to justify maintaining the code on our own.

gbulfon commented 2 years ago

@pendragonsound we now have mailboxes of over 70M mails, some of over 25GB of data, and these may take more than 30 minutes and sometimes core dump, leaving the old index in place forever and trying to reindex every time. We tested the same situation with 3.4 and we never had a core dump, with much smaller files and a much shorter run time.

What did you do to optimize the file system for squatter?

pendragonsound commented 2 years ago

@gbulfon You likely have much bigger challenges than we do. While a number of our users keep 70M+ emails, at the moment we rarely see much more than 20M in any given mailbox. I'm curious why the 3.4 squatter would make the index files dramatically smaller. I didn't notice any changes that would explain this, but I only took a quick look. Given that big emails typically contain binhex encoded attachments, a small improvement for that could make a huge difference.

We're running a very small domain with mostly power users. Email is one tiny piece of our puzzle. Our servers are built for our specialized internal needs, and this serendipitously helps our email setup. We configure substantial hardware RAID6 arrays of SSDs with Reiser file systems for email and often have a lot of RAM available for caching. Thus squatter runs complete in minutes. Years ago they took hours with a lot less email. Our users are also cooperative - we encourage them to keep giant mailboxes separate from their INBOXes.

gbulfon commented 2 years ago

@pendragonsound in fact, my suspicion is that squatter 2 is indexing all the base64 data of attachments, which is completely useless.
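If that suspicion holds, the size difference would come from what gets fed to the indexer. The sketch below shows, with Python's standard email package, how one could collect only header values and text/* parts of a message and leave out other attachments. It illustrates the idea only and says nothing about how either squatter version actually selects content. The function name is invented for the example.

```python
from email import message_from_bytes


def indexable_text(raw_message):
    """Collect header values and text/* bodies, skipping other attachments."""
    msg = message_from_bytes(raw_message)
    chunks = [str(value) for value in msg.values()]
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() != "text":
            continue  # e.g. base64-encoded images or archives
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with replacement.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)
```

Feeding only this text to an indexer would keep large binary attachments out of the index, at the cost of not being able to search inside them.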