filiphanes / fts-elastic

ElasticSearch FTS implementation for the Dovecot mail server

fts-elastic

fts-elastic is a Dovecot full-text search indexing plugin that uses ElasticSearch as a backend.

Dovecot communicates with Elasticsearch over HTTP using JSON queries. The plugin supports automatic indexing and searching of e-mail. For mailboxes with more than 10,000 messages it uses the Elasticsearch scroll API.
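The scroll-based paging mentioned above can be sketched roughly as follows. This is a Python illustration, not the plugin's C code: `post` is a hypothetical callable standing in for an HTTP POST that returns the decoded JSON response, and the page size and `1m` keep-alive are assumed values. The `_search?scroll=` and `_search/scroll` endpoints and the `_scroll_id` field belong to the Elasticsearch scroll API.

```python
from typing import Callable, Dict, Iterator

# `post(path, body)` is a hypothetical transport: it sends a JSON body to the
# given Elasticsearch path and returns the decoded JSON response.
Post = Callable[[str, Dict], Dict]


def scroll_hits(post: Post, index: str, query: Dict,
                page_size: int = 10000, keep_alive: str = "1m") -> Iterator[Dict]:
    """Yield every hit for `query`, fetching one page per request."""
    # The first request opens the scroll context and returns the first page.
    body = {"query": query, "size": page_size}
    resp = post(f"/{index}/_search?scroll={keep_alive}", body)
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break  # an empty page means the result set is exhausted
        yield from hits
        # Later requests pass the scroll id back to fetch the next page.
        resp = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})
```

A real client would also handle HTTP errors and release the scroll context when it is done; both are left out to keep the paging loop visible.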

Requirements

Compiling

This plugin must be compiled against the Dovecot source for the version you intend to run it on. A dovecot-devel package is unfortunately insufficient, as it does not include the required fts API header files.

You can provide the path to your source tree by passing --with-dovecot= to ./configure.

Install dependencies

# sudo apt install dovecot
sudo apt install gcc make libjson-c-dev dovecot-dev

An example build may look like:

./autogen.sh
./configure --with-dovecot=/usr/lib/dovecot/
make
make install
sudo ln -s /usr/lib/dovecot/lib21_fts_elastic_plugin.so /usr/lib/dovecot/modules/lib21_fts_elastic_plugin.so

Configuration

Create /etc/dovecot/conf.d/90-fts.conf with content:

mail_plugins = $mail_plugins fts fts_elastic

plugin {
  fts = elastic
  fts_elastic = debug url=http://localhost:9200/m/ bulk_size=5000000 refresh=fts rawlog_dir=/var/log/fts-elastic/

  # no: new emails are indexed when the user runs a search
  # yes: every email is indexed when it is delivered
  fts_autoindex = no
  fts_autoindex_exclude = \Junk
  fts_autoindex_exclude2 = \Trash
}

and (re)start dovecot:

dovecot stop; dovecot
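The fts_elastic value in the configuration above is a space-separated list of bare flags (debug) and key=value pairs. As a rough aid for reading such a line, here is a small Python sketch that splits it into a dictionary. The function name `parse_fts_elastic` is hypothetical, and this is not how the plugin itself parses the setting; it only illustrates the shape of the value.

```python
from typing import Dict, Union


def parse_fts_elastic(value: str) -> Dict[str, Union[str, bool]]:
    """Split a space-separated settings string into flags and key=value pairs."""
    settings: Dict[str, Union[str, bool]] = {}
    for token in value.split():
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, val = token.partition("=")
        # A token without "=" is treated as a boolean flag, e.g. "debug".
        settings[key] = val if sep else True
    return settings


example = ("debug url=http://localhost:9200/m/ bulk_size=5000000 "
           "refresh=fts rawlog_dir=/var/log/fts-elastic/")
print(parse_fts_elastic(example))
```

All values are kept as strings here; converting bulk_size to an integer is left to the caller.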

ElasticSearch index

This plugin stores all messages in one Elasticsearch index. You can use sharding to support large numbers of users. Because it uses a routing key, updates and searches access only one shard. The _id has the form "_id":"uid/mbox-guid/user@domain", for example: "_id":"3/f40efa2f8f44ad54424000006e8130ae/filip.hanes@example.com"
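To make the _id layout concrete, the following Python sketch builds and splits an identifier in the uid/mbox-guid/user@domain form described above. The helper names `make_doc_id` and `parse_doc_id` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class DocId(NamedTuple):
    uid: int
    box: str   # mailbox GUID
    user: str  # user@domain


def make_doc_id(uid: int, box: str, user: str) -> str:
    """Join the three parts in uid/mbox-guid/user@domain order."""
    return f"{uid}/{box}/{user}"


def parse_doc_id(doc_id: str) -> DocId:
    """Split an _id back into its uid, mailbox GUID and user parts."""
    # Split at most twice so any "/" inside the user part is preserved.
    uid, box, user = doc_id.split("/", 2)
    return DocId(int(uid), box, user)
```

For the example above, `parse_doc_id("3/f40efa2f8f44ad54424000006e8130ae/filip.hanes@example.com")` yields uid 3, the mailbox GUID, and the user address as separate fields.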

You can set up the index mapping on Elasticsearch 6.x with the command:

curl -X PUT "http://elasticIP:9200/m?pretty" -H 'Content-Type: application/json' -d "@elastic6-schema.json"

Elasticsearch 7.x uses a different date format parser, so you need to use a different schema:

curl -X PUT "http://elasticIP:9200/m?pretty" -H 'Content-Type: application/json' -d "@elastic7-schema.json"

The box and user fields need to be keyword fields, as you can see in the file elastic-schema.json. Our schema leaves _source enabled because we don't see much storage savings when _source is disabled, and the Elasticsearch documentation doesn't recommend disabling it either. This plugin doesn't use _source and explicitly disables it in the responses to its queries, but you can use it for better management of and insight into the indexed emails, or when you want to use Elasticsearch for purposes other than Dovecot FTS (analysis, spammer detection, ...). _source is also needed if you reindex within Elasticsearch.

You can reindex a user's mailbox at any time with these doveadm commands:

doveadm fts rescan -u user@example.com
doveadm index -u user@domain -q '*'

An example of pushed document:

{
  "user": "filip.hanes@example.com",
  "box": "f40efa2f8f44ad54424000006e8130ae",
  "uid": 3,
  "date": "Thu, 08 Jan 2015 00:20:05 +0000",
  "from": "josh <josh@localhost.localdomain>",
  "sender": "Filip Hanes",
  "to": "<filip.hanes@example.com>",
  "cc": "User <user@example.com>",
  "bcc": "\"Test User\" <test@example.com>",
  "subject": "Test #3",
  "message-id": "<20150107132005.07DA3140314@example.com>",
  "body": "This is the body of test #3.\n"
}
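A document like the one above could be sent to Elasticsearch through the _bulk API, which expects newline-delimited JSON: one action line followed by one document line per message. The Python sketch below shows that layout. It is an illustration only; the helper name `bulk_body` is hypothetical, and using the user name as the routing value is an assumption, not something confirmed by the plugin source.

```python
import json
from typing import Dict, Iterable


def bulk_body(docs: Iterable[Dict]) -> str:
    """Build a newline-delimited _bulk request body for the given documents."""
    lines = []
    for doc in docs:
        # Assumption: the _id follows the uid/mbox-guid/user@domain form and
        # the user name doubles as the routing key.
        action = {"index": {
            "_id": f"{doc['uid']}/{doc['box']}/{doc['user']}",
            "routing": doc["user"],
        }}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be POSTed to the index's _bulk endpoint with the Content-Type header set to application/x-ndjson.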

An example search:

curl -X POST "http://elasticIP:9200/m/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"user": "filip.hanes@example.com"}},
        {"term": {"box": "f40efa2f8f44ad54424000006e8130ae"}}
      ],
      "must": [
        {
          "multi_match": {
            "query": "test",
            "operator": "and",
            "fields": ["from","to","cc","bcc","sender","subject","body"]
          }
        }
      ]
    }
  },
  "size": 100
}
'
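The same request body can be assembled programmatically. The Python sketch below mirrors the query shown above; `build_search` and `DEFAULT_FIELDS` are hypothetical names used for illustration, and the queries the plugin generates may differ in detail.

```python
from typing import Dict, List, Optional

# Fields searched in the example query above.
DEFAULT_FIELDS = ["from", "to", "cc", "bcc", "sender", "subject", "body"]


def build_search(user: str, box: str, text: str,
                 fields: Optional[List[str]] = None, size: int = 100) -> Dict:
    """Return a search body restricted to one user and one mailbox."""
    return {
        "query": {
            "bool": {
                # Filters are exact keyword matches and do not affect scoring.
                "filter": [
                    {"term": {"user": user}},
                    {"term": {"box": box}},
                ],
                # With operator "and", all query terms are required.
                "must": [
                    {"multi_match": {
                        "query": text,
                        "operator": "and",
                        "fields": list(fields or DEFAULT_FIELDS),
                    }}
                ],
            }
        },
        "size": size,
    }
```

Serializing the result with `json.dumps` gives a body that can be passed to the curl command above in place of the inline JSON.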

TODO

Thanks

This plugin borrows heavily from Dovecot itself, particularly for the automatic detection of dovecot-config (see m4/dovecot.m4). The fts-solr and fts-squat plugins were also used as reference material for understanding the Dovecot FTS API. fts-lucene was used as a reference for implementing proper rescan.