MongoEngine / mongoengine

A Python Object-Document-Mapper for working with MongoDB
http://mongoengine.org
MIT License
4.23k stars 1.23k forks source link

Slow 'startswith'-queries due to '$options': 'u' being set (no index usage?) #2512

Open Hanserfaust opened 3 years ago

Hanserfaust commented 3 years ago

On Mongoengine 0.20.0 and mongodb 4.4.0

Im doing a fairly involved query setting up a couple of Qs (and/or-ing) , some of them using "startswith" operation that translates into the $regex with ^-anchor. According to doc we should be able to use the indexes in that case:

https://docs.mongodb.com/manual/reference/operator/query/regex/#mongodb-query-op.-regex

(Ie. q = Q(unit_fqn__startswith=fqn))

It was very slow fetching counts or results.

Using the explain() method I figured out the query built by mongoengine and tried executing it separately using the mongo shell. I found out that it was the '$options': 'u' (that is re.UNICODE) causing the slow query on the regex search.

EXAMPLE:

First query is very slow (many seconds), index not used (according to explain()), second is close to instant (collection size 750K documents):

1) (SLOW) db.employee.find( {'$and': [{'$or': [{'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Portugal˛', '$options': 'u'}}, {'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Filippinerna˛'}}]}, {'hidden': {'$eq': false}}]})

2) (FAST) db.employee.find( {'$and': [{'$or': [{'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Portugal˛'}}, {'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Filippinerna˛'}}]}, {'hidden': {'$eq': false}}]})

Not saying this is a bug, but this was unexpected behaviour as Im not doing anything out of the ordinary. I have not yet found out what is causing the option to be set (see comment below though). What am I missing?

Hanserfaust commented 3 years ago

As I cant see any implicit reference to $option i must assume it is done implictly.

Lead me to think about re.compile setting an implicit option re.UNICODE ?

https://docs.python.org/3/library/re.html#re.Pattern.flags

possibly here? https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/fields.py#L163

(The re.compile() expression is stored on the query until execution (in pymongo?) as far as I can see.)

Again, not saying this is wrong, but my indexes are not hit with that option being set.

Hanserfaust commented 3 years ago

"Solved" after a day of tracking this down, doing it like this enabled fast (indexed) query again for my case.

Replaced the re.compile() call to avoid having the re.UNICODE option implicitly set when executing the query. (The flag is set by module sre_parse method fix_flags(), by a call from re._compile() if you need to know).

Monkey-patch solution: Not the prettiest way to do it, but it proves my point. If tested/accepted by a maintainer, I could do a pull-reqest on my own time.

Define this in me_mp.py

import re
from mongoengine.queryset.transform import STRING_OPERATORS
from mongoengine.fields import BaseField

"""
    Patch to StringField that avoids using the re.compile() which could
    implicitly set the re.UNICODE (u) $options causing slow (non-indexed) regex-queries
    vs. mongoDB
"""

def prepare_query_value(self, op, value):
    if not isinstance(op, str):
        return value

    if op in STRING_OPERATORS:
        case_insensitive = op.startswith("i")
        op = op.lstrip("i")

        regex = r"%s"
        if op == "startswith":
            regex = r"^%s"
        elif op == "endswith":
            regex = r"%s$"
        elif op == "exact":
            regex = r"^%s$"

        value = {
            '$regex': regex % re.escape(value),
        }
        if case_insensitive:
            value['$options'] = 'i'

    return BaseField.prepare_query_value(self, op, value)

In your models.py, monkey-patch StringField mathod like this:


from me_mp import prepare_query_value
StringField.prepare_query_value = prepare_query_value
bagerard commented 3 years ago

Hi, thanks for reporting this (and investigating it), please open an MR, FYI this old MR seems to be related https://github.com/MongoEngine/mongoengine/pull/1692 but it was left unfixed so male sure to have a look at it (so we can also discard the old one afterward)

Hanserfaust commented 3 years ago

Hi. It is the same problem. As far as I can see (in 0.20.), re.compile() is only used in a few places (with the one I replaced possibly the only place negatively affecting index usage.). I'll open an MR.

abhi19gupta commented 2 years ago

Hi, I am facing the same issue. Was this fixed in any subsequent versions?

Hanserfaust commented 1 year ago

@bagerard Im leaving the project where I encountered this problem.

The below coded fixes this, based on master as of 2023-01-29, if anyone wants to contribute it.

Thanks for writing this fine piece of software.

In mongoengine.fields.StringField():

def prepare_query_value(self, op, value):
    if not isinstance(op, str):
        return value

    if op in STRING_OPERATORS:
        case_insensitive = op.startswith("i")
        op = op.lstrip("i")

        flags = re.IGNORECASE if case_insensitive else 0

        regex = r"%s"
        if op == "startswith":
            regex = r"^%s"
        elif op == "endswith":
            regex = r"%s$"
        elif op == "exact":
            regex = r"^%s$"
        elif op == "wholeword":
            regex = r"\b%s\b"
        elif op == "regex":
            regex = value

        if op == "regex":
            value = re.compile(regex, flags)
        else:
            """
                REMOVE IF PULL REQUEST ACCEPTED:

                Fix to StringField that avoids using the re.compile() which could
                implicitly set the re.UNICODE (u) $options causing slow (non-indexed)
                performance when using regex-based queries, like "startswith" etc.

                There is really no way of testing this other than observing that queries
                are slow (not hitting indexes) without this fix.
            """

            # escape unsafe characters which could lead to a re.error
            value = {
                '$regex': regex % re.escape(value),
            }
            if case_insensitive:
                value['$options'] = 'i'

    return super().prepare_query_value(op, value)