Open Hanserfaust opened 3 years ago
As I cant see any implicit reference to $option i must assume it is done implictly.
Lead me to think about re.compile setting an implicit option re.UNICODE ?
https://docs.python.org/3/library/re.html#re.Pattern.flags
possibly here? https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/fields.py#L163
(The re.compile() expression is stored on the query until execution (in pymongo?) as far as I can see.)
Again, not saying this is wrong, but my indexes are not hit with that option being set.
"Solved" after a day of tracking this down, doing it like this enabled fast (indexed) query again for my case.
Replaced the re.compile() call to avoid having the re.UNICODE option implicitly set when executing the query. (The flag is set by module sre_parse method fix_flags(), by a call from re._compile() if you need to know).
Monkey-patch solution: Not the prettiest way to do it, but it proves my point. If tested/accepted by a maintainer, I could do a pull-reqest on my own time.
Define this in me_mp.py
import re
from mongoengine.queryset.transform import STRING_OPERATORS
from mongoengine.fields import BaseField
"""
Patch to StringField that avoids using the re.compile() which could
implicitly set the re.UNICODE (u) $options causing slow (non-indexed) regex-queries
vs. mongoDB
"""
def prepare_query_value(self, op, value):
if not isinstance(op, str):
return value
if op in STRING_OPERATORS:
case_insensitive = op.startswith("i")
op = op.lstrip("i")
regex = r"%s"
if op == "startswith":
regex = r"^%s"
elif op == "endswith":
regex = r"%s$"
elif op == "exact":
regex = r"^%s$"
value = {
'$regex': regex % re.escape(value),
}
if case_insensitive:
value['$options'] = 'i'
return BaseField.prepare_query_value(self, op, value)
In your models.py, monkey-patch StringField mathod like this:
from me_mp import prepare_query_value
StringField.prepare_query_value = prepare_query_value
Hi, thanks for reporting this (and investigating it), please open an MR, FYI this old MR seems to be related https://github.com/MongoEngine/mongoengine/pull/1692 but it was left unfixed so male sure to have a look at it (so we can also discard the old one afterward)
Hi. It is the same problem. As far as I can see (in 0.20.), re.compile() is only used in a few places (with the one I replaced possibly the only place negatively affecting index usage.). I'll open an MR.
Hi, I am facing the same issue. Was this fixed in any subsequent versions?
@bagerard Im leaving the project where I encountered this problem.
The below coded fixes this, based on master as of 2023-01-29, if anyone wants to contribute it.
Thanks for writing this fine piece of software.
In mongoengine.fields.StringField():
def prepare_query_value(self, op, value):
if not isinstance(op, str):
return value
if op in STRING_OPERATORS:
case_insensitive = op.startswith("i")
op = op.lstrip("i")
flags = re.IGNORECASE if case_insensitive else 0
regex = r"%s"
if op == "startswith":
regex = r"^%s"
elif op == "endswith":
regex = r"%s$"
elif op == "exact":
regex = r"^%s$"
elif op == "wholeword":
regex = r"\b%s\b"
elif op == "regex":
regex = value
if op == "regex":
value = re.compile(regex, flags)
else:
"""
REMOVE IF PULL REQUEST ACCEPTED:
Fix to StringField that avoids using the re.compile() which could
implicitly set the re.UNICODE (u) $options causing slow (non-indexed)
performance when using regex-based queries, like "startswith" etc.
There is really no way of testing this other than observing that queries
are slow (not hitting indexes) without this fix.
"""
# escape unsafe characters which could lead to a re.error
value = {
'$regex': regex % re.escape(value),
}
if case_insensitive:
value['$options'] = 'i'
return super().prepare_query_value(op, value)
On Mongoengine 0.20.0 and mongodb 4.4.0
Im doing a fairly involved query setting up a couple of Qs (and/or-ing) , some of them using "startswith" operation that translates into the $regex with ^-anchor. According to doc we should be able to use the indexes in that case:
https://docs.mongodb.com/manual/reference/operator/query/regex/#mongodb-query-op.-regex
(Ie. q = Q(unit_fqn__startswith=fqn))
It was very slow fetching counts or results.
Using the explain() method I figured out the query built by mongoengine and tried executing it separately using the mongo shell. I found out that it was the '$options': 'u' (that is re.UNICODE) causing the slow query on the regex search.
EXAMPLE:
First query is very slow (many seconds), index not used (according to explain()), second is close to instant (collection size 750K documents):
1) (SLOW) db.employee.find( {'$and': [{'$or': [{'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Portugal˛', '$options': 'u'}}, {'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Filippinerna˛'}}]}, {'hidden': {'$eq': false}}]})
2) (FAST) db.employee.find( {'$and': [{'$or': [{'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Portugal˛'}}, {'unit_fqn': {'$regex': '^Comapny\ Global\ \(TEST\ KP\)˛Filippinerna˛'}}]}, {'hidden': {'$eq': false}}]})
Not saying this is a bug, but this was unexpected behaviour as Im not doing anything out of the ordinary. I have not yet found out what is causing the option to be set (see comment below though). What am I missing?