$regex Anti-Pattern
If you are building a full-text search app with MongoDB and use a case-insensitive regex, stop and consider MongoDB Atlas Search instead. You are probably paying for the regex in many ways: over-consumption of resources, angry customers, and more. Spin up a cluster on Atlas and create a Search Index on MongoDB Atlas Search with the click of a button.
This project is a very simple fork of an existing project and blog post in which someone built a full-text search app in MongoDB using case-insensitive regex. Going forward, case-insensitive regex queries to power full-text search should be considered an anti-pattern. This repo is not an example of how to build a Flask app, and lots of the code could be improved. I read a blog post about a search app built using $regex and revised it to use Atlas Search. Hopefully this README makes it easy to see how you could immediately get value from a small refactor that moves case-insensitive regex to Atlas Search.
This revised repo is meant to demonstrate a few of the many benefits of moving most case-insensitive regex queries ($regex) in MongoDB Atlas to MongoDB Atlas Search, a Lucene-powered search engine built for the job. After a list of benefits, there's a tutorial below, along with some sample code in this repo. You can find the regex query code in the regex_version branch, and the Atlas Search code in the fts_version branch.
Here's a picture of an Atlas Search fuzzy match, which would be exceedingly difficult and expensive to set up using the case-insensitive regex query shape. Apparently, there are lots of bagels near the MongoDB HQ:
$search Compared to $regex:
Resource consumption - case-insensitive regex queries are expensive in any database engine. If you run them often, even on a modest dataset of around 50,000 documents, you will start to see performance hits on those queries and others. Atlas Search runs as a separate process in your replica set, mongot, so your workload can continue as usual without unnecessary disruption from a computationally expensive query shape.
Speed - case-insensitive queries hurt the user experience of your application because they can be very slow. Atlas Search is built on Apache Lucene and optimized for the text search use case in ways that a database cannot be.
Autocomplete - users have grown accustomed to autocomplete in the search box. While you could hack together autocomplete with case-insensitive regex queries, they would be slow, often inaccurate, and prohibitively expensive. I don't demonstrate that in this blog post.
Fuzzy matching - typos are frequent, especially on those tiny mobile keyboards. With Atlas Search, rest assured that if your user types a query with a typo, relevant results are still returned.
Diacritic folding - again, I am sure there is a way to do this with case-insensitive regex, but it cannot be as easy or predictable as it is in Atlas Search. There, you just need to set the boolean foldDiacritics option in your index definition.
Many search operators - perhaps the best feature you get from Atlas Search when compared to case-insensitive regex queries is the collection of search operators. With them, you can craft a search experience that truly captures your user's intent.
To go from $regex to $search, I recommend you set up the project using the instructions below, because getting set up can be challenging unless you are experienced with wrestling Python environments on Mac OS X.
Clone this repo, then cd into the project.
cd Flask_Tuts
Create and activate a virtual environment.
python3 -m venv mongo_atlas_regex_bad/
source mongo_atlas_regex_bad/bin/activate
Install the dependencies to run the app.
pip install -r requirements.txt
Check out the regex_version branch of the repo.
git checkout regex_version
Add your Atlas connection string to config.json.
{ "ATLAS_URI": "mongodb+srv://<db_user>:<db_password>@cluster0.xh91t.mongodb.net/?retryWrites=true&w=majority" }
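As a rough sketch of how an app could read that file, the helper below loads config.json and refuses to continue if the placeholders were never replaced. The function name `load_atlas_uri` and the placeholder check are assumptions for illustration; the repo's own loading code may differ.

```python
import json
from pathlib import Path


def load_atlas_uri(config_path="config.json"):
    """Read the ATLAS_URI value from a JSON config file (hypothetical helper)."""
    config = json.loads(Path(config_path).read_text())
    uri = config.get("ATLAS_URI", "")
    # Fail early if the key is missing or the template placeholders remain.
    if not uri or "<db_user>" in uri or "<db_password>" in uri:
        raise ValueError("Set ATLAS_URI in config.json to your real connection string")
    return uri
```

The returned string can then be passed to `pymongo.MongoClient(...)` when the app starts.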
Load the sample data into the cluster using Compass or the Mongo Shell.
Configure the Flask runtime environment and run the app.
FLASK_ENVIRONMENT=development
flask run
Enter kentucky in the name field, 10451 in the zip field, and 5 in the radius field, then click Submit.
This should work fine, but what about typo tolerance? Try the same query, except this time change the name input from kentucky to kentucke. Before clicking Submit, click the clear results button in the bottom right of the map. No results.
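The empty result is expected, because a regex either matches the literal characters or it does not. The sketch below imitates the case-insensitive `$regex` filter with Python's `re` module on a few made-up restaurant names. MongoDB uses a different regex engine (PCRE), but for plain literal patterns like these the behavior is the same.

```python
import re


def regex_name_match(query, name):
    """Imitate {"name": {"$regex": query, "$options": "i"}} for one value."""
    return re.search(query, name, re.IGNORECASE) is not None


# Hypothetical names, not taken from the sample dataset.
names = ["Kentucky Fried Chicken", "KENTUCKY BAGELS", "Joe's Pizza"]

correct = [n for n in names if regex_name_match("kentucky", n)]
typo = [n for n in names if regex_name_match("kentucke", n)]

print(correct)  # the two Kentucky names match regardless of case
print(typo)     # a single wrong letter matches nothing
```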
If you were to add one million more restaurants, the query would be too slow to be usable. Your cluster is hurting from queries like this, and the workload should really move to the MongoDB product that is designed for the job. To replace the search experience with Atlas Search, let's check out the fts_version branch of this repo:
git checkout fts_version
[helpful gif coming soon]
There are many variations of a search index definition that you could use, but here is one to start:
// index name: rest_fts_sample
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "document",
        "fields": {
          "coord": {
            "indexShapes": false,
            "type": "geo"
          }
        }
      },
      "name": {
        "analyzer": "lucene.standard",
        "type": "string"
      }
    }
  }
}
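If you would rather create this index from code than from the Atlas UI, something like the following sketch may work. It assumes PyMongo 4.5 or newer (which added `create_search_index`) and a cluster that allows programmatic search index management. The helper names are hypothetical and are not part of this repo.

```python
def fts_index_definition():
    """Return the rest_fts_sample definition shown above as a Python dict."""
    return {
        "mappings": {
            "dynamic": False,
            "fields": {
                "address": {
                    "type": "document",
                    "fields": {"coord": {"type": "geo", "indexShapes": False}},
                },
                "name": {"type": "string", "analyzer": "lucene.standard"},
            },
        }
    }


def create_fts_index(collection, name="rest_fts_sample"):
    """Create the search index on an Atlas-hosted collection.

    Assumes PyMongo 4.5+; the import is done lazily so the definition
    helper above can be used without PyMongo installed.
    """
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(definition=fts_index_definition(), name=name)
    return collection.create_search_index(model)
```

You would call `create_fts_index(db.restaurants)`, where `restaurants` is assumed to be the collection holding the sample data.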
Here is the autocomplete index definition used in the project:
// index name: rest_name_autocomplete_sample
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "document",
        "fields": {
          "coord": {
            "indexShapes": false,
            "type": "geo"
          }
        }
      },
      "name": [
        {
          "foldDiacritics": true,
          "maxGrams": 15,
          "minGrams": 2,
          "tokenization": "edgeGram",
          "type": "autocomplete"
        }
      ]
    }
  }
}
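To get a feel for what the autocomplete settings do, here is a simplified pure-Python illustration of diacritic folding and edge n-grams with the same minGrams and maxGrams values. It is only an approximation for a single word and is not how Lucene implements it. The `autocomplete_stage` helper is likewise a hypothetical sketch of a `$search` stage that would query this index.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining accent marks, e.g. 'Café' becomes 'Cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_grams(word, min_grams=2, max_grams=15):
    """Return the left-anchored prefixes that edgeGram tokenization would index."""
    token = fold_diacritics(word).lower()
    longest = min(len(token), max_grams)
    return [token[:n] for n in range(min_grams, longest + 1)]


def autocomplete_stage(prefix):
    """Build a $search stage that uses the autocomplete index on the name field."""
    return {
        "$search": {
            "index": "rest_name_autocomplete_sample",
            "autocomplete": {"query": prefix, "path": "name"},
        }
    }


print(edge_grams("Café"))
print(edge_grams("Kentucky"))
```

Because every prefix from two characters up is indexed ahead of time, a partial input such as `ke` can be matched with a lookup instead of a scan.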
db = pymongo.MongoClient("mongodb+srv://<username>:<password>@connection_string.mongodb.net/?retryWrites=true&w=majority").sample_restaurants
flask run
Enter kentucky in the name field, 10451 in the zip field, and 5 in the radius field, then click Submit. Now, the greater number of results isn't totally due to Atlas Search's superiority for the search use case, though one could argue that it is. If you want to be underwhelmed, git checkout regex_version and try it again.
Try the kentucke search again. This time, all the same results show up as in the previous, correctly spelled search. That's because of the fuzzy parameter for the text operator.
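The idea behind the fuzzy parameter can be sketched with a plain edit-distance function: a query term matches an indexed term if it can be turned into it with at most maxEdits single-character changes. The code below is a rough approximation that lowercases and splits on whitespace. Atlas Search does the real work inside Lucene with the configured analyzer, so treat this only as an illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_token_match(query, name, max_edits=2):
    """True if any whitespace-separated token of name is within max_edits of query."""
    q = query.lower()
    return any(edit_distance(q, token) <= max_edits for token in name.lower().split())


print(edit_distance("kentucke", "kentucky"))
print(fuzzy_token_match("kentucke", "Kentucky Fried Chicken"))
```

Here kentucke is one substitution away from kentucky, which is well inside the maxEdits value of 2 used in the query.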
For reference, here are the two very similar though not identical queries. The Atlas Search query, listed first, is the clear winner in terms of performance, customizability, and user experience.
MongoDB Atlas Search: Typical GeoJSON Search Query
{
  "$search": {
    "index": "rest_fts_sample",
    "compound": {
      "must": {
        "text": {
          "query": restname,
          "path": "name",
          "fuzzy": {
            "maxEdits": 2
          }
        }
      },
      "should": {
        "near": {
          "origin": {
            "type": "Point",
            "coordinates": [ lat, lon ]
          },
          "pivot": int(rad) * METERS_PER_MILE,
          "path": "address.coord"
        }
      }
    }
  }
}
Case-Insensitive Regex: Original GeoJSON Regex Query
{
  "address.coord": {
    "$nearSphere": {
      "$geometry": {
        "type": "Point",
        "coordinates": [ lon, lat ]
      },
      "$maxDistance": int(rad) * METERS_PER_MILE
    }
  },
  "name": {
    "$regex": restname,
    "$options": "i"
  }
}
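Both query shapes can be wrapped in small builder functions, sketched below. Several details are assumptions rather than the repo's code: the function names, the `limit` parameter, the value of `METERS_PER_MILE`, and the use of `re.escape` on the regex input. The sketch passes coordinates in GeoJSON order (longitude first) in both builders, so adjust if your variables are named differently. Note that the `should` clause only influences scoring by proximity and does not filter by distance, while `$maxDistance` in the regex version is a hard cutoff.

```python
import re

METERS_PER_MILE = 1609.34  # assumed constant; the app defines its own value


def build_search_pipeline(restname, lon, lat, rad, limit=50):
    """Aggregation pipeline for the Atlas Search version (hypothetical helper)."""
    search_stage = {
        "$search": {
            "index": "rest_fts_sample",
            "compound": {
                # Required: fuzzy text match on the restaurant name.
                "must": {
                    "text": {"query": restname, "path": "name", "fuzzy": {"maxEdits": 2}}
                },
                # Optional: boosts the score of restaurants near the origin.
                "should": {
                    "near": {
                        "origin": {"type": "Point", "coordinates": [lon, lat]},
                        "pivot": int(rad) * METERS_PER_MILE,
                        "path": "address.coord",
                    }
                },
            },
        }
    }
    return [search_stage, {"$limit": limit}]


def build_regex_filter(restname, lon, lat, rad):
    """Filter document for the case-insensitive regex version (hypothetical helper)."""
    return {
        "address.coord": {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": int(rad) * METERS_PER_MILE,
            }
        },
        # re.escape keeps user input from being interpreted as a pattern.
        "name": {"$regex": re.escape(restname), "$options": "i"},
    }
```

The first result would be passed to `collection.aggregate(...)` and the second to `collection.find(...)`; the `$nearSphere` version also needs a 2dsphere index on address.coord.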
There's a lot more room for customization and improvement in this example, but this is an introduction to how you could use MongoDB Atlas Search to replace the case-insensitive $regex anti-pattern. I hope you enjoy. If you have any improvements to this repo, or want to share an index that you built with Atlas Search, please feel free to open a PR. I want to incorporate as much feedback as possible so that the MongoDB database can continue to free developers from the constraints of squeezall.