MarcusSorealheis / Atlas-Search-Python

This is a Python Flask and MongoDB Atlas Search tutorial. Let me know if you have any questions at @marcusforpeace on twitter.

The Common $regex Anti-Pattern

If you are building a full-text search app with MongoDB and use a case-insensitive regex, stop and consider MongoDB Atlas Search instead. You are probably paying for the regex in many ways: over-consumption of resources, angry customers, and more. Spin up a cluster on Atlas and create a Search Index on MongoDB Atlas Search with the click of a button.

About this Project

This project is a very simple fork of an existing project and blog post in which someone built a full-text search app in MongoDB using case-insensitive regex. Going forward, case-insensitive regex queries to power full-text search should be considered an anti-pattern. This repo is not an example of how to build a Flask app, and lots of the code could be improved. I read a blog post about a search app built using $regex and revised it to use Atlas Search. Hopefully this README makes it easy to see how you could immediately get value from a small refactor that moves case-insensitive regex to Atlas Search.

This revised repo is meant to demonstrate a few of the many benefits of moving most case-insensitive regex queries ($regex) in MongoDB Atlas to MongoDB Atlas Search, a Lucene-powered search engine built for the job. After a list of benefits, there's a tutorial below, along with some sample code in this repo. You can find the regex query code in the regex_version branch, and the Atlas Search code in the fts_version branch.

Here's a picture of an Atlas Search fuzzy match, which would be exceedingly difficult and expensive to set up using the case-insensitive regex query shape. Apparently, there are lots of bagels near the MongoDB HQ:

Image of Atlas Search Fuzzy Match

The Benefits of $search Compared to $regex:

Building the App

Pre-Requisites:

  1. Clone this repo, cd into the project.

    cd Flask_Tuts

  2. Create and activate a virtual environment.

    python3 -m venv mongo_atlas_regex_bad/

    source mongo_atlas_regex_bad/bin/activate

  3. Install the dependencies to run the app.

pip install -r requirements.txt

  4. Check out the regex_version branch of the repo.

git checkout regex_version

  5. Add your connection string to a file called config.json in the project root.

{ "ATLAS_URI": "mongodb+srv://<db_user>:<db_password>@cluster0.xh91t.mongodb.net/?retryWrites=true&w=majority" }

  6. Load the sample data into the cluster using Compass or the Mongo Shell.

  7. Configure the Flask runtime environment and run the app.

export FLASK_ENV=development

flask run

  8. Visit http://127.0.0.1:5000 in the browser and try out the search. Feel free to use your own query, but for a consistent comparison, try entering kentucky in the name field, 10451 in the zip field, and 5 in the radius field, then click Submit.

This should work fine, but what about typo tolerance? Try the same query, except this time change the name input from kentucky to kentucke. Before clicking Submit, click the clear results button in the bottom right of the map. No results.
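To see why the misspelled query comes back empty, here is a minimal sketch that uses Python's built-in re module as a stand-in for the server-side $regex match with the i option. The restaurant names are made up for illustration, and MongoDB evaluates $regex with its own Perl-compatible engine rather than Python's, though a plain literal pattern like this one behaves the same way in both.

```python
import re

# Hypothetical restaurant names, for illustration only.
NAMES = ["Kentucky Fried Chicken", "KENTUCKY CHICKEN & PIZZA", "Bagel Cafe"]


def regex_search(term, names):
    """Mimic {"name": {"$regex": term, "$options": "i"}} on a list of names."""
    pattern = re.compile(term, re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


print(regex_search("kentucky", NAMES))  # both Kentucky entries match
print(regex_search("kentucke", NAMES))  # one wrong letter, so nothing matches
```

A regex can only match the literal characters it is given, so a single typo is enough to lose every result.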

If you were to add one million more restaurants, the query would be too slow to be usable. Your clusters are hurting from this workload, and the search should really move to the MongoDB product that is designed for the job. To replace the search experience with Atlas Search, let's check out the fts_version branch of this repo:

git checkout fts_version

  9. Head to the restaurants collection in Atlas and create a search index with the button on the far right side of the screen.

[helpful gif coming soon]

An Example Search Index Definition

There are many variations of a search index definition that you could use, but here is one to start:

// index name: rest_fts_sample

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address":{
        "type": "document",
        "fields":{
            "coord":{
                "indexShapes": false,
                "type": "geo"
                }
        }
      },
      "name": {
        "analyzer": "lucene.standard",
        "type": "string"
      }
    }
  }
}

Here is the autocomplete index definition used in the project:


// index name: rest_name_autocomplete_sample

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address":{
        "type": "document",
        "fields":{
            "coord":{
                "indexShapes": false,
                "type": "geo"
                }
        }
      },
      "name": [
        {
          "foldDiacritics": true,
          "maxGrams": 15,
          "minGrams": 2,
          "tokenization": "edgeGram",
          "type": "autocomplete"
        }
      ]
    }
  }
}
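
An autocomplete index like the one above pairs with the autocomplete operator in $search. Here is a minimal sketch of building such a pipeline in Python. The function name, the limit, and the projection are hypothetical and not taken from the repo; the index name and path come from the definition above.

```python
def build_autocomplete_pipeline(prefix, limit=5):
    """Return an aggregation pipeline that suggests restaurant names for a prefix."""
    return [
        {
            "$search": {
                "index": "rest_name_autocomplete_sample",
                # Matches the edgeGram tokens stored for the "name" field.
                "autocomplete": {"query": prefix, "path": "name"},
            }
        },
        {"$limit": limit},
        {"$project": {"_id": 0, "name": 1}},
    ]


pipeline = build_autocomplete_pipeline("kent")
print(pipeline[0]["$search"]["autocomplete"])
```

With a live connection, the pipeline would be passed to the restaurants collection's aggregate method. Since minGrams is 2 in the index definition, a prefix should be at least two characters long to match.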
  10. Add the credentials from step 5, although the connection code will look slightly different, like this:

db = pymongo.MongoClient("mongodb+srv://<username>:<password>@connection_string.mongodb.net/?retryWrites=true&w=majority").sample_restaurants

  11. Run the app again.

flask run

  12. Again, visit http://127.0.0.1:5000 in the browser and try out the search. Feel free to use your own query, but for a consistent comparison, try entering kentucky in the name field, 10451 in the zip field, and 5 in the radius field, then click Submit. Now, the greater number of results isn't totally due to Atlas Search's superiority for the search use case, though one could argue that it is (see footnote 1). If you want to be underwhelmed, run git checkout regex_version and try it again.

Try the kentucke search again. This time, all the same results show up as in the previous, correctly spelled search. That's because of the fuzzy parameter for the text operator.
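
The fuzzy option matches terms that are within maxEdits single-character edits of the query term. Here is a minimal sketch of that idea using a plain Levenshtein distance. Atlas Search relies on Lucene's own implementation, so this only illustrates why kentucke still finds kentucky; it is not the engine's algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest inserts, deletes, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


MAX_EDITS = 2  # same value as "maxEdits" in the search query

print(edit_distance("kentucke", "kentucky"))               # 1
print(edit_distance("kentucke", "kentucky") <= MAX_EDITS)  # True
```

One substitution separates the two spellings, which is inside the allowed two edits, so the misspelled query still matches.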

For reference, here are the two very similar though not identical queries. The Atlas Search query, shown first, is the clear winner in terms of performance, customizability, and user experience.

MongoDB Atlas Search (typical GeoJSON search query):

    {
      "$search": {
        "index": "rest_fts_sample",
        "compound": {
          "must": {
            "text": {
              "query": restname,
              "path": "name",
              "fuzzy": { "maxEdits": 2 }
            }
          },
          "should": {
            "near": {
              "origin": {
                "type": "Point",
                "coordinates": [ lat, lon ]
              },
              "pivot": int(rad) * METERS_PER_MILE,
              "path": "address.coord"
            }
          }
        }
      }
    }

Case-insensitive regex (original GeoJSON regex query):

    {
      "address.coord": {
        "$nearSphere": {
          "$geometry": { "type": "Point", "coordinates": [ lon, lat ] },
          "$maxDistance": int(rad) * METERS_PER_MILE
        }
      },
      "name": { "$regex": restname, "$options": "i" }
    }
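
The two shapes are also run differently: the $search stage goes through an aggregation pipeline, while the regex version is an ordinary find filter. Here is a minimal sketch of both as plain Python builders. The function names are hypothetical, METERS_PER_MILE is an assumed value because the repo's constant is not shown here, and both builders take coordinates in GeoJSON order (longitude first), which is an assumption about what the variables in the queries above hold.

```python
METERS_PER_MILE = 1609.34  # assumed value; the repo defines its own constant


def build_search_pipeline(restname, lon, lat, rad):
    """Atlas Search version: fuzzy text match, with distance used only for scoring."""
    return [{
        "$search": {
            "index": "rest_fts_sample",
            "compound": {
                "must": {"text": {"query": restname, "path": "name",
                                  "fuzzy": {"maxEdits": 2}}},
                "should": {"near": {
                    "origin": {"type": "Point", "coordinates": [lon, lat]},
                    "pivot": int(rad) * METERS_PER_MILE,
                    "path": "address.coord",
                }},
            },
        }
    }]


def build_regex_filter(restname, lon, lat, rad):
    """Regex version: case-insensitive match, with distance as a hard cutoff."""
    return {
        "address.coord": {"$nearSphere": {
            "$geometry": {"type": "Point", "coordinates": [lon, lat]},
            "$maxDistance": int(rad) * METERS_PER_MILE,
        }},
        "name": {"$regex": restname, "$options": "i"},
    }


# Placeholder coordinates, for illustration only.
print(build_search_pipeline("kentucky", -73.92, 40.82, "5")[0]["$search"]["index"])
print(build_regex_filter("kentucky", -73.92, 40.82, "5")["name"])
```

With a live connection, the first result would be passed to the collection's aggregate method and the second to find.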


There's a lot more room for customization and improvement in this example, but this is an introduction to how you could use MongoDB Atlas Search to replace the case-insensitive $regex anti-pattern. I hope you enjoy it. If you have any improvements to this repo, or want to share an index that you built with Atlas Search, please feel free to open a PR. I want to incorporate as much feedback as possible so that the MongoDB database can continue to free developers from the constraints of squeezall.

Footnote

  1. We could make the compound operator filter out results that are farther away than the desired radius. The result set looks the way it does because location here is a score factor, not a filter factor. Most search engines use location in this manner to start, and offer users the option to filter out results more than a certain distance away from a point of interest.
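
One way to turn location into a filter, as the footnote suggests, is to add a filter clause with the geoWithin operator to the compound query. Here is a minimal sketch under stated assumptions: the helper name is hypothetical, the coordinates are placeholders in GeoJSON order (longitude first), and the repo itself does not include this filter.

```python
import copy


def add_radius_filter(search_stage, lon, lat, radius_meters):
    """Return a copy of a $search stage whose compound also filters by distance."""
    stage = copy.deepcopy(search_stage)
    stage["$search"]["compound"]["filter"] = {
        "geoWithin": {
            "circle": {
                "center": {"type": "Point", "coordinates": [lon, lat]},
                "radius": radius_meters,
            },
            "path": "address.coord",
        }
    }
    return stage


# Trimmed-down stage for illustration; the full query also has a "should" clause.
base_stage = {"$search": {"index": "rest_fts_sample", "compound": {
    "must": {"text": {"query": "kentucky", "path": "name"}}}}}

# Placeholder coordinates and a 5-mile radius expressed in meters (approximate).
filtered = add_radius_filter(base_stage, -73.92, 40.82, 5 * 1609.34)
print(sorted(filtered["$search"]["compound"]))  # ['filter', 'must']
```

Filter clauses do not affect the score, so the near clause in should can still rank closer restaurants higher among the results that pass the cutoff.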