apache / couchdb

Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
https://couchdb.apache.org/
Apache License 2.0
6.2k stars 1.03k forks source link

Slow `_find` query when no documents match selector #1549

Closed ghost closed 6 years ago

ghost commented 6 years ago

_find returns results within 400-800ms when documents match the selector, but takes > 25000ms when no documents match the selector.

Expected Behavior

Expect result response to be at least as responsive when no documents match selector.

Current Behavior

Response takes >25000ms to return if no documents match the selector.

Possible Solution

Nil

Steps to Reproduce (for bugs)

Database contains 15k+ documents of athlete news.

Sample document:

{
  "_id": "athlete.news:news.mlb.sdi.662908",
  "_rev": "1-e0e6c8cef543dcc3b3773c91b225b0a2",
  "meta": {
    "type": "news.athlete",
    "source": "xxxx",
    "xxxx.source": {
      "player-news": "player-news:mlb.20180607"
    },
    "updatedTime": "2018-06-22T00:58:38+00:00",
    "updatedMicroseconds": 1529629118000000
  },
  "league": "MLB",
  "athlete": {
    "id": "athl.mlb.sdi.114027",
    "name": "Wei-Yin Chen",
    "firstName": "Wei-Yin",
    "lastName": "Chen",
    "images": {
      "small": "https://www.xxxxxxxxx.com/usatsimg/image/thumb/250-250/9944051.jpg",
      "large": "https://www.xxxxxxxxx.com/usatsimg/image/thumb/650-650nw/9944051.jpg"
    }
  },
  "team": {
    "id": "team.mlb.sdi.2963",
    "name": "Miami",
    "nickname": "Marlins",
    "abbreviation": "MIA",
    "displayName": "Miami Marlins"
  },
  "dateTime": "2018-06-07T00:22:33-07:00",
  "priority": 3,
  "headline": "Falters in fifth in no-decision",
  "update": "Chen allowed two runs on six hits and two walks with two strikeouts over 4.1 innings in a no-decision Wednesday against St. Louis.",
  "analysis": "Chen has completed the fifth inning just three times in eight starts this season. When a walk and a single left Chen at 91 pitches with one out in the fifth, the Marlins knew his night needed to end before the four-run lead they had built dissipated into nothing. Chen induced seven groundballs, but three of the six he gave up went for extra bases, and he has reliever Drew Rucinski to thank for cleaning up his mess in the fifth. Chen owns a 5.86 ERA and will look to pick up his first win since his first start of the season when he takes the mound again Monday against the Giants.",
  "injured": false,
  "publishTime": 1528356153
}

This is the index: POST /{db}/_index

{
    "index": {
        "fields": [{
            "publishTime": "desc"
        }]
    },
    "name": "news-by-dateTime-index",
    "type": "json"
}

The query selector: POST /{db}/_find

{
    "selector":{
        "league":{
            "$in":["MLS","EPL"]
        },
        "priority":{
            "$lte":3
        }
    },
    "fields":["_id","dateTime","priority","league","athlete","team","headline","update","analysis","injured","publishTime"],
    "sort":[{"publishTime":"desc"}],
    "limit":100,
    "bookmark": null,
    "execution_stats": true,
    "use_index":"_design/441fec348e12ff72dbb2563b15af4391ff665c9d"
}
  1. Use the league selector to find all "MLB" news. If documents match, results are returned within 400-800ms.
    {
    "docs": [
        { ... }, {... }, ...
    ],
    "bookmark": "g1AAAABWeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYorJJZk5KSWpOrlpZYXW4EIvdycJL3ilEw9M3NTS0sjkD4OmL4coA7GpOhspvNZWQBdEBpM",
    "execution_stats": {
        "total_keys_examined": 0,
        "total_docs_examined": 152,
        "total_quorum_docs_examined": 0,
        "results_returned": 100,
        "execution_time_ms": 350.547
    }
    }
  2. Use the league selector to find all "EPL" news. If no documents match, an empty result is returned after more than 25000ms.
    {
    "docs": [],
    "bookmark": "nil",
    "execution_stats": {
        "total_keys_examined": 0,
        "total_docs_examined": 24856,
        "total_quorum_docs_examined": 0,
        "results_returned": 0,
        "execution_time_ms": 34390.737
    }
    }
  3. Makes no difference if selector is changed to $eq
    {
    "selector":{
        "league":{
            "$eq":"EPL"
        },
    ...

Context

This data is returned to a frontend client directly via a websocket request and affects the app's responsiveness.

Your Environment

wohali commented 6 years ago

Hi @CWX-iggy, thanks for your issue.

Unfortunately, you need to know that the $in operator has to do a full scan on every query on every document in the index. This cannot be pre-determined given the way Mango builds indexes. Mango will index the full contents of the field you are searching on (here, league) but has to at query time do a manual comparison of the contents of that index, per-document, with your parameters at query time. You will experience similar poor performance with other inexact operators, such as the $regex operator. @tonysun83 can you confirm my understanding of Mango indexing here?

I agree this is not clearly called out in the documentation, and for that I apologise. Would you agree with me converting your issue to one that would let us improve our documentation on this point?

Your choices to improve performance are to use Mango partial indexing to reduce the number of documents in your index (best option), use $eq or another exact operator instead of $in (next best, but you may not be able to capture your query, or try something like Clouseau/Dreyfus for full text search capability instead (hardest to set up, requires JDK6 and Apache Lucene, but is more flexible in terms of partial term searching). This last option lets you use text-based Mango indexes if you want to stay with Mango.

I recommend starting with partial indexing. You can use this to make multiple sub-indexes that reduce the likelihood of few documents being found. Partial indexes are especially good to filter documents by type, for which it looks like your meta.type or league fields might be good candidates. You may even find that running 2 mango queries (for instance, across 2 leagues) across 2 partial indexes and then merging the results in your client / app layers will be faster than running the single query across the entire content database.

wohali commented 6 years ago

One other thing I noticed - your query may not be using the right index. Can you check with /db/_explain that the index is indeed being used? I don't see you indexing the field you're querying against here, i.e. league.

Documentation on the explain endpoint is here: http://docs.couchdb.org/en/2.2.0/api/database/find.html#db-explain

ghost commented 6 years ago

@wohali I suspect that may the issue - no index on league (or priority). How do I specify multiple use_index? This doesn't seem to work:

{..."use_index": ["_design/filter_by_league","_design/sort_by_date"]}

This is the response from /db/_explain

{
    "error": "error",
    "reason": "{invalid_object,{<<\"\ufffd\ufffd\ufffd\ufffd\">>}}"
}

Fauxton simply reports error fetching query plan.

wohali commented 6 years ago

You can't. Mango underneath is using the same index file format as MapReduce - you're still querying only a single index at once. You'll need to index all the fields you need in a single index.

That's weird on your /db/_explain response, you may need to upgrade to 2.2.0 for some bugfixes. Or I just realized - you might be trying to send it the query with a list-based use_index, which is invalid as I explained.

Closing as this isn't strictly a Mango issue.

wohali commented 6 years ago

Happy to keep discussing it with you in this ticket if you like.

garrensmith commented 6 years ago

@wohali is right that the query that takes long is doing a full index scan on _all_docs as you can see from the stats:

"execution_stats": {
        "total_keys_examined": 0,
        "total_docs_examined": 24856,
        "total_quorum_docs_examined": 0,
        "results_returned": 0,
        "execution_time_ms": 34390.737
    }

The one that is faster is because it uses the "publishTime" index. Mango needs a way to know to select that index on your queries. The best way to do that is to add a publishTime into your selector even if its something like publishTime: {$gt: null}. That way it should always select that index. You can also add in the use_index to specify that index.

But running _explain on your queries will give you some information around what is happening internally and which index is being selected.

Since you using league in your query it might be worth adding that into the index as well even if you only use it with the $in selector. That way only the docs with a league field and a publishTime field will be in your index.

willholley commented 6 years ago

just to clarify why the index is being used in the first instance, "sort":[{"publishTime":"desc"}] adds an implicit "publishTime": { "$exists": true } to the "selector". If you're not specifying the sort in subsequent queries, you could add the selector clause explicitly to allow the index to be used (as @garrensmith suggested above).

It sounds like the most performant option would be to create an index on ["league", "publishTime"] and issue multiple requests - one for each league. You'd then need to merge/sort the results on the client.