elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.98k stars 24.75k forks source link

Add support for collapse sort to specify how to sort groups relative to each other #45646

Open casertap opened 5 years ago

casertap commented 5 years ago

This is a feature request.

At the moment, field collapsing will use the same scoring as the main query to choose the document that will appear in the main search results hits.

it would be good to have a score for the main query but another score to choose which one is going to be chosen within the field-collapsed set.

We can use inner_hit to sort the inner hits but this does not change the top document that has been chosen in the main hits. We can then manually choose the inner_hit..hits.hits[0] instead of the top hits.

I think the collapse feature would benefit from having a collapse_sort that will define the hit that will be chosen at the top level.

Thanks

elasticmachine commented 5 years ago

Pinging @elastic/es-search

mayya-sharipova commented 5 years ago

@casertap Thank you for filling the issue. From the issue description it is not clear to me what you are requesting. May be if you provide an example this would help.

Some other clarifications:

  1. Are you interested in sorting collapsed docs by textual score from a query or sorting by some field? I see you mentioned both of these in your description.
  2. We can use inner_hit to sort the inner hits but this does not change the top document that has been chosen in the main hits. ... I think the collapse feature would benefit from having a collapse_sort that will define the hit that will be chosen at the top level.

You can already do this. If you need to sort main collapsed hits just use sort parameter on the top level, like in this example

casertap commented 5 years ago

Hi @mayya-sharipova Thanks for your answer and for your link. I do not think this is what I am talking about.

I created a gist to try to explain the feature I would like to see: https://gist.github.com/casertap/9a2e6b9b1eee02b2b2111e6b446fb1ed

Please do not hesitate to ask more questions if this is not clear. Thank you

casertap commented 5 years ago

Basically, this would be bringing the top inner_hit to the search main hits during collapsing.

jimczi commented 5 years ago

I agree that it would be nice to be able to sort the group differently than the main sort but it's not a low hanging fruit. Today collapsing works in a single pass over the data because the sort within the group is the same than the main sort. If we allow a different sort we'll need two passes over the entire documents that match the query, the first pass to retrieve the top doc per group and the second pass to rank the remaining top docs. I'll mark this issue as adoptme because I don't have time to work on this now but I'd be happy to help if someone wants to tackle this.

kehphin commented 4 years ago

Hi,

We would find this addition very useful as well. Our use case is that we want to sort on most recent document of each collapsed bucket. Currently we reach a dead end as follows:

  1. Use collapse to bucket documents by their name
  2. Use inner_hits.sort with size: 1 to order the bucket by the most recent first and grabbing the first entry
  3. Sort the results with an outer level sort.

Currently, Step 3 will not sort the results by the first item in the inner_hits, but by the top document chosen by ES for each collapse bucket, which leads to incorrect results if the document chosen by ES is different than the first document in the ordered inner hits.

@jimczi & @mayya-sharipova - Do you have any updates on the timeline for addressing this issue? If not, are you familiar with what areas of the code need to be changed and the level of effort to tackle this feature?

Thanks!

erayarslan commented 3 years ago

Hi @mayya-sharipova, @jimczi

I started working on collapse sort feature. While working on this feature, i had some questions so I decided to ask it to you before sending a pull request. Here are some approaches I tried.

Example Data;

{
    "brand": "Adidas",
    "price": 10,
    "order": 1
}
{
    "brand": "Adidas",
    "price": 5,
    "order": 2
}
{
    "brand": "Nike",
    "price": 3,
    "order": 1
}

Expected Result;

Group documents by brand and sort documents by the price of document with lowest order for each group

Steps:

1) Group by brand 2) Sort document in each group by lowest order 3) Select top doc of each group and sort groups by highest price

Results should looks like this.

{
    "brand": "Adidas",
    "price": 10,
    "order": 1
}
{
    "brand": "Nike",
    "price": 3,
    "order": 1
}

Current query:

{
    "sort": [
        {
            "price": {
                "order": "desc"
            },
            "order": {
                "order": "asc"
            }
        }
    ],
    "collapse": {
        "field": "brand"
    }
}

This query sorts both groups and documents in each group by same fields(price:desc, order:asc). But we need differents sorting fields for groups and documents in each group. Because CollapsingTopDocsCollector.java -> FirstPassGroupingCollector.java class of Lucene does not support two sorting options.

So this is my proposal to add functionality to sort groups by fields of top document when using field collapsing by separating sorting of collapsed groups and sorting of documents in each group from each other.

{
    "sort": [                      <- Top Sort
        {
            "price": {
                "order": "desc"
            }
        }
    ],
    "collapse": {
        "field": "brand",
        "sort": [                  <- Collapse Sort
            {
                "order": {
                    "order": "asc"
                }
            }
        ]
    }
}

And i tried two solutions:

Solution 1: After reading @jimczi 's comment, i created a new class(SortingCollectedSearchGroups) that implements SecondPassGroupingCollector class. 1) Run CollapsingTopDocsCollector(Extends from FirstPassGroupingCollector) with Collapse Sort if collapse sort exists in the query (if does not exist it continues to use top sort) 2) After first pass finished its job, i run a second pass by using SortingCollectedSearchGroups class to sort for top docs in each group which are coming from the first pass. This uses top sort if top sort exists in the query. If top sort does not exist, it uses Sort.RELEVANCE.

This solution worked fine at first. But after some tests its failed on pagination :/

Because first pass still supports single sort option and lucene uses single sort option to sort both groups and documents in groups in FirstPassGroupingCollector.java.

So second pass implementation is quite useless in this case.

Do you think is there anything I am missing about SecondPassGroupingCollector?

Solution 2: I worked on FirstPassGroupingCollector to support both top sort and collapse sort. But this code is on Lucene. 1) For my PoC, I copied FirstPassGroupingCollector to Elasticsearch and added suppor for collapse sort. 2) After that, I changed CollapsingTopDocsCollector to extend from my custom FirstPassGroupingCollector instead of current FirstPassGroupingCollector.

This solution worked fine and covered pagination.

So should I open a pull request to Lucene to change FirstPassGroupingCollector or should I copy and FirstPassGroupingCollector to Elasticsearch as I explained above. Is there anything I am missing or is there any other solution that you can suggest?

Thanks.

wolffclw commented 3 years ago

@erayarslan I am also in need of this feature. I'm wondering if it is possible to package your solution as a plugin. If so, users in need of this could take advantage of the plugin while the support of it into the main product is being worked.

erayarslan commented 3 years ago

@wolffclw I don't intend to continue the solutions as plugin. Here is my draft commits for these solutions. (no unit test and not production ready, just a draft for proposal)

Solution 1 https://github.com/erayarslan/elasticsearch/commit/de4e9d2bd5059759921ca5b35d43bf9419c84271 Solution 2 https://github.com/erayarslan/elasticsearch/commit/0c5948de182f8d37c98066ccd40f405f47b15d33

@jimczi & @mayya-sharipova Can you please take a look?

I need a suggestion for my proposal before I continue.

Thanks.

PEscar commented 2 years ago

i am looking for this feature....

I need to collapse by one field, but choose collapsed doc by a condition.

fadedDexofan commented 2 years ago

Much needed functionality. @mayya-sharipova @jimczi can you help @erayarslan finish this?

Yashas-H-S commented 2 years ago

I am migrating from solr to ElasticSearch, and this was one of the key features which I found missing. Would be great if we have this in Elastic

erayarslan commented 2 years ago

I think I can move forward with Solution 2 after b0a829ed1300afcc42d9f431e72bbffcd3cc8a05 change. I sent (#89331) pull request about it. Waiting for your feedbacks. Thanks.

mayya-sharipova commented 2 years ago

@erayarslan Sorry for a late reply, this feature completely slipped under the radar. But we would like to resume the work on it.

Let's first finalize the API. Is this a good summary of your proposal?


Add sort parameter within the collapse group to define the way to sort collapse groups.

Given the documents:

{
    "brand": "Adidas",
    "price": 10,
    "order": 1
}
{
    "brand": "Adidas",
    "price": 5,
    "order": 2
}
{
    "brand": "Nike",
    "price": 3,
    "order": 1
}

We need to: 1) Group by brand 2) Sort document in each group by lowest order 3) Select top doc of each group and sort groups by highest price

The query will be (note, it is a query is different from the one you proposed here):

{
    "sort": [                      <- Top Sort 
        {"order": "desc"}
    ],
    "collapse": {
        "field": "brand",
        "sort": [                  <- Collapse Sort
            {"price": "asc"}
        ]
    }
}
elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

erayarslan commented 2 years ago

@mayya-sharipova Yes, Its very good summary. And thanks for query changes. I think it's better. So should i change my pr based on the your query contract above?

alfuken commented 2 years ago

Hey, nice to see this getting some love! While we're at it, may I ask/suggest a slight extension to the discussed feature?

Ability to sort over an explicitly provided order of elements.

In the provided example, to the three example products add a field “partner_id” with corresponding values 1, 2, and 3

In the provided example, the adjusted query would look like this:

{
    "sort": [                      <- Top Sort 
        {"order": "desc"}
    ],
    "collapse": {
        "field": "brand",
        "sort": [                  <- Collapse Sort
            {"provider_id": [4, 2, 777, 1, 3]}
        ]
    }
}

A result would be record #2, with provider_id==2 and price 5

Thank you in advance! :)

mayya-sharipova commented 2 years ago

@alfuken Sorry, I did not understand what you mean from your example. I think it is better not to complicate the discussion. Can you please file a separate issue, and provided more detailed context (sample docs, query, and a sample response).

mayya-sharipova commented 2 years ago

So should i change my pr based on the your query contract above?

@erayarslan Actually, I thought more about it, and I like your design better. Let's move discussion to your PR.

alfuken commented 2 years ago

What I mean is literally this kind of sorting: https://stackoverflow.com/a/6332081 and https://stackoverflow.com/a/8322898

Why I filed this as a part of this issue? Because it’s basically just a third argument to ordering: there is “asc”, “desc”, and (what I’m talking about) an “explicitly ordered list of values”, as simple as that. Those two links to SO explain better than I can.

On Mon 12. Sep 2022 at 18:06, Mayya Sharipova @.***> wrote:

@alfuken https://github.com/alfuken Sorry, I did not understand what you mean from your example. I think it is better not to complicate the discussion. Can you please file a separate issue, and provided more detailed context (sample docs, query, and a sample response).

— Reply to this email directly, view it on GitHub https://github.com/elastic/elasticsearch/issues/45646#issuecomment-1243960263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAASM4IM5AXFPGUYQFPWIPLV55IHFANCNFSM4IMFPEDA . You are receiving this because you were mentioned.Message ID: @.***>

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Abhineeth09 commented 3 months ago

Hi @mayya-sharipova, any updates on if this feature is being worked on? This feature will be very critical to the work my team is doing. I see that a PR is open and is awaiting feedback, any updates on getting it merged will be helpful.

GeeWee commented 4 days ago

Hey @mayya-sharipova / @mark-vieira any updates or timelines on this? It's also crucial for our adoption of ElasticSearch

edstripe commented 3 days ago

Hi @mayya-sharipova @mark-vieira are there any updates on this? We also need to implement this, it is also crucial for our use of ElasticSearch, any update on time would be very useful to us, thanks! Appreciated