crnk-project / crnk-framework

JSON API library for Java
Apache License 2.0

Faceted Search: Why are all facets evaluated for every call? #573

Open nwinkler opened 4 years ago

nwinkler commented 4 years ago

I'm playing around with the experimental faceted search feature (see #413 and #421). It looks great and is pretty close to what I'm looking for. I want to use it to run aggregations on my stored entities, similar to SQL's GROUP BY functionality. While it does what I need, it seems rather inefficient in how it prepares the results.

Consider this entity class (constructors and getters/setters omitted):

public class Foo {
    @JsonApiId
    private int id;

    @Facet
    private int accountID;

    @Facet
    private int companyID;

    @Facet
    private String currency;
}

I want to be able to group by any one of the three attributes, i.e. get a count of all Foo instances grouped by account, company, or currency.
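
As a point of reference, the aggregation described here can be sketched in plain Java, independent of Crnk. `FooRow` and `FooGrouping` are hypothetical stand-ins introduced only for this illustration; they mirror the fields of the entity above without the annotations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for the Foo entity (no Crnk annotations).
final class FooRow {
    final int accountID;
    final int companyID;
    final String currency;

    FooRow(int accountID, int companyID, String currency) {
        this.accountID = accountID;
        this.companyID = companyID;
        this.currency = currency;
    }
}

final class FooGrouping {
    // In-memory equivalent of: SELECT currency, COUNT(*) FROM foo GROUP BY currency
    static Map<String, Long> countByCurrency(List<FooRow> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r.currency, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<FooRow> rows = Arrays.asList(
                new FooRow(1, 10, "EUR"),
                new FooRow(2, 10, "GBP"),
                new FooRow(1, 11, "EUR"));
        System.out.println(countByCurrency(rows)); // prints {EUR=2, GBP=1}
    }
}
```

Grouping by accountID or companyID would follow the same pattern with a different classifier.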

Crnk provides a handy resource for requesting a specific grouping: http://localhost:8080/api/facet/foo_currency

This returns the results in the form that I expect them to be in:

{
  data: {
    id: "foo_currency",
    type: "facet",
    values: {
      EUR: {
        label: "EUR",
        value: "EUR",
        filterSpec: {
          path: "currency",
          operator: "EQ",
          value: "EUR",
          expression: null
        },
        count: 12
      },
      GBP: {
        label: "GBP",
        value: "GBP",
        filterSpec: {
          path: "currency",
          operator: "EQ",
          value: "GBP",
          expression: null
        },
        count: 15
      }
    },
    name: "currency",
    groups: { },
    resourceType: "foo",
    labels: [
      "EUR",
      "GBP"
    ],
    links: {
      self: "http://localhost:8083/facet/foo_currency"
    }
  }
}

When stepping through the code to identify how I can bind this to my backend data store, I noticed that Crnk calls the FacetProvider.findValues(FacetInformation facetInformation, QuerySpec querySpec) method multiple times, once for each defined facet. So in my case, it calls it once for accountID, once for companyID, and then a third time for currency. All grouping results from the three queries are put into a result list, and then in a final step, only the foo_currency item is kept, while the other two results are dropped.

This seems highly inefficient to me. I'm requesting one piece of information (group the Foo instances by currency) and Crnk runs three separate grouping queries (accountID, companyID, currency) for this. If we have a backend data store with a huge amount of data behind this, the current behavior will result in a couple of pretty expensive backend queries, with most of the results neither required nor used.
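
The flow described above can be modeled roughly as follows. This is a simplified sketch of the observed behavior, not Crnk's actual code: `EagerFacetLookup`, `findById`, and the static `findValues` stand-in are hypothetical names, and the id format `<resourceType>_<name>` is inferred from the example response.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model of the observed behavior; this is not Crnk's actual code.
final class EagerFacetLookup {
    // Counts how often the (potentially expensive) backend query runs.
    static int providerCalls = 0;

    // Stand-in for FacetProvider.findValues: one grouping query per facet.
    static Map<String, Long> findValues(String facetName) {
        providerCalls++;
        return Collections.singletonMap(facetName + "-value", 1L);
    }

    // Evaluates every facet of the resource, then keeps only the requested id.
    static Map<String, Long> findById(String resourceType, List<String> facetNames, String requestedId) {
        Map<String, Long> match = null;
        for (String name : facetNames) {
            Map<String, Long> values = findValues(name); // runs even if the result is discarded
            if ((resourceType + "_" + name).equals(requestedId)) {
                match = values;
            }
        }
        return match;
    }

    public static void main(String[] args) {
        findById("foo", Arrays.asList("accountID", "companyID", "currency"), "foo_currency");
        System.out.println(providerCalls); // prints 3: three queries for one requested facet
    }
}
```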

Within my own FacetProvider's findValues method, I did not see any means to understand where the call was originating from, so that I could run only the one query that was requested (currency) and return empty results for the others that are not needed.

Am I missing something, or am I doing this wrong? Please let me know if you need more information about my use case.

(I'm really grateful for the facet support, and I understand that it's experimental so far. Thanks for building this functionality! As requested in the documentation, I'm trying to provide feedback here; please don't take this as criticism...)

remmeier commented 4 years ago

Feedback is very welcome :-) To your question: the various FacetProvider implementations are used by FacetRepositoryImpl. Among other things, it contains an applyQuickFilter method. Currently this honors name and resourceType, but not yet id, which is a combination of those two fields. So a small extension there should bring the desired performance boost.
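
A minimal sketch of what such an extension could look like, assuming the id is `<resourceType>_<name>` as in the example response (`foo_currency`). `FacetInfo` and `QuickFilterSketch` are hypothetical stand-ins and do not reflect the real signatures of FacetRepositoryImpl or applyQuickFilter. The idea is to narrow the list of facets before any provider is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the facet metadata; not Crnk's FacetInformation.
final class FacetInfo {
    final String resourceType;
    final String name;

    FacetInfo(String resourceType, String name) {
        this.resourceType = resourceType;
        this.name = name;
    }

    // Assumed id format, taken from the example response: "<resourceType>_<name>".
    String id() {
        return resourceType + "_" + name;
    }
}

final class QuickFilterSketch {
    // Keeps only facets matching every non-null criterion; null means "no constraint".
    static List<FacetInfo> select(List<FacetInfo> all, String resourceType, String name, String id) {
        List<FacetInfo> selected = new ArrayList<>();
        for (FacetInfo facet : all) {
            boolean matches = (resourceType == null || resourceType.equals(facet.resourceType))
                    && (name == null || name.equals(facet.name))
                    && (id == null || id.equals(facet.id()));
            if (matches) {
                selected.add(facet);
            }
        }
        return selected;
    }
}
```

With the three facets of Foo, `select(all, null, null, "foo_currency")` would leave a single entry, so only one backend query would need to run.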

Many use cases show all facets on the left (like Amazon), so requesting and optimizing a single facet has not come up yet.

nwinkler commented 4 years ago

Thanks for the pointer - I'll take a look!

Our use case is more like a dashboard, where you have various charts showing information, including some aggregations on the data. Each widget on the dashboard fires its own request, and on the backend, we want each request to fetch only the data it needs, nothing else...

nwinkler commented 4 years ago

I played around with this a bit more, and I noticed that instead of using this (which triggers all facets):

http://localhost:8080/api/facet/foo_currency

I can use this, which only triggers the currency facet:

http://localhost:8080/api/facet?filter[name]=currency

This works fine for me, a lot better than the first link. The one issue that I see with that is the generated self link, which uses the first link:

{
  "data": [{

    "links": {
      "self": "http://localhost:8080/api/facet/foo_currency"
    }
  }]
}
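
For a client that wants to switch between the two request forms above, a small helper could build both URLs. This is only a sketch: `FacetUrls` is a hypothetical name, and percent-encoding the brackets in `filter[name]` is an assumption for clients that reject raw brackets; the unencoded form shown above is what was actually tested in this thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Crnk.
final class FacetUrls {
    // Direct lookup by id, which currently evaluates all facets.
    static String byId(String baseUrl, String facetId) {
        return baseUrl + "/facet/" + facetId;
    }

    // Lookup by name filter, which only evaluates the matching facet.
    static String byName(String baseUrl, String facetName) {
        String key = URLEncoder.encode("filter[name]", StandardCharsets.UTF_8);
        String value = URLEncoder.encode(facetName, StandardCharsets.UTF_8);
        return baseUrl + "/facet?" + key + "=" + value;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/api";
        System.out.println(byId(base, "foo_currency"));
        // prints http://localhost:8080/api/facet/foo_currency
        System.out.println(byName(base, "currency"));
        // prints http://localhost:8080/api/facet?filter%5Bname%5D=currency
    }
}
```
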

remmeier commented 4 years ago

Yeah, it works like this. But addressing it directly by ID should be supported/fixed as well.

remmeier commented 4 years ago

I can take care of it in the coming days.