gbv / coli-ana

API to analyze DDC numbers
https://coli-conc.gbv.de/coli-ana/app/
MIT License
2 stars 0 forks source link

Improve speed of lookups / reconsider lookups #28

Closed stefandesu closed 2 years ago

stefandesu commented 3 years ago

Lookups of notations which are contained in a lot of analyses take a long time. This is, however, not due to requesting the data, but only due to counting the total number of results for a query. For example, for the lookup of 3, there are about 378,000 results, and due to the nature of exact counts in PostgreSQL, counting these takes a few seconds.

I've tried to find a solution for this, but the only one I've found was to use table statistics for an estimated count for a given query, and for our particular use case those are vastly inaccurate. (For the above query, it returns the number 248,500, so it's off by more than 100k.)

Another question is whether we actually need the total number of results for a lookup. Is this number even helpful? Is there a use case to navigating to the last page, as it is currently possible?

I have two suggestions for potential solutions:

  1. For a given request, do not query the total number of results in the database, but query one more row than is actually requested. If that additional row exist, we return some information that "there are more results after this page" (i.e. the interface can show the next page button), if that additional row does not exist, then there are no more results after this (i.e. the interface should not show the next page button).

  2. Make separate requests for the results of a request and for the total count of results. Here, we would be able to show the results in the interface quickly, but the interface wouldn't know immediately how many results there are (since it might have to wait for the second request with that information). The big issue here is that it would only solve one half of the issue, mainly that it would be quicker for the user. But the server would still need to calculate the total number of results, and if there were a lot of lookups, it would severely slow things down. Not that we are expecting that, but we need to keep this in mind.

What do you think, @nichtich ?

nichtich commented 3 years ago

Use solution 1. Exact statistics can better be done offline, if needed.

stefandesu commented 3 years ago

Done in dev. The above lookup now is pretty instant instead of taking 24 seconds.

stefandesu commented 3 years ago

It's still slow. The cause seems to be order by uri; without it, the result is instant. However, for lookups I think it makes sense to return them sorted, so I'm not sure how to solve this.

stefandesu commented 3 years ago

I have also noticed that the Lookup functionality provides very little usefulness because for most numbers there are way too many results. Only for very specific numbers (where there are few results) it is actually useful and for those it should be fast.

stefandesu commented 2 years ago

There was also discussion about removing the backend database and simply rely on vc_day_srv. In that case, we wouldn't have lookups at all.

nichtich commented 2 years ago

Lookup has been removed.

stefandesu commented 2 years ago

Does this mean we can remove Lookup completely, i.e. including the backend code? So far, I've only removed the buttons.