ec-doris / kohesio-backend

APIs serving Kohesio's frontend
https://kohesio.ec.europa.eu

Keyword search limited to 5000 results #78

Open madewild opened 3 years ago

madewild commented 3 years ago

For instance: https://dev.kohesio.eu/projects?keywords=great

Is this linked to the performance issues of the semantic search? It is annoying because the map is then not representative of the overall situation...

madewild commented 2 years ago

Increasing the limit leads to performance issues, but currently the maps and filters are misleading... We need to think about it.

madewild commented 2 years ago

For instance, https://dev.kohesio.eu/projects?keywords=%22road%22&country=Sweden gives only 6 results, but we have at least 37: https://query.linkedopendata.eu/#select%20DISTINCT%20%3Fproject%20where%20%7B%0A%20%20%20%20%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP35%3E%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fentity%2FQ9934%3E%20.%0A%20%20%20%20%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP32%3E%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fentity%2FQ11%3E%20.%0A%20%20%20%20OPTIONAL%20%7B%7B%3Fproject%20rdfs%3Alabel%20%3Flabel%20filter%28lang%28%3Flabel%29%20%3D%20%27en%27%29%20%7D%7D%0A%20%20%20%20OPTIONAL%20%7B%7B%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP836%3E%20%3Fsummary%20filter%28lang%28%3Fsummary%29%20%3D%20%27en%27%29%20%7D%7D%0A%20%20%20%20FILTER%20%28regex%28%3Flabel%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29%20%7C%7C%20regex%28%3Fsummary%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29%29%0A%7D
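
For readability, the final FILTER clause of the percent-encoded query link above can be decoded with Python's standard library. This is only a small sketch: the variable names are illustrative, and just the last clause of the link is shown.

```python
from urllib.parse import unquote

# Final FILTER clause copied from the query link above (still percent-encoded).
encoded_filter = (
    "FILTER%20%28regex%28%3Flabel%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29"
    "%20%7C%7C%20regex%28%3Fsummary%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29%29"
)

# unquote() turns each %XX escape back into the character it stands for.
decoded_filter = unquote(encoded_filter)
print(decoded_filter)
```

Once decoded, the clause is a case-insensitive, whole-word regex match for "road" against either the English label or the English summary of a project.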

madewild commented 2 years ago

The 5000 limit is configured in the SPARQL endpoint for the Lucene part. @D063520 @DiaZork @svili, any idea how we could overcome this limitation without sacrificing performance too much?

madewild commented 2 years ago

@svili this is more long term, but when you have time could you look into this? There is no easy way out, but it would be important to improve the current situation at least a bit...

One idea would be to load the geo info of all projects in the background and continue computing the map while the top 15 paginated results are already displayed, but I am not sure how this would play with the UI.
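
A rough sketch of that idea, assuming an asynchronous client: the slow geo lookup starts as a background task, the first page is shown as soon as it arrives, and the map is filled in afterwards. `fetch_page`, `fetch_all_geo` and the two callbacks are hypothetical stand-ins, not existing Kohesio functions.

```python
import asyncio

PAGE_SIZE = 15  # the thread mentions the top 15 paginated results


async def fetch_page(keyword, offset, limit):
    """Hypothetical stand-in for the fast, paginated search call."""
    await asyncio.sleep(0.01)
    return [f"{keyword}-project-{i}" for i in range(offset, offset + limit)]


async def fetch_all_geo(keyword, total):
    """Hypothetical stand-in for the slow lookup of every project's geo info."""
    await asyncio.sleep(0.05)
    return [(keyword, i) for i in range(total)]


async def search(keyword, total, on_page, on_map):
    # Start the slow geo lookup first so it runs in the background.
    geo_task = asyncio.create_task(fetch_all_geo(keyword, total))
    # Show the first page of results without waiting for the map data.
    on_page(await fetch_page(keyword, 0, PAGE_SIZE))
    # Update the map only once the background lookup has finished.
    on_map(await geo_task)
```

In this sketch the list is always rendered before the map, which is the behaviour the UI would need to cope with.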

madewild commented 2 years ago

Now we have 112 790 results when searching for "youth": https://dev.kohesio.eu/projects?keywords=youth&sort=Total-Budget-(descending)

Very strange, and even the map tab has many projects! To be investigated...

madewild commented 2 years ago

@svili could you investigate this?

madewild commented 2 months ago

Update: https://kohesio.ec.europa.eu/en/projects?keywords=great now loads 50 000 results instead of 5 000, but it takes a long time. Not sure what the ideal solution is. @Dgojsic any preference?

svili commented 2 months ago

@D063520 it seems we forgot to take out this temporary limit lift after all

Dgojsic commented 2 months ago

@madewild I think a higher threshold is preferable, as it gives visitors a better picture. In addition, I don't think people will use ultra-broad terms like "good" to actually search for projects, so the likelihood of them encountering long waiting periods is probably low.

I also did some testing with "housing" as a keyword, and the 31k or so results appeared basically instantaneously, whereas "good" does indeed take a few seconds to load. I am curious to find out how the limit relates to search duration compared with other factors, such as the breadth of the additional keywords brought in by the semantic search.

madewild commented 1 month ago

so @D063520 @svili what's the status now?

D063520 commented 1 month ago

Hi Max, I would keep it like this. Doing otherwise would basically turn the request from a transactional one into an analytical one. When you search Google for "great", it also gives you only the top results, not every page containing the word.

The options are: 1) we keep it like this, as most applications do; 2) we change it, but then we are carrying out a lot of work for something that is not important for many users.

madewild commented 1 month ago

I checked https://kohesio.ec.europa.eu/en/projects?keywords=housing and it's taking 20 seconds to load 30 794 results. @svili mentioned that the 50000 limit was a temporary threshold you forgot to remove in prod, what about that?

D063520 commented 1 month ago

Yes, you are right, that should not have moved to prod! It was a change that someone from REGIO wanted for a particular use case. I fixed it in dev. It will move to staging tonight.

Dgojsic commented 1 month ago

Indeed, the change was made because Roberto was performing analysis using keywords on our end. A lower limit in prod is fine; however, it would be useful to have an option (even if just in dev) that allows us (and mostly just us, not the general user) to use a higher limit in case we need to do analysis. In that case the response time is not very important either, as it is just us using it.
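
To illustrate the kind of option meant here, a minimal sketch that assumes the cap could be chosen per environment. The function, its parameters and the environment names are hypothetical; only the two figures, 5 000 and 50 000, come from this thread.

```python
# Caps quoted in this thread: the usual limit and the temporary lift.
DEFAULT_LIMIT = 5_000
ANALYSIS_LIMIT = 50_000


def effective_limit(requested: int, environment: str, is_analyst: bool = False) -> int:
    """Clamp the requested number of results to the cap for this caller.

    Hypothetical policy: only analysts working on dev get the higher cap.
    """
    cap = ANALYSIS_LIMIT if (environment == "dev" and is_analyst) else DEFAULT_LIMIT
    return max(0, min(requested, cap))
```

For example, `effective_limit(100_000, "prod")` would stay at the lower cap, while the same request from an analyst on dev would be allowed up to the higher one.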

D063520 commented 1 month ago

The problem is that this is currently a parameter that is hard-coded in the SPARQL endpoint and not very easy to change. We can open an issue for that. Basically, we need to make a pull request to this repo, https://github.com/eclipse-rdf4j/rdf4j, which takes care of the query parsing and optimisation.

D063520 commented 1 month ago

Closing this based on the feedback of Damir.