madewild opened this issue 3 years ago
Increasing the limit leads to performance issues, but currently the maps and filters are misleading... Need to think about it.
For instance https://dev.kohesio.eu/projects?keywords=%22road%22&country=Sweden gives only 6 results, but we have at least 37: https://query.linkedopendata.eu/#select%20DISTINCT%20%3Fproject%20where%20%7B%0A%20%20%20%20%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP35%3E%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fentity%2FQ9934%3E%20.%0A%20%20%20%20%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP32%3E%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fentity%2FQ11%3E%20.%0A%20%20%20%20OPTIONAL%20%7B%7B%3Fproject%20rdfs%3Alabel%20%3Flabel%20filter%28lang%28%3Flabel%29%20%3D%20%27en%27%29%20%7D%7D%0A%20%20%20%20OPTIONAL%20%7B%7B%3Fproject%20%3Chttps%3A%2F%2Flinkedopendata.eu%2Fprop%2Fdirect%2FP836%3E%20%3Fsummary%20filter%28lang%28%3Fsummary%29%20%3D%20%27en%27%29%20%7D%7D%0A%20%20%20%20FILTER%20%28regex%28%3Flabel%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29%20%7C%7C%20regex%28%3Fsummary%2C%20%22%5C%5Cbroad%5C%5Cb%22%2C%20%22i%22%29%29%0A%7D
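For readability, this is that query URL-decoded (lightly reformatted; it matches whole-word "road" in English labels or summaries):

```sparql
SELECT DISTINCT ?project WHERE {
  ?project <https://linkedopendata.eu/prop/direct/P35> <https://linkedopendata.eu/entity/Q9934> .
  ?project <https://linkedopendata.eu/prop/direct/P32> <https://linkedopendata.eu/entity/Q11> .
  OPTIONAL { ?project rdfs:label ?label FILTER(lang(?label) = 'en') }
  OPTIONAL { ?project <https://linkedopendata.eu/prop/direct/P836> ?summary FILTER(lang(?summary) = 'en') }
  FILTER (regex(?label, "\\broad\\b", "i") || regex(?summary, "\\broad\\b", "i"))
}
```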
The 5000 limit is configured in the SPARQL endpoint for the Lucene part. @D063520 @DiaZork @svili any idea how we could overcome this limitation without sacrificing too much performance?
@svili this is more long-term, but could you look into it when you have time? There is no easy way out, but it would be important to improve the current situation at least a bit...
One idea would be to load the geo info of all projects in the background and keep computing the map while the top 15 paginated results are already displayed, but I am not sure how this would play with the UI. See the sketch below.
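A minimal sketch of that sequencing, in Java for illustration only (none of these class or method names exist in the Kohesio codebase; they are stand-ins for the real calls):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: render the top 15 paginated results immediately,
// while the geo info for *all* matching projects loads in the background
// and updates the map tab once it is ready.
public class BackgroundMapLoading {

    record Project(String id, String label) {}
    record GeoPoint(double lat, double lon) {}

    // Stand-in for the paginated search call.
    static List<Project> loadPage(String keywords, int offset, int limit) {
        return List.of(new Project("Q1", "example project"));
    }

    // Stand-in for fetching coordinates of every match (the slow part).
    static List<GeoPoint> loadAllGeoInfo(String keywords) {
        return List.of(new GeoPoint(59.33, 18.07));
    }

    public static void main(String[] args) {
        // 1. Fetch and display the first page right away.
        List<Project> firstPage = loadPage("road", 0, 15);
        System.out.println("showing " + firstPage.size() + " results");

        // 2. Meanwhile, compute the full map layer asynchronously and
        //    push it to the map tab when it completes.
        CompletableFuture
                .supplyAsync(() -> loadAllGeoInfo("road"))
                .thenAccept(pts -> System.out.println("map ready: " + pts.size() + " points"))
                .join(); // join() only so this demo waits before exiting
    }
}
```

The open question from the comment above still applies: the UI would need to tolerate a map that starts incomplete and refreshes once the background load finishes.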
Now we have 112 790 results when searching for "youth": https://dev.kohesio.eu/projects?keywords=youth&sort=Total-Budget-(descending)
Very strange, and even the map tab shows many projects! To investigate...
@svili could you investigate this?
Update: https://kohesio.ec.europa.eu/en/projects?keywords=great now loads 50 000 results instead of 5 000, but it takes a long time. Not sure what the ideal solution is. @Dgojsic any preference?
@D063520 it seems we forgot to take out this temporary limit lift after all
@madewild I think a higher threshold is preferable, so that visitors get a better picture. In addition, I don't think people will use ultra-broad terms like "good" to actually search for projects, so the likelihood of them running into long waits is probably low.
I also did some testing with "housing" as a keyword, and the 31k or so results appeared basically instantaneously, vs. "good", which indeed takes a few seconds to load. Curious to find out how much the limit drives search duration vs. other factors, such as the breadth of additional keywords pulled in by semantic search.
So @D063520 @svili, what's the status now?
Hi Max, I would keep it like this. Doing otherwise would basically transform the request from a transactional request into an analytical one. When you search Google for "great", it also gives you only the top results, not every page containing the term.
So either:
1) we keep it like this, like most applications do, or
2) we change it, but then we are carrying out a lot of work for something that is not important for most users.
I checked https://kohesio.ec.europa.eu/en/projects?keywords=housing and it takes 20 seconds to load 30 794 results. @svili mentioned that the 50 000 limit was a temporary threshold you forgot to remove in prod; what about that?
Yes, you are right, that should not have moved to prod! It was a change that someone from REGIO wanted for a particular use case. I fixed it in dev; it will move to staging tonight...
Indeed, the change was made because Roberto was performing keyword-based analysis on our end. A lower limit in prod is fine; however, it would be useful to have an option (even if just in dev) that allows us (and mostly just us, not the general user) to use a higher limit when we need to do analysis. In that case response time is not super important either, as it is just us using it.
The problem is that currently this is a parameter that is hard-coded in the SPARQL endpoint and not super easy to fix. We can open an issue for that. Basically we would need to make a pull request to this repo, https://github.com/eclipse-rdf4j/rdf4j, which takes care of the query parsing and optimisation. A configurable limit could look something like the sketch below.
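Purely as an illustration of "configurable instead of hard-coded" (the property and class names here are hypothetical, not actual rdf4j API):

```java
// Hypothetical sketch: read the Lucene hit limit from a JVM system
// property, with the current value as the default, so a dev instance can
// raise it for analytical work without a rebuild or a prod change.
public final class FulltextConfig {

    private static final int DEFAULT_MAX_HITS = 5000;

    private FulltextConfig() {}

    public static int maxHits() {
        // e.g. start the dev endpoint with -Dendpoint.lucene.maxHits=50000
        return Integer.getInteger("endpoint.lucene.maxHits", DEFAULT_MAX_HITS);
    }
}
```

That would also cover Roberto's use case from the comment above: the higher limit lives only on the instance where the analysis runs.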
Closing this based on the feedback of Damir.
For instance: https://dev.kohesio.eu/projects?keywords=great
Is this linked to the performance issues of the semantic search? Annoying, because then the map is not representative of the overall situation...