SolidLabResearch / Challenges

24 stars 0 forks source link

Performant search #48

Open NoelDeMartin opened 2 years ago

NoelDeMartin commented 2 years ago

Pitch

Searching through a user's data is a common use-case to be found in many apps. For example, if you have a list of TODOs you may want to search some with a specific tag, or including certain words.

One way to solve this is to keep a local copy of the documents and use that for searches (or a local index). This works well in many situations, and actually goes well with an offline-first approach. But it's very cumbersome to work with, among other things because a proper implementation of this needs to be kept up to date with any modifications that are done server-side (using websockets, for example).

There are also some situations where this offline-first approach is not feasible, or desirable. And in those situations, searching across the entire POD does not scale. One way to solve this would be to create indexes in the POD itself, but that's not interoperable because you can't rely on other applications updating these indexes when the data is modified. So in reality you have the same problem as with local indexes, but adding an additional layer of complexity.

Desired solution

It'd be ideal if there is a way to query an entire POD to get only filtered results. I'm not sure if this is feasible without adding some features to the protocol itself, though.

At the very least, if that isn't possible, it would be nice to have some examples of the better ways to go about it. Indicating the trade-offs of different solutions.

Acceptance criteria

Given that the following dataset with more than 1000 movies is in a POD: top-imdb-movies.zip

I would like to see a proof of concept with an app that searches content across the entire POD, and is performant enough to have a decent UX. For example, searching text inside of the schema:name property which is the movie title.

We can define "decent UX" as a couple of seconds for the initial search, and around 200ms for subsequent refinements.

For bonus points, a more realistic search would normalize the text to search for. For example, a movie called "Fantastic Mr. Fox" should be findable by typing "fantastic mr fox".

Pointers

rubensworks commented 2 years ago

Using SPARQL would also be a solution, but I'm not aware of any way to run a query on an entire POD, instead of a single document.

Another pointer here is Comunica's WIP link traversal functionality: https://comunica.dev/research/link_traversal/

I would like to see a proof of concept with an app that searches content across an entire POD, and is performant enough to have a decent UX. For example, searching on any rdfs:label triple.

This can be part of the solution: https://comunica.github.io/comunica-feature-link-traversal-web-clients/builds/solid-default/#transientDatasources=https%3A%2F%2Fruben.verborgh.org%2Fprofile%2F&query=PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20schema%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fsubject%20rdfs%3Alabel%20%22Ruben%20Verborgh%22%40en.%0A%7D Finds a resource by rdfs:label across 310 documents in less than a second.

RubenVerborgh commented 2 years ago

Cool, would need to be applied to a use case so we can make it concrete.

NoelDeMartin commented 2 years ago

What do I need to specify to make this more concrete? Would it be fine if I provide an example dataset and a use-case for how to search it? (e.g. I could provide a file with N turtle files describing movies and say that they should be searched by name).

The part I'm most hesitant about is the acceptance criteria, I'm not sure how to codify "decent UX" into specific requirements. I could come up with an arbitrary number of seconds, but I'm not sure that'd be useful (and things like network latency could affect the outcome). Maybe it's enough just to say that the dataset has to be searchable by a given field, and the challenge is to make it as fast as possible?

RubenVerborgh commented 2 years ago

What do I need to specify to make this more concrete? Would it be fine if I provide an example dataset and a use-case for how to search it?

Yes, that could be a good case indeed.

Maybe it's enough just to say that the dataset has to be searchable by a given field, and the challenge is to make it as fast as possible?

I think specifying "a couple of seconds" for the initial search, and perhaps "around 200ms" for subsequent refinements, would be good and specific goals.

pheyvaer commented 1 year ago

@NoelDeMartin Can you incorporate the necessary changes mentioned by Ruben? Thanks!

NoelDeMartin commented 1 year ago

@pheyvaer Yes I had this on my TODO list but I was focusing on other things. I'll get back to it and try to do it sooner, thanks for the ping.

NoelDeMartin commented 1 year ago

Ok, I have updated the issue description including a dataset and specifying the acceptance criteria, let me know if something else is missing.

RubenVerborgh commented 1 year ago

Thanks, @NoelDeMartin!

NoelDeMartin commented 1 year ago

Hey @rubensworks, I'm following up on this because I've recently seen more interest in the topic.

Another pointer here is Comunica's WIP link traversal functionality: https://comunica.dev/research/link_traversal/

This can be part of the solution: https://comunica.github.io/comunica-feature-link-traversal-web-clients/builds/solid-default/#transientDatasources=https%3A%2F%2Fruben.verborgh.org%2Fprofile%2F&query=PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20schema%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fsubject%20rdfs%3Alabel%20%22Ruben%20Verborgh%22%40en.%0A%7D Finds a resource by rdfs:label across 310 documents in less than a second.

From a developer experience point of view, I think this is great. Because it does solve the issue. You write a SPARQL query, and it "just works". However, I have some doubts about UX and performance.

When it comes to UX, it's true you get a result in less than a second (depends on the network though). But the search isn't finished until 26 seconds. When I search something, I want to see the most relevant results first, not the ones that happen to appear in the graph traversal first. So I think that's a UX problem, because if I'm building an app I'll probably wait until the complete search is finished before sorting and presenting the results. Sure, I could do that as the results come in, but I think that's a very bad and annoying UX. I'm sure you've found that situation where you're going to click a link in a list but it suddenly changes under you and you click something else by mistake.

Also, when it comes to performance, I looked at the network tab on the example you shared and it executed 632 requests, with 27.92MB data downloaded. Which I guess is just what happens for a query that can potentially traverse the entire POD. So that means that you have to either optimize the query or have an awful performance and data consumption. Which kind of defeats the purpose to begin with, you could just do the requests by hand.

I understand why this happens, though, given the limitations of the protocol in its current form. That's why I mentioned in my initial Pitch that I'm not sure this can be improved without adding some features in the protocol. In the end, this link traversal approach is good to simplify writing the code, but it doesn't really solve the issue at its core. That's why for my apps I'm still opting for an offline-first approach where I download everything and then do searches locally. But not all use-cases can follow this approach.

rubensworks commented 1 year ago

Hi @NoelDeMartin, thanks for your input! I follow all of the issues you raise, and these correspond to some of the future work we have planned in this area.

In summary: