MTG / freesound

The Freesound website
https://freesound.org
GNU Affero General Public License v3.0
312 stars 87 forks source link

Upgrade to pysolr #1414

Closed alastair closed 1 year ago

alastair commented 4 years ago

We currently have a custom solr library for search. In order for future maintainability and support in newer versions of solr we should replace this with pysolr: https://github.com/django-haystack/pysolr

We need to check if there are different ways of doing the queries that we have - e.g. dismax, facets, highlighting, pagination. Perhaps we can have a light wrapper around pysolr to make our common functionality easier.

xavierfav commented 3 years ago

I've been checking how feasible it would be to use pysolr. I managed to reproduce search queries quite easily. The only thing that I did not try, and don't really understand for now is the highlighthing.

ffont commented 3 years ago

Hi, highlighting is used only in forum search, not in sounds search. It is used to highlight words that match the query terms, eg: https://freesound.org/forum/forums-search/?forum=&advanced_search=0&q=test&dt_from=&dt_to=-

I think the steps to follow to adopt pysolr (and later upgrade solr) should be the following:

  1. Add a new wrapper to our custom solr library. We should try to keep this wrapper as generic as possible, I'd even avoid mentions to solr itself or very solr-specific concepts. We might want to call this SearchEngine. This SearchEngine should be capable of using different "backends" and will call methods on these. A "backend" could be a subclass of a SearchEngineBackend class which implements some abstract methods from the parent. These methods should mainly be "search", "add_to_index", "remove_from_index" (but there are other things that also use solr like the tagcloud I think so we should think about this well).
  2. Implement a "backend" for SearchEngine which uses our current custom solr library (Solr45CustomBackend?). At this point, we can test that results obtained using the new wrapper are exactly the same as before making any code changes.
  3. Implement a new "backend" for SearchEngine which uses pysolr, but still with solr 4.5 (Solr45PysolrBackend). Test that results with Solr45CustomBackend and Solr45PysolrBackend are exactly the same.
  4. Setup a new solr instance based on the latest version of solr which has no backwards incompatible changes with respect to 4.5. Implement a backend for that version of solr (eg Solr5Backend). Test that results are exactly the same.
  5. Setup a new solr instance based on solr 8 and implement a backend for it (eg Solr8Backend).

These steps will take time, but that is fine. I'd suggest do 1-3 and then merge and deploy. Then 4. should be also relatively easy and I guess 5 will be more complex because we'll need to decide how to address incompatible changes if any.

I wonder if before starting any of this we should try to merge some of the changes in the clustering branch because I think there are significant changes in the search code...

ffont commented 3 years ago

Another radical idea I just had.... re-implement forum search using Postgres instead of Solr so in solr we only have the sound index and don't need to worry about other stuff. I think it should be doable because Postgres supports text search but I have no idea how complicated it should be. If we did that we could get rid of some code we have to synchronise solr forum index with actual forum contents.

xavierfav commented 3 years ago

Thanks for detailing the steps. It makes sense!

For replacing the solr with postgres for the forum, I don't know right now, since I have not looked much into the forum functionalities. Also, if the highlighting is something that is used for the forum, we will have to re-implement somehow I guess.

ffont commented 3 years ago

Also, if the highlighting is something that is used for the forum, we will have to re-implement somehow I guess.

Yes, I'm pretty sure it is doable and easy with postgres. However if highlighting is a problem we could consider not doing it anymore.

xavierfav commented 3 years ago

There are things in the search views that are related to how we process the query that is related to Solr, and it is currently defined in different classes implemented in the utils/search/solr.py file. I am not sure we can really abstract that part, and we will have to keep some things in the view file related to the query object manipulation. Shall I in any case have these methods in the wrapper instead of the solr.py file? Since things related to the query will be used for both our first "backends" (without and with pysolr). Actually, it is hard to think it as different backend, cause it is actually the same thing, pysolr just gives us some "easier" ways to send requests to Solr. Maybe it is really a better idea to merge things from the clustering branch before doing these changes, as you already suggested... We can maybe continue this dicussion somewhere else.

ffont commented 3 years ago

Sometimes the level of abstraction will be difficult. I guess things could be discussed on a case-by-case basis so you might want to mention an example here. But in general I'd try to think of it as that solr-related things are only in the backends. No problem if two solr backends are very similar because these are there only to test. We'll eventually have one single backend in use. If in the future for example we consider moving away from solr, then we'd need to implement a new backend, and that would certianly be very different than solr's one.

About merging clustering code yes I think it should be done. I think the steps would be:

  1. Merge master to clustering branch -> there will be conflicts I think, we might need to look at this carefully. If you want we could go through that together tomorrow at uni.
  2. Make necessary changes in clustering branch to freesound works normally if no clustering services are available (so we can continue working on freesound while we don't have clustering working on production)
  3. Merge clustering PR with master
xavierfav commented 3 years ago

Alright. It just feels that I will be doing a wrapper inside a wrapper for the search engine and backends. Anyway, I will do something basic and then we can discuss on it.

For the clustering, I will work on it now, and will repport to you at the Uni tomorrow what is the state.

ffont commented 1 year ago

I'm closing this issue as it was addressed with recent changes on search engine code and solr verisons.