collective / collective.solr

Solr search engine integration for Plone
https://pypi.org/project/collective.solr/

Replace default XML responses with JSON #348

Open gforcada opened 1 year ago

gforcada commented 1 year ago

For quite a few major releases now, Solr has allowed you to specify in which format you want to receive the results.

collective.solr always asks for XML, and it has a quite involved parser for it.

At work we noticed that we get notifications about slow Solr requests. Looking into it, it turns out it is not that Solr (the server) is slow to send the response, but rather that collective.solr takes quite some time to process the received XML response, and the notification we get is fired only after the response has been processed.

Getting JSON responses might be much more straightforward to process, or even Solr's Python response format, which returns a dictionary-like structure. Most probably the JSON version is faster to parse; we should get numbers... 🤷🏾

Would it be an option to either change it completely, or to allow specifying/configuring in which format one wants to get the responses?
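
To make it concrete, this is roughly what the format switch amounts to on the Solr side, as a minimal sketch (not collective.solr code): the Solr URL, core name, query and field names below are made up, and the only relevant knob is the wt request parameter.

```python
# Minimal sketch: asking Solr for JSON instead of XML is just the "wt"
# request parameter. URL, core name and field names here are illustrative.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/plone/select"  # assumed core name

params = {
    "q": "Title:news",
    "rows": 10,
    "wt": "json",  # "xml" is what collective.solr currently requests
}
with urlopen(f"{SOLR_SELECT}?{urlencode(params)}") as response:
    data = json.loads(response.read())

# The JSON response mirrors the XML one: numFound, start and the docs list.
for doc in data["response"]["docs"]:
    print(doc.get("UID"), doc.get("Title"))
```

The JSON payload carries the same information as the XML one, so in principle only the parsing layer would change.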

tisto commented 1 year ago

@gforcada I'd be open to refactoring c.solr to use JSON instead of XML in general. I went a bit in that direction and I'd like to share my findings.

Because Solr queries were too slow (seconds instead of milliseconds) in c.solr, I started at some point to write an endpoint that does a raw Solr query and returns the raw Solr results:

https://github.com/kitconcept/kitconcept.solr/blob/0a12c7116a0041609f7b5d78d9f4cb90924bde6c/src/kitconcept/solr/services/solr.py#L71

This came out of a longer discussion with the 4tw folks, who went in a similar direction with ftw.solr.

When I compared the performance of this raw Solr approach, I figured out that as soon as I started converting the results (in Python), things became slow. I did not investigate this further and did not do any performance measurements.

I was after a raw Solr query anyway, because I was tired of not being able to use Solr directly and of relying on abstraction layers. Therefore my gut feeling would be that moving from XML to JSON won't give us a significant boost. Though, I could be mistaken.

At kitconcept we are still evaluating different approaches; this is why we created kitconcept.solr. collective.solr does a lot more, and a raw Solr query does not really fit the expectations people might have of collective.solr.

In any case, if you get some profiling data on this topic, I'd love to see it.

davisagli commented 1 year ago

@tisto There are a number of different libraries available for deserializing JSON in Python, and it is probably worth a bit of investigation to profile the different options. The naive approach in Python of creating an empty list and then adding items to it as you parse the JSON is bound to be slow, because Python has to keep reallocating the memory available for the list.
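
For what it is worth, getting rough numbers for that could look something like the sketch below; response.json is assumed to be a captured raw Solr response, and orjson is only one example of a third-party parser, not a recommendation.

```python
# Rough benchmark sketch: compare JSON parsers on a captured Solr response.
# "response.json" is assumed to be a saved raw Solr result; orjson is just
# one example of an alternative parser and may not be installed.
import json
import timeit

with open("response.json", "rb") as f:
    raw = f.read()

print("stdlib json:", timeit.timeit(lambda: json.loads(raw), number=100))

try:
    import orjson
except ImportError:
    orjson = None

if orjson is not None:
    print("orjson:", timeit.timeit(lambda: orjson.loads(raw), number=100))
```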

In the case of the view you linked, it looks like it is pretty close to not needing to deserialize the raw results in Python at all. If that could be avoided (by just embedding the serialized string from Solr in the response), it would save quite a bit of effort.
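
Sketched out, that pass-through idea looks roughly like this; the function names and the view wiring are assumptions, the point being only that the body from Solr never goes through json.loads/json.dumps on the way out.

```python
# Sketch of passing the Solr response through untouched: fetch the JSON body
# as bytes and hand it to the client as-is, so Python never deserializes it.
# solr_select_url and the view/service wiring are assumptions for illustration.
from urllib.request import urlopen
from urllib.parse import urlencode


def raw_solr_search(solr_select_url, query, rows=10):
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    with urlopen(f"{solr_select_url}?{params}") as solr_response:
        return solr_response.read()  # bytes that are already valid JSON


def search_view(request, response, solr_select_url):
    body = raw_solr_search(solr_select_url, request.get("q", "*:*"))
    # In a Zope/Plone view one would set the content type and return the
    # bytes directly instead of deserializing and re-serializing them.
    response.setHeader("Content-Type", "application/json")
    return body
```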

gforcada commented 1 year ago

What I remember, though it is from a few years ago, is that when analyzing where the time was spent processing the XML responses from Solr, it was on converting strings to DateTime objects.

One option would be to make everything lazy (though that probably complicates things 😅): parsing a response would just mean converting the data to JSON and handing out a mock interface, and only when a brain is accessed would we convert the JSON data into a proper brain...
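
As a rough sketch of that lazy idea (the class, field names and caching are made up, and DateTime here is Zope's DateTime as used by catalog brains):

```python
# Sketch of a lazy "brain": keep the raw dict from the JSON response and only
# convert values (e.g. strings to DateTime) when an attribute is accessed.
# The class and DATE_FIELDS are illustrative, not collective.solr internals.
from DateTime import DateTime  # Zope's DateTime, as used by catalog brains

DATE_FIELDS = {"created", "modified", "effective"}  # assumed date fields


class LazyBrain:
    def __init__(self, doc):
        self._doc = doc        # raw dict straight from the parsed JSON
        self._converted = {}   # cache of values converted on first access

    def __getattr__(self, name):
        if name in self._converted:
            return self._converted[name]
        try:
            value = self._doc[name]
        except KeyError:
            raise AttributeError(name)
        if name in DATE_FIELDS and isinstance(value, str):
            value = DateTime(value)  # pay the conversion cost only here
        self._converted[name] = value
        return value
```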

tisto commented 1 year ago

@davisagli yeah. That was the whole idea of this "raw Solr" approach. I was shocked at how much slower things became when I tried to mess around with the response. The Solr response format is very well documented and can be transformed into whatever is necessary on the front end. Personally, I think this is the way to go in the future.

@gforcada thanks for sharing this finding. This makes a lot of sense. I guess in the end it all depends on whether you are using Python or JavaScript to render your results. In a "Classic" environment it makes sense to optimize in Python; in a JS-frontend environment I think it makes a lot of sense to let go of the transformation in the backend and just rely on the frontend (where you can lazy load or render things as well).

Of course, there is nothing that prevents you from using the REST-API-based approach in Classic as well. :)