TYPO3-Solr / solr-typo3-plugin

An Apache Solr filter plugin to support TYPO3 access restrictions.
http://www.typo3-solr.com
Apache License 2.0
7 stars 5 forks source link

Change typo3access into a postFilter #9

Closed ArbitraryCritter closed 7 years ago

ArbitraryCritter commented 7 years ago

This pull request shouldn't really be taken as a request to commit, atleast not in its current state, it's solr-typo3-plugin-1.7.1-SNAPSHOT.zip more to make everyone aware of this alternative approach to the typo3access module and its advantages.

This changeset is not ready for merge, in particular I haven't changed the test cases and many of them currently fail. But I don't want to put in the effort until I've had a bit of feedback from you about how you see the changes.

First, the background: I work for Systime (along with Thomas Hohn) and we've been using this plugin in ways that I believe are slightly different from your general case. I believe this plugin is generally used with very small indexes (cores) and mostly only for single installations. We've been putting all of our 300+ installations into a single core, with the result that our danish language core is now over 1GB in size, which has caused query times to jump to ~200-300ms when the cache is warm and much higher when the cache is cold. While not useless, this precludes us from some of our future projects, including various types of auto completion and as-you-type search. I started profiling solr and found that the vast majority of time was used in typo3access.

I found that the plugin basically iterates over every single entry in the index (its docvalues to be preceise) for every search which caused search performance to plummet and cpu usage to go through the roof. Looking at SOLRs design it became pretty clear what's going on, most SOLR filter queries, including typo3access are run in parallel with the query itself and when all elements have returned they are essentially AND'ed together to form the final result, so basically the slowest query always becomes the bottleneck, in our case typo3access was the slowest by a factor of 30+, which will only become worse as our index size grows.

The change: To a degree this is to be expected, we will always have to look at the docvalues, because that's how the module works obviously, but we don't necessarily have to do it for the entire index for every search. So what I've done is change the typo3access module into a SOLR postFilter. Basically as I mentioned, most SOLR filters and queries are run in parallel and then filtered based on the results, postFilters on the other hand are run after other queries and filter queries have run and then only have to process what they return. For most filter queries this is a bad approach, since they're generally quick, for typo3access this is usually great, since it causes us to process vastly less data.

For our case it has reduced average query times from to less than 10 ms on average. The worst case is still as bad as before (250 ms for a "*" search across all installations). Basically this changes the typo3access module execution time from depending on the size of the index to depending on the size of the processed data, which in most cases is much better.

Now, for some usecases this could be a regression. For instance, if you have an under 50 mb index the typo3access execution time might already be so low, that it would completel with the other filters... That said, I suspect the postFilter might still beat it slightly.

Implementation details: Basically to be a postFilter a module should:

1: Disable caching. 2: Set its cost to 100 or more.

(See Lucene API docs for details) https://lucene.apache.org/solr/6_3_0/solr-core/org/apache/solr/search/PostFilter.html

Also, a postFilter essentially works by overriding its DelegatingCollector as implemented. Other than that, you should find that the code actually hasn't changed much.

I have implemented this for 1.3 (solr 4) and 1.7 (solr 6), have attached binary builds for both (built with JDK 8). We're currently running the 1.3 implementation in production with the mentioned results. The 1.7 implementation has only had minimal testing and its Multivalue schema handling is completely untested.

solr-typo3-plugin-1.3.0.zip solr-typo3-plugin-1.7.1-SNAPSHOT.zip

irnnr commented 7 years ago

I believe I looked at post filters earlier when trying to update the plugin for newer Solr versions. I never made it happen though, but I think it'd be the right approach.

As long, of course, as the tests work.

ArbitraryCritter commented 7 years ago

Thanks for the input @irnnr. I've just committed javadoc and style changes that should address the issues you've mentioned.

Starting work on the test cases.

ArbitraryCritter commented 7 years ago

Fixed a pretty silly bug with Multivalue handling, that the test cases caught (any anyone who had tried to run it, would have quickly discovered too), that brings us down to 12 failing test cases, which are basically all the AcesssFilter testcases.

The reason they're failing is that PostFilters are a Solr feature but the TestCases only test against Lucene... So that won't work. To fix this, I need to rewrite the testcases to test against Solr.

Looking at the efforts required to rewrite the current stuff to be based on Solr, I think I might prefer to rewrite the testcases to use the SOLR developers own Junit extensions SolrTestCaseJ4.

https://lucene.apache.org/solr/6_3_0/solr-test-framework/org/apache/solr/SolrTestCaseJ4.html

Any comments? Advice? Complaints?... Death threats?

Addendum: I see you're already using SolrTestCaseJ4 in the AccessFilterQParser tests, so I guess you're fine with that :)

timohund commented 7 years ago

@RasmusWernerLarsen nice to see that you are spendig much effort on this. Do you think this is a problem in the 1.7.0 version when it's not used as a post filter?

For me using SolrTestCaseJ4 is fine.

ArbitraryCritter commented 7 years ago

@timohund well, the first thing I did when looking into the performance behavior of the 1.3 (Solr 4.10) branch, was to try it out with 1.7 (6.3), which turned out to have pretty much the same performance characteristics. I didn't try changing the access schema to a Multivalue field, but pretty much everything else.

It's possible the current (non PostFilter) approach could be improved with some from of caching, but I'm unsure how to go about it.

In any case, the PostFilter approach is designed for very "costly" filter queries, which seems to fit typo3access quite well.

ArbitraryCritter commented 7 years ago

Test cases should be good now...

I've removed 4 of them, that as far as I could see were implemented in both SolrTestCaseJ4 and the pure Lucene test cases. I sort of liked the Lucene testcases, they were nicely low-level, but they won't work with PostFilter.

I converted the rootline testcases into SolrTestCaseJ4 tests, they should be pretty much equivalent.

irnnr commented 7 years ago

sounds good, I'll take a look later

timohund commented 7 years ago

@irnnr Did you had a look on this? How could we proceed with this pr?

irnnr commented 7 years ago

@timohund once we merge this I think we should bump the version number to at least 1.8 or even 2.0. (Probably 2.0)

ArbitraryCritter commented 7 years ago

@irnnr - This should add the missing javadoc return statements, thanks for the review :)

irnnr commented 7 years ago

yeah, saw that :)

ArbitraryCritter commented 7 years ago

What are your thoughts on merging this? If you have any worries about it, might as well look into addressing that, while everything is still fresh in my mind.

timohund commented 7 years ago

@RasmusWernerLarsen Sorry for the delay. I'll try to try to have a look on this soon and i hope that we can integrate it into 6.1.

timohund commented 7 years ago

@RasmusWernerLarsen @irnnr Since ingo has approved this i'll merge this and as follow up i'll set the version to 2.0.0 goal would be to make it available in EXT:solr 6.1.0

@RasmusWernerLarsen if you want to help me, you can send me the required configuration changes in the solrconfig.

ArbitraryCritter commented 7 years ago

@timohund - There are no configuration changes required, just replace the module jar and it should work.

timohund commented 7 years ago

@RasmusWernerLarsen Ah perfect thx!