basho / yokozuna

Riak + Solr

Add warning about the number of fields in default_schema.xml [JIRA: RIAK-3219] #719

Open Vorticity-Flux opened 7 years ago

Vorticity-Flux commented 7 years ago

We have hit an issue with Yokozuna/Solr that was quite hard to diagnose. I am creating this issue as a heads-up to anyone who may hit the same brick wall.

I propose adding a warning to default_schema.xml along these lines: Be very careful with dynamicFields in the Solr schema. Try to limit the number of fields in your index to a few thousand (<50k). An extremely large number of fields in a single Solr core is known to degrade performance. Under no circumstances should you create new fields for each indexed document (and you can do this by accident!). Your Solr performance will degrade over time, and you will not receive any warning or explanation.
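To illustrate how this happens by accident, here is a sketch (the `*_t` pattern is just an example in the spirit of the dynamicFields shipped in default_schema.xml; the field names are made up):

```xml
<!-- A catch-all dynamicField creates a brand-new Lucene field for every
     distinct field name it matches: -->
<dynamicField name="*_t" type="text_general" indexed="true" stored="false"/>

<!-- Safe: a bounded set of names shared by all documents:
       title_t, notes_t, address_t

     Dangerous: names derived from per-document data:
       score_20170101_t, score_20170102_t, session_ab12cd_t, ...
     Each new name becomes a permanent entry in the core's field metadata,
     so the field count grows without bound as documents are written. -->
```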

Symptom: Solr is really, really slow and struggles to keep up with KV even under a very light load. Riak version 2.2.0. Queries are fast, updates are slow. The index is very modest, with 650k documents in it. Each document has about 60 fields. Total index size on disk is ~450 MB. Solr is not under memory pressure; GC pauses, JVM heap and OS cache are all fine.

Frequent timeouts for Solr update requests are observed in the Riak console.log: `yz_solrq_helper:send_solr_ops_for_entries:301 Updating a batch of Solr operations failed for index <<"index">> with error {error,{other,{error,req_timedout}}}`.

solr.log is full of exceptions of two types:

- java.lang.IllegalStateException: Committed
- org.eclipse.jetty.io.EofException "Committed before 500"

These indicate that Riak is closing connections before request processing is complete.

Detailed investigation showed that only AAE updates are slow, taking up to 10 seconds(!) each; regular updates are all under 5 ms. AAE updates use deleteByQuery requests, and deleteByQuery is slow even when issued directly against Solr (not via Riak).
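For anyone who wants to reproduce the measurement, a rough sketch (the port, core name, document id and partition query are assumptions; Yokozuna's internal Solr normally listens on 8093):

```sh
# Time a deleteByQuery issued directly against Yokozuna's internal Solr
# (assumed: localhost:8093, core "index", a partition query like AAE's).
time curl -s 'http://localhost:8093/internal_solr/index/update' \
  -H 'Content-Type: application/xml' \
  -d '<delete><query>_yz_pn:1</query></delete>'

# Compare with a plain delete-by-id, which stays fast:
time curl -s 'http://localhost:8093/internal_solr/index/update' \
  -H 'Content-Type: application/xml' \
  -d '<delete><id>some_doc_id</id></delete>'
```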

A yz_solr:partition_list(Core) call was taking 10 seconds for this core as well. Setting facet.method=enum resolves this particular symptom (which deserves a separate discussion).
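As far as I can tell, partition_list amounts to a facet query over the _yz_pn field, so the slowness can be reproduced without Riak in the loop; an approximation (endpoint and parameters are assumptions):

```sh
# Roughly what yz_solr:partition_list/1 asks Solr: which partition numbers
# are present in the core, via a facet on _yz_pn. facet.method=enum avoids
# the slow default (fc) method on this core.
curl -s 'http://localhost:8093/internal_solr/index/select?q=*:*&rows=0&wt=json&facet=true&facet.field=_yz_pn&facet.mincount=1&facet.method=enum'
```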

Further investigation showed that deleteByQuery requests are slow because they can't be performed concurrently with Solr commits (even the soft ones), and for some reason this Solr core was in a state of almost constant commit (with default commit settings). A new searcher was taking 10 seconds to open. Further JVM profiling and tracing showed that Solr was constantly building huge trees of FieldInfo structures in memory. The problem became apparent when the contents of the *.fnm index files were examined: this core contained about 2M unique field names. Solr and Lucene are not designed to work with data sets like that (at least not with Near Real Time index updates).
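The commit cadence lives in solrconfig.xml; a sketch of the knobs involved (the values here are illustrative, not recommendations):

```xml
<!-- solrconfig.xml: auto-commit settings. Every soft commit opens a new
     searcher, and in our case each open took ~10 seconds because of the
     millions of FieldInfo entries it had to rebuild. -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit (flush to disk), ms -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>10000</maxTime>            <!-- soft commit (new searcher), ms -->
</autoSoftCommit>
```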

As a rule of thumb, the total size of the *.fnm files in the index directory should be well under a few MB. In our case it was 140 MB.
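A quick way to check (the path is an assumption; Yokozuna index data usually sits under the Riak platform data dir, adjust for your layout):

```sh
# Total size of the Lucene field-info (*.fnm) files for one Yokozuna index.
find /var/lib/riak/yz/index/data/index -name '*.fnm' -print0 \
  | du -ch --files0-from=- | tail -1
```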

As a short-term fix we were able to stabilize this core by reducing the commit frequency, applying very heavy AAE throttling and setting the default facet.method to enum.
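For reference, the AAE throttle is tuned in riak.conf; a sketch of what 'very heavy throttling' might look like (key names as I understand them in 2.x, values purely illustrative — check your version's schema):

```
# riak.conf: make the AAE throttle engage earlier and back off harder.
anti_entropy.throttle = on
anti_entropy.throttle.tier1.mailbox_size = 0
anti_entropy.throttle.tier1.delay = 10ms
anti_entropy.throttle.tier2.mailbox_size = 50
anti_entropy.throttle.tier2.delay = 500ms
```

The default facet.method can likewise be set per request handler in solrconfig.xml (a `<str name="facet.method">enum</str>` entry in the handler's defaults).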

Solution: change the application logic and rebuild the index without using dynamic field names.
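To make the fix concrete, a hypothetical before/after of the document shape (all field names invented; the comments are annotations, not valid JSON):

```
// Before: the field *name* carries data, so every distinct entity ever
// written adds another field to the core:
{ "score_user_12345_i": 10, "score_user_67890_i": 7 }

// After: a fixed, bounded set of fields; the data moves into field values,
// one document per entity instead of one field per entity:
{ "entity_id_s": "user_12345", "score_i": 10 }
```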

References:

- https://issues.apache.org/jira/browse/SOLR-10014
- https://issues.apache.org/jira/browse/LUCENE-7648

fadushin commented 7 years ago

Interesting, because we did a ton of work in 2.2.0 to remove deleteByQuery. But we could only optimize the path where we had a previous object (normal writes), and as a result we could do delete by id, which is a lot faster.
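For context, the difference at the level of Solr update messages (a sketch; the id format and values are illustrative — Yokozuna keys documents by _yz_id and stores the Riak bucket/key in _yz_rb/_yz_rk):

```xml
<!-- Delete by id: a cheap uniqueKey lookup, possible only when the exact
     document id is known (e.g. derived from the previous object on a write). -->
<delete><id>1*default*bucket*key*42</id></delete>

<!-- Delete by query: Solr must evaluate a query, and it cannot run
     concurrently with commits -- the expensive path AAE still uses. -->
<delete><query>_yz_rb:bucket AND _yz_rk:key</query></delete>
```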

Vorticity-Flux commented 7 years ago

Yes, I saw that. Removing deleteByQuery from normal writes is a huge improvement in 2.2.0! I was so puzzled by the 2.2.0 release notes that I had to dig into the yz source to understand that deleteByQuery does still happen in the AAE case (after all, I saw it in the logs).

AAE only seriously kicks in once Solr starts to fall behind and update timeouts occur... So in our case the heavy deleteByQuery AAE requests were just the last straw that 'killed' a Solr that was already struggling with an insane index. It's a kind of snowball effect: the more writes time out, the more AAE requests pile up...

Index throttling helps, but maybe it would make sense to have separate queue/throttling parameters for AAE and normal Solr index requests, since the cost of a single operation can be vastly different? Also, normal operations should have priority over AAE (i.e. the queue can be full for AAE while normal requests can still be added).
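For reference, the existing riak.conf knobs apply to one shared solrq path for both kinds of traffic, which is what makes a split attractive; the names as I understand them in 2.2 (values illustrative):

```
# riak.conf: solrq batching and backpressure, shared by AAE and normal updates.
search.queue.batch.minimum = 10
search.queue.batch.maximum = 500
search.queue.batch.flush_interval = 500ms
search.queue.high_watermark = 1000
search.queue.high_watermark.purge_strategy = purge_one
```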

imaimai86 commented 5 years ago

Any updates on this?

russelldb commented 5 years ago

I don't know if anyone is actively maintaining yokozuna. Basho ceased trading mid-2017 and bet365 (ping @martincox) took over ownership of the assets. There have been 2 releases of Riak since then, but none of the principal parties have done any work on yokozuna in that period. Maybe it's time yokozuna was given its own org and handed over to the community?