basho / yokozuna

Riak + Solr

Yokozuna Map-Reduce Input Very Slow [JIRA: RIAK-1700] #310

Open wbrown opened 10 years ago

wbrown commented 10 years ago

I've gotten Yokozuna's mapred_search inputs working in my setup. However, it is extraordinarily slow.

Doing a regular Solr-style search() yields records at rates of thousands per second. Running the same search streamed via map-reduce gives me, at best, dozens of records per second. An additional observation: when I hit deep-paging performance problems in search(), I see similarly low rates.

Is this an issue with how Yokozuna's mapred_search feeds search results in as inputs, or is the function asking Solr for the entire result set and hitting the large-result-set problem?
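For context, a minimal sketch of the two query paths being compared, assuming the Riak 2.x Python client (method names like fulltext_search and RiakMapReduce.search are my reading of that client's API, not wbrown's exact code):

```python
import riak

client = riak.RiakClient()

# Direct Solr-style search: thousands of records per second.
results = client.fulltext_search('my_index', 'name_s:*')

# The same query streamed through map-reduce inputs: dozens per second.
mr = riak.RiakMapReduce(client).search('my_index', 'name_s:*')
mr.map(['riak_kv_mapreduce', 'map_object_value'])  # built-in Erlang map phase
docs = mr.run()
```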

rzezeski commented 10 years ago

It's a combination of both. The Yokozuna map-reduce functionality was added for feature parity with legacy Riak Search, but it has not been benchmarked or tuned (it has a hardcoded page size of 10). The deep-paging issues with Solr don't help either. Hopefully, starting at the end of this week, there will be a period dedicated to performance testing. This would be a good issue to look into.
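To make the cost concrete: with plain start/rows paging, Solr must collect and sort start + rows documents for every page, so a fixed page size of 10 makes the total work grow roughly quadratically with the result set. A back-of-the-envelope model (my illustration, not Yokozuna code):

```python
def paging_cost(total_hits, page_size=10):
    """Requests issued and documents Solr must collect under start/rows paging."""
    starts = range(0, total_hits, page_size)
    requests = len(starts)
    docs_collected = sum(start + page_size for start in starts)
    return requests, docs_collected

# 10,000 hits at page size 10: 1,000 requests, ~5 million documents collected.
print(paging_cost(10_000))  # -> (1000, 5005000)
```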

sallespro commented 10 years ago

@wbrown: What do you mean when you say you've gotten Yokozuna's mapred_search inputs working in your setup? Can you please share the key details of how you did it?

@rzezeski: If this is answered by https://github.com/basho/yokozuna/issues/319, would the hardcoded page size limit result sets to batches of 10 items?

wbrown commented 10 years ago

@sallespro It's a Python setup, and the key is that I had to modify the map-reduce call in the Python client library to use yokozuna instead of riak_search.

Also, regarding #319, yes -- the batch size would be set to 10, dramatically slowing things down.
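Concretely, that amounts to swapping the module name in the inputs the client sends with the map-reduce job; roughly the following (the exact payload shape here is an assumption on my part):

```python
# Legacy Riak Search inputs, as the Python client generated them:
inputs_riak_search = {
    'module': 'riak_search',
    'function': 'mapred_search',
    'arg': ['my_bucket', 'name_s:foo'],
}

# Yokozuna's compatible entry point: same function, different module,
# and the first argument is a search index rather than a bucket.
inputs_yokozuna = {
    'module': 'yokozuna',
    'function': 'mapred_search',
    'arg': ['my_index', 'name_s:foo'],
}
```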

@rzezeski I wrote my own adaptive algorithm for that in Python (sketched below). I do an initial search of about 100 elements to get a count, and return immediately if the result set is smaller than 100. If it's larger than 100, I spin up a pool of workers and size their searches according to the total count.

I've generally been able to get 4,000 keys per second with this method, right up until I hit the deep-paging issue mentioned elsewhere.
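A minimal sketch of that adaptive fan-out, assuming the Python client's fulltext_search(index, query, start=..., rows=...) call and its num_found/docs response fields (wbrown's actual code isn't shown in the thread; the probe size and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

PROBE = 100  # initial probe size, doubling as the page size

def adaptive_search(client, index, query, workers=8):
    # Probe once: small result sets return immediately, and num_found
    # tells us how to size the fan-out for larger ones.
    first = client.fulltext_search(index, query, start=0, rows=PROBE)
    total, docs = first['num_found'], list(first['docs'])
    if total <= PROBE:
        return docs

    def fetch(start):
        return client.fulltext_search(index, query, start=start, rows=PROBE)['docs']

    # Fetch the remaining pages in parallel, sized from the known total.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page in pool.map(fetch, range(PROBE, total, PROBE)):
            docs.extend(page)
    return docs
```

The workers still issue plain start/rows queries, so each of them eventually runs into the same deep-paging wall described above.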

rzezeski commented 10 years ago

Moved to 2.0.1 because this doesn't absolutely have to get done for 2.0.0.

DSomogyi commented 9 years ago

Comment for Jira.