basho / yokozuna

Riak + Solr

Solr Deep Paging Support [JIRA: RIAK-1701] #309

Open wbrown opened 10 years ago

wbrown commented 10 years ago

I've been using Yokozuna to search through and retrieve large result sets, but it breaks down at around the 400-500K record mark due to Solr's issues with pagination.

Researching the issue, I stumbled across mentions of an efficient deep-paging patch.

http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Are there any plans to integrate this into Yokozuna once it makes it into a Solr release, or to use the patch independently?

rzezeski commented 10 years ago

Yes, the improved deep paging will be part of Solr 4.7. My hope is that it will come out soon so that it can be integrated into Riak 2.0. If it doesn't, then this support may have to wait a while longer. It's definitely on my radar and I would very much like it in for 2.0. I just can't promise anything at the moment.
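
For reference, here is a minimal sketch of how a client would drive Solr's cursor-based paging: send `cursorMark=*` on the first request, sort on something that ends with the unique key, and feed each response's `nextCursorMark` back in until it stops changing. The `fetch` callable and the `fake_fetch` stand-in are hypothetical helpers for illustration (a real client would issue an HTTP request), and whether a Yokozuna endpoint passes these parameters through to Solr is an assumption, not something confirmed in this thread.

```python
from typing import Callable, Dict, Iterator, List


def iterate_with_cursor(
    fetch: Callable[[Dict[str, str]], dict],
    query: str,
    sort: str = "_yz_id asc",  # the sort must include the unique key as a tie-breaker
    rows: int = 100,
) -> Iterator[dict]:
    """Yield every matching document by following nextCursorMark."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        body = fetch({
            "q": query,
            "sort": sort,
            "rows": str(rows),
            "cursorMark": cursor,
            "wt": "json",
        })
        yield from body["response"]["docs"]
        next_cursor = body["nextCursorMark"]
        if next_cursor == cursor:
            return  # an unchanged cursor means the result set is exhausted
        cursor = next_cursor


def fake_fetch(docs: List[dict]) -> Callable[[Dict[str, str]], dict]:
    """Hypothetical in-memory stand-in for Solr; uses plain offsets as cursor marks."""
    def fetch(params: Dict[str, str]) -> dict:
        mark = params["cursorMark"]
        offset = 0 if mark == "*" else int(mark)
        page = docs[offset:offset + int(params["rows"])]
        next_mark = str(offset + len(page)) if page else mark
        return {"response": {"docs": page}, "nextCursorMark": next_mark}
    return fetch


if __name__ == "__main__":
    docs = [{"_yz_id": f"default_b_k{i}_4"} for i in range(5)]
    for doc in iterate_with_cursor(fake_fetch(docs), "*:*", rows=2):
        print(doc["_yz_id"])
```

Real Solr cursor marks are opaque strings, so the offset trick in `fake_fetch` only exists to make the loop runnable without a server.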

wbrown commented 10 years ago

Thanks for the answer -- let me know if there's anything I can do to help out in this direction, as this is extremely relevant to my use case.

rzezeski commented 10 years ago

While working on upgrading to Solr 4.7.0 I discovered that the new cursor support and Yokozuna don't get along. In my benchmarks I observed both under and over counts. I believe this has to do with a combination of the cursor implementation and _yz_id.

The unique id is:

<type>_<bucket>_<key>_<logical_partition>[_<sibling>]

Every object will have N index replicas, each with a different <logical_partition> value (3 with the default N). Given an object on partitions 4, 5, and 6, if the first page of the query ends on _4 and the next page hits a different query coverage plan, it might see the same object again but under the id for partition _5. That id is lexicographically later, so the same object gets counted twice. This would explain the over count but not the under count. In my tests I was also sorting on score, and I wonder if that could have something to do with it?
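
The over count described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: a coverage plan is modelled as a hypothetical dict mapping each key to the single partition that answers for it, the cursor is just the last `_yz_id` seen, and results are sorted on `_yz_id` alone (no score). None of the function names below come from Yokozuna or Solr.

```python
from itertools import cycle
from typing import Dict, List


def yz_id(key: str, partition: int, type_: str = "default", bucket: str = "b") -> str:
    """Build an id shaped like <type>_<bucket>_<key>_<logical_partition>."""
    return f"{type_}_{bucket}_{key}_{partition}"


def page_after(ids: List[str], cursor: str, rows: int) -> List[str]:
    """Return the next `rows` ids that sort strictly after the cursor."""
    return [i for i in ids if cursor == "" or i > cursor][:rows]


def paged_ids(keys: List[str], plans: List[Dict[str, int]], rows: int) -> List[str]:
    """Page through all keys, switching to the next coverage plan on every page."""
    seen: List[str] = []
    cursor = ""  # empty string stands for "start of the result set"
    for plan in cycle(plans):
        visible = sorted(yz_id(k, plan[k]) for k in keys)
        page = page_after(visible, cursor, rows)
        if not page:
            return seen
        seen.extend(page)
        cursor = page[-1]


if __name__ == "__main__":
    keys = ["k1", "k2", "k3"]
    plan_a = {k: 4 for k in keys}  # every object answered by its partition-4 replica
    plan_b = {k: 5 for k in keys}  # every object answered by its partition-5 replica

    stable = paged_ids(keys, [plan_a], rows=1)
    mixed = paged_ids(keys, [plan_a, plan_b], rows=1)
    print(len(stable), "ids with one plan:", stable)
    print(len(mixed), "ids when plans alternate:", mixed)
```

With a single plan each object appears once; when the plan alternates between pages, an id ending in `_5` sorts after the cursor left at `_4`, so each object is returned a second time. This model does not produce an under count, which matches the observation that the id scheme alone does not explain it.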

Even if I ignored the over/under issues, I also found cursor-based pagination to be much slower for smaller result sets. I haven't yet tested larger result sets.

More time is needed to investigate the issues here. There may not be enough time to have it all sorted out before 2.0. My hope is to have Solr 4.7 in 2.0, but cursor-based paging may remain broken for a while. Even if it is fixed, the protocol buffers API will not support cursor-based paging because it does not support the needed fields.

DSomogyi commented 9 years ago

Comment for Jira.

zeeshanlakhani commented 9 years ago

More benchmarking must be done.

zeeshanlakhani commented 8 years ago

Per my discussion with @kesslerm, the next step would be to write a test that covers various query params and pages across many results and coverage plans.

suddenrushofsushi commented 8 years ago

Are there any plans to resurrect support for this? I too have a use case for deep paging in Solr on top of Riak.

zeeshanlakhani commented 8 years ago

@suddenrushofsushi yep, we are working on it.

mitchellwrosen commented 6 years ago

Any updates on this?

rzezeski commented 6 years ago

> Any updates on this?

If you didn't already hear about it, Basho went under. Luckily, all the Riak assets were purchased by Bet365 in late 2017. I believe there is work on a new Riak release but I'm not sure if anyone is putting any work into Yokozuna. I have been out of the loop for a long time, but my guess is not many people are inclined to maintain Yokozuna (for various reasons, all of which are moot). If I were in your shoes, I wouldn't hold my breath.