byzhang / terrastore

Automatically exported from code.google.com/p/terrastore
Other
0 stars 0 forks source link

Value-based Comparator #161

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
For applications that need to sort documents a variety of ways, consider 
implementing a ValueComparator interface.  It would be functionally similar to 
the existing Comparator, except it would evaluate values rather than keys.

Presently, in order to provide a sort-and-page user interface, I am pulling the 
full set of documents, sorting them on the client side, and then selecting the 
subset composing the requested page.

A ValueComparator would allow the servers to process a sort operation to be 
followed by a ranged Get.  This would reduce the network utilization of my 
sort-and-page UI.

Original issue reported on code.google.com by teonanac...@gmail.com on 21 Feb 2011 at 1:25

GoogleCodeExporter commented 9 years ago
I was thinking about this, and there fundamentally are two ways to implement it:

1) Provide a ValueComparator to be used *instead* of the key comparator for 
range selection: by doing so, we'd need to broadcast the range operation to 
*all* nodes, rather than a subset of nodes as we do now, because no node have 
knowledge of all data in the cluster.

2) Provide a ValueComparator to be used *after* the key comparator: that is, 
the new comparator would sort only those documents retrieved by the key-based 
range query, and as a consequence, the comparator would only affect sorting, 
not range querying.

That said, I think #1 would cause too much network traffic and probably 
unnecessary CPU work by all those nodes queried without actually having any 
document of interest. Indeed, #1 should be properly implemented with 
distributed indexes, but that's a completely different issue and right now we 
have ElasticSearch integration for that.

So I think #2 should be the way to go: would that work for you? Do you need any 
clarifications?

Other thoughts from the committers crew?

Original comment by sergio.b...@gmail.com on 21 Feb 2011 at 6:20

GoogleCodeExporter commented 9 years ago
My use case is to sort a list of documents by an essentially arbitrary 
attribute and then limit the results by an offset and count (user interface 
sorting and "paging").  If I understand your proposal correctly, I suspect that 
option #2 does not address my need because I would first need to constrain the 
space of documents to a key range.

That said, I'm interested in understanding option #2 better.  Can you explain 
it with a potential use case?

Original comment by teonanac...@gmail.com on 21 Feb 2011 at 6:31

GoogleCodeExporter commented 9 years ago
> I suspect that option #2 does not address my need because I would first need 
to constrain the space of documents to a key range.

Correct.

> That said, I'm interested in understanding option #2 better.  Can you explain 
it with a potential use case?

I was thinking your use case were to query for documents with a given id/key 
range (and optionally a predicate) and then sort by a given attribute.
In other words, the fundamental difference between the two approaches is what 
you query for: if you can query for the id, filter by predicate, and then do 
the sorting, option #2 fits it; otherwise, if you cannot constrain your key 
range, #2 doesn't fit.
In my experience, many use cases can actually put boundaries to key ranges by 
just arranging the algorithm as follows:
1) Query from startKey=1 with limit=10 and predicate=X.
2) Optionally sort (how doesn't really matter here).
3) Take the last returned key and repeat by using it as new startKey.
The trick is to advance the start key and use the predicate to select the 
attribute you're looking for (and possibly the one you want to sort).

Does that help?

Original comment by sergio.b...@gmail.com on 21 Feb 2011 at 6:45

GoogleCodeExporter commented 9 years ago
I'd just like to share the notion that this sounds like a completely different 
operation than range queries, which I assume that this is a proposed extension 
of?
That holds true especially in the absence of a key comparator in a query. 

As you already hinted at, Sergio, this is more of a search operation and for it 
to make sense it would have to use an existing index.

Original comment by johansso...@gmail.com on 21 Feb 2011 at 9:00

GoogleCodeExporter commented 9 years ago
> I'd just like to share the notion that this sounds like a completely 
different operation than range queries

It depends on how the user actually sees it, that's why I asked: is it a search 
by attribute, or a filtered range query sorted on the attribute?

Surely, a distributed index would probably be the best solution to the problem, 
but that's a whole different story :) 

Original comment by sergio.b...@gmail.com on 21 Feb 2011 at 9:08