Kinetic / kinetic-py

Kinetic Python Library
http://seagate.github.io/kinetic-py
22 stars 10 forks source link

getRange can't handle None as end key. #10

Closed toolslive closed 10 years ago

toolslive commented 10 years ago
x = client.getRange("", None, True, True)

results in an exception due to parameter validation (the irony ;))

[https://github.com/Seagate/kinetic-py/blob/develop/kinetic/operations.py#L177]

You need to be able to specify an open range (or do I really have to use "\0xff"*4096 )?

icorderi commented 10 years ago

Hey @toolslive, it's an interesting scenario. Having clients hard code the end of range multiplication wouldn't be nice.

What do you think about the following behavior:

x = client.getRange() 
x = client.getRange(None, 'foo') 
x = client.getRange('foo', None) 

getKeyRange would provide the same contract

Although to be honest, I find myself doing prefix queries a lot more than open ended ranges.

toolslive commented 10 years ago

You don't need to change the startKey semantics since there exists a string that is smaller than or equal to (<=) all other strings: "" (it's sometimes called epsilon). The endKey however needs to have some encoding to represent the string that is larger than or equal to all other strings (it's sometimes called omega).

Another problem is the semantics of the inclusive flags when one of these parameters is None. Do you include the first key you find or not?

If people are going to do some consistent hashing scheme over a set of drives, they will have to somehow be able to denote the interval [a,omega[ .

I do have a question about stability of the API. Is the python API considered stable or are you still in a position to modify things ? (Also, how do they solve this particular problem in the C++ API fe?)

icorderi commented 10 years ago

The None on start is for completeness, it is essentially the same as '' with inclusive=True. Semantically I think whenever None is present, given that it means "from the very first key" or "to the very last key", it is theoretically exclusive but practically inclusive given that the library will be translating it to '\xFF'*MAX_KEY_SIZE.

We can have an entire side conversation on key schemas and how to avoid that [a, omega). Sufficient to say, your kinetic key should not be just the hash.

The library is converging into stability, a few things will move around before 1.0.0 for sure.

I haven't used the C++ library, you might want to ask them for specifics. My guess is that you have to calculate the real value of omega, and is the same things for the java library.

Consider adding /data/ at the beginning of your data blocks.

jphughes commented 10 years ago

Let me chime in...

The use of '' as epsilon for the last possible key is a convienence. Imagine an epsilon equal '' as a special case where '' is exclusive with an implicit wrap. Does that make sense?

Yes, the API is stabilizing. We are eager to hear of things you would want to improve.

toolslive commented 10 years ago

I'm not sure where the API is defined, but I took a look at the protocol buffer definition: [https://github.com/Seagate/kinetic-protocol/blob/master/kinetic.proto#L250-L251]

So at least on protocol level, both startKey and endKey are optional, and thus I would expect the python API to allow None as a valid parameter and not put that parameter in the protocol buffer message sent to the drive/simulator. The semantics would then completely determined by the drive/simulator and not a superimposed idiosyncrasy by the python API.

Regarding the suggestion to add /data/ . Yes: that's a workaround, practically making /data0 a key that's larger than anything in range.

Regarding the ''. I don't think you're doing yourself any favours equating '' and None.

markw- commented 10 years ago

Hi,

I wrote myself a one line utility function to get all the keys in a namespace as it seemed a useful thing to be able to do. Calling getNext repeatedly would not be good, even if there was a "getFirst" method.

I can imagine that being able to get all valid keys from a namespace is going to be a useful thing to do for object deduplication in some Kinetic implementations. It's the first thing to springs to mind for me - object dedupe is useful, especially as block dedupe is a thing of the past with object disks.

Just my 2p.

Cheers Mark

jphughes commented 10 years ago

Hi Mark:

I agree that calling getNext is bad. Calling getRange (which returns up to 200 keys between some start (such as "") and some end. Can you share you code?

About the comment that the fields in the .proto definition are marked as "optional". This is regarding the compilation of the .proto and not the fact that a field is always considered optional. Some fields, like HMAC are marked "optional" but if they are missing, the code that processes the messages will return an error. The comments that are in the .proto file are expected to make this clear.

Sometimes there are optional fields, but in that case it needs to be described.

For more information about using "optional" on google protocol buffers, please see the section "Specifying Field Rules" at https://developers.google.com/protocol-buffers/docs/proto

for more information about the use of the fields in the .proto file, please see the read me at https://github.com/Seagate/kinetic-protocol/blob/master/README.md

I hope this helps.

markw- commented 10 years ago

Hi Jim,

Sure - but it might not be doing what I think it should (and usually I find code snippits like this are redundant when I've learned more about the system I'm investigating).

from Kinetic import common

def getAllKeys(disk):

return(disk.getKeyRange(startKey=b'x00',startKeyInclusive=True,endKey=b'xff'*common.MAX_KEY_SIZE,endKeyInclusive=True))

I don't think it was me that mentioned things like HMAC are marked optional - last week I read the protobuf doco from top to bottom (hard but interesting) but I still have not had time to dig into the Kinetic .proto file itself.

Cheers

Mark

On 2014-07-09 21:41, James Hughes wrote:

Hi Mark:

I agree that calling getNext is bad. Calling getRange (which returns up to 200 keys between some start (such as "") and some end. Can you share you code?

About the comment that the fields in the .proto definition are marked as "optional". This is regarding the compilation of the .proto and not the fact that a field is always considered optional. Some fields, like HMAC are marked "optional" but if they are missing, the code that processes the messages will return an error. The comments that are in the .proto file are expected to make this clear.

Sometimes there are optional fields, but in that case it needs to be described.

For more information about using "optional" on google protocol buffers, please see the section "Specifying Field Rules" at https://developers.google.com/protocol-buffers/docs/proto [1]

for more information about the use of the fields in the .proto file, please see the read me at https://github.com/Seagate/kinetic-protocol/blob/master/README.md [2]

I hope this helps.

Reply to this email directly or view it on GitHub [3].

Links:

[1] https://developers.google.com/protocol-buffers/docs/proto [2] https://github.com/Seagate/kinetic-protocol/blob/master/README.md [3] https://github.com/Seagate/kinetic-py/issues/10#issuecomment-48524192

toolslive commented 10 years ago

The use case most people need is paging. So they will call getRange, list the items, and then would like to continue for the next batch. Most storage systems will actually return a list of items and some kind of continuation key, that can be used as the next from key. So they can live with it that the server side decides how big the batch is, but they would like to know what the next query is they need to do.

icorderi commented 10 years ago

Added a walk example on the examples repo.

@toolslive the continuation key is essentially the last key received on the previous batch, make sure the next call to getKeyRange has an exclusive start so you don't get that key twice. You can find this at lines 46 to 47 on the example.

I'm considering moving this few lines to the client code, which will essentially allow you to pass None as start/end keys to the underlaying getKeyRange.

icorderi commented 10 years ago

Added support for None keys. Commit 2c549b5bd3d435a822fd06d283643e2a5ba33940