cursor 'goto' functions return garbage for buffer based keys.

russtee commented 7 years ago

On testing the new key as buffer functionally I've noticed that in the case of a buffer key the cursor 'goto' functions return a garbage string when I was expecting a buffer.

Some test results showing this behaviour:

test putString
putString: k='a' (string), v='hello' (string)
putString: k='b' (object), v='goodbye' (string)
putString: k='c' (string), v='sayonara' (string)

test getString
getString: k='a' (string), v='hello' (string)
getString: k='b' (object), v='goodbye' (string)
getString: k='c' (string), v='sayonara' (string)

test cursor
goToFirst returned 'a' (string), getCurrentString returned 'hello' (string)
goToNext returned '杢漀漀搀戀礀攀' (string), getCurrentString returned 'goodbye' (string)
goToNext returned 'c' (string), getCurrentString returned 'sayonara' (string)

and the test code that produced them:

var lmdb = require('..');
var env = new lmdb.Env();
var fs = require('fs');

if (!fs.existsSync('test')) {
    fs.mkdirSync('test');
}
env.open({path: 'test', mapSize: 1024*1024});
var dbi = env.openDbi({name: "test", create: true});

function debugVar(v) {
    return "'" + v + "' (" + typeof v + ")";
}

const kvs = [
    {k: "a", v: "hello"},
    {k:  new Buffer("b"), v: "goodbye"},
    {k: 'c', v: "sayonara"},
];

console.log("test putString");
var t1 = env.beginTxn();
kvs.forEach(o => {
    console.log("putString: k="+debugVar(o.k)+", v="+debugVar(o.v));
    t1.putString(dbi,o.k,o.v);
})
t1.commit();

console.log("\ntest getString");
var t2 = env.beginTxn();
kvs.forEach(o => {
    console.log("getString: k="+debugVar(o.k)+", v="+debugVar(t2.getString(dbi,o.k)));
})
t2.commit();

console.log("\ntest cursor");
var t3 = env.beginTxn();
var cursor = new lmdb.Cursor(t3,dbi);

console.log("goToFirst returned "+debugVar( cursor.goToFirst() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));

cursor.close();
t3.commit();
dbi.close();
env.close();

Venemo commented 7 years ago

Looks like @braydonf has forgotten this use case.

braydonf commented 7 years ago

Perhaps there could be a goToNextBinary, and etc. methods for getting the key as a buffer if only the key is needed. Otherwise getCurrentBinary will give back the key as a buffer.

russtee commented 7 years ago

This still seems broken in the latest release. Any thoughts on the best resolution?

Separate get functions for binary keys could work albeit adding to complexity of the library. Perhaps an optional parameter for each goto function to return the key in binary form would suffice?

Either way it seems that mixing key types within a database will not work but as long as this is understood I don't see any major issues.

ousado commented 7 years ago

A scenario where one would want heterogeneous types of keys within the same database seems highly unlikely, hence the best option would be to associate the key type with the Dbi - as it's already done in case of keyIsUint32. An option keyType with constants for String, Binary, UInt32 (and maybe more) should be passed to openDbi and always used. That requires to add a field with that information to the wrapper, but since it's a relatively long-lived object, there's minimal overhead.

Venemo commented 7 years ago

@ousado Indeed, that sounds good. I'll think about it and get back to you guys next week about this.

da77a commented 7 years ago

It isn't "merely" a "garbage" string. I found this problem when using a database where the keys are strings, but they are not 0 terminated (obviously, these strings were not written by node-lmdb). That node-lmdb expected (and inserted) 0 terminators on keys stored in the database was a surprise. A bigger surprise was that using a cursor on a database containing such non-zero terminated strings would occasionally generate a bus error.... Because the returned key string is constructed using valToString which ignores mv_size. Regardless of any views regarding the utility (or lack of) of "heterogeneous types of keys" node-lmdb should at least be defensive (not exhibit undefined behaviour/buffer overrun) when/if a key contains unexpected data. A simple/reasonable fix is to always read keys (or values!) using mv_size as the length and simply trim any trailing 0s if returning it as a string.

Venemo commented 7 years ago

@da77a Yeah, that sounds reasonable!

da77a commented 7 years ago

On second thoughts - not reasonable enough. Because it is a key it's dangerous/incorrect to hide the distinction between 'a' and 'a\u0000' as (actual, in lmdb) keys. So on second thoughts node-lmdb "keys are strings" API needs to error if it reads a key without terminator. Code that needs to handle such keys must use the "keys are binary" API . As to how that should work - the simple fix of adding binary key returning methods seems fine. The idea of extending types takes on more responsibility in node-lmdb than seems to have been the intent. LMDB itself only understands 2 key types. So should a wrapper. The fact string was introduced as an additional type is understandable pragmatically but might actually have been a mistake, it can't be undone but provisions to add yet more key types don't seem necessary. Any user defined key types can be written on top of the binary / Buffers API.

ousado commented 7 years ago

I don't agree with those conclusions. There's no need for a terminator - the only correct way to reconstruct a string is by using the original length, NaN certainly has an API for that, and it should be used. Relying on the presence of a terminator is a bug, and an extremely nasty one at that. Further \u0000 is a valid unicode character that can be part of Strings.

Regarding support for additional types of keys, I see no problem with adding some convenience for the user, as long as it's done correctly and optional. Along those lines one could also consider adding an option for passing a pair of key encoding/decoding functions to openDbi which take care of the transformation of keys in a user-defined way.

da77a commented 7 years ago

@ousado , you state you don't agree but I don't follow your argument. I've assumed familiarity with the lmdb database and node-lmdb code not just what is exposed via the interface via node-lmdb.

The reliance on terminating \u0000 is not a proposal, it has always (ok, since this commit in 2013) been node-lmdb behaviour. However, if you only look at the node-lmdb string interface you will never see it because node-lmdb inserts it into the key when writing and strips it off when reading. The read "stripping off" is unsafe as of this change (ironically):

Safe version of getString, and zero-copy as getStringUnsafe

This is the "unsafe" line:

 auto str = Nan::New<v8::String>((uint16_t*)data.mv_data);

The addition of a "binary" "type" when there is already an explicit / custom conversion between the "lmdb native binary" and node string that is not the same (because of the trimming of the terminating \u0000) as the implicit conversion between a node Buffer and a node string has made this conversation, and the interface to node-lmdb complicated.

The potential confusion in the node-lmdb interface is high wherever it might return a string or a buffer (it would be less so without implicit conversions to string, but we can't change that). In this sense I do think your suggestions of some sort of mode flag is nice. Its a clear indication of what the user is expecting... However, I'd suggest that all that is needed is a flag on the cursor, not the whole dbi. And the flag is "keysAsBuffer" but provided as third parameter when constructing a Cursor, not inferred from whether the word Binary appears in the method name. If not set, behave as now (quirky as it is). If set ONLY accept keys as buffers and ONLY return keys as buffers.

What do you think?

da77a commented 7 years ago

I am writing something to address this bug including the related issues. I'd like to have a pull request accepted, hence the below which describes the issues and proposed fixes (some with options). I'll make up my own mind on the options unless @Venemo wants to tell me what will or won't be accepted (or anyone wants to point out the flaws).

The conclusions I posited originally included a brainfart in thinking there was a way to make the terminating \u0000 optional. Obviously there isn't because:

'a' and 'a\0000' are not equal (and it is lmdb itself that is comparing/looking up keys, so it is what is "written" to lmdb that matters, not what the node-lmdb client sees). There would be 2 distinct records (distinct keys) in lmdb...
It would break anything reading lmdb that was either an old version of node-lmdb or had adopted node-lmdb convention of including a terminating \u0000 for node-lmdb compatibility....

The minimum changes to make lmdb string interface safe are to:

Use a "safe" copy from mv_data to v8:String that checks mv_size.
- If there is no terminating 0, treat this as an error because it is impossible to round-trip this key (or data) format via the existing string interface. This avoids sporadic bus errors (encountered when using cursors - there is no way to directly look up such a key via string-keyed get) and "ghostly" records that can be traversed to/over but cannot be found using a string key. Any code that this change breaks was already broken and could expect to intermittently fault (due to running off end of lmdb allocated/mapped data) and/or return garbage.
  - I've considered alternative ways to flag the error. I don't have a strong preference for exception vs returning a Buffer. Thoughts?
    - Returning an empty string or null for the key would not work because null is already returned when lmdb returns NOT_FOUND. So that's a definite no.
    - Returning a Buffer instead of a string whenever the conversion is not possible will, without explicit checks on type, be converted to a string that is identical to the string return for a different (0 terminated) key... This is close to existing behaviour (some code will not notice the difference), but with the crashing removed and it allows user to check the type and if not a string handle it/report it explicitly. However at least for keys, it is almost certain the existing code is broken (or the database it is accessing is) and forcing (by throwing an exception) a change to fix might be better.
    - Throwing an exception will cause code that was blissfully unprepared to handle non-strings to fail if it encounters a "non string" (i.e. not written by node-lmdb string interface) key/value. This could be seen as a feature/correct. Unlike returning a buffer, code will never implicitly/accidentally convert the result to a string. I like it but I don't care as I'll be avoiding using the string interface altogether.

To actually allow node-lmdb to be used to access databases with non-string, non-integer keys (aka binary keys - but extends to anything that can be converted to from a Buffer):

Make the keys as buffers interface complete so that when using cursors and "expecting" binary keys keys are always returned as buffers (and no internal/implicit conversion of key to string is attempted).
- One fix is to add explicit BinaryKey versions of any key returning methods that don't take a key as a "hint".
- Another is to add the keysAsBuffer parameter to the cursor constructor. I prefer that option.

With those minimum changes it is possible to wrap the (now complete, hopefully) binary/Buffer interface to provide whatever "typed" key (and data so far as that goes) access without modifying node-lmdb itself. It would even be possible to refactor node-lmdb to ONLY use provide this interface as the low level (implemented in C++) layer and implement the string interface 100% in javascript. I'm not proposing to do that, I'm only using it to clarify what constitutes a complete "binary" interface and that with such an interface no further interfaces to node-lmdb are needed to support any key type. Keys as passed to/from lmdb itself are binary and are compared as such etc. They are logically Buffers. They are compared by lmdb itself in the same way as Buffer.compare. No amount of wrapping will change that. Specific string support is a legacy of a quick/simple node specific binding and I'd actually like to propose deprecating it (preferring to just convert to from buffers explicitly using your preferred encodings etc).

Aside re unsigned integer key support. Integer keys are exposed as a special key type by lmdb for reasons with deep history. There are exactly 2 key types in lmdb (binary and integer).

ousado commented 7 years ago

Backwards compatibility is obviously a concern that should be addressed, perhaps by introducing a backwards compatible mode for terminator-aware key handling (but without the potential of reading outside the bounds of the actual key) - and should always use one of

Nan::MaybeLocal<v8::String> Nan::New<T>(const char * value, int length);
Nan::MaybeLocal<v8::String> Nan::New<T>(const uint16_t * value, int length);

or whichever other length-aware conversion.

Going forward, however, I wouldn't know why node-lmdb should both do additional work and rely on the presence of a terminator by default, just because it (accidentally, I assume) has used an API that only exists to accommodate for null-terminated strings in legacy code from the dark ages of programming anyway.

As for the options listed above, please let's not introduce anything that couldn't be properly typed in a statically typed programming language - returning a different type because of the (non-)presence of a terminator strikes me as a bad idea, just as an API that suggests different types of keys within the same database might make sense. I'm not a fan of exceptions, but in the error case for a/the terminator-aware key access, it might just be the right thing to do.

Venemo commented 7 years ago

@da77a @ousado Thank you guys for the discussion!

I've been thinking about this more and I'm absolutely certain that this is a bug that was introduced when we decided that zero-copy should not be the default behaviour of node-lmdb.

The older, "unsafe" implementation does the right thing: the behaviour when using the zero-copy mechanism with CustomExternalStringResource (which predates the newer, faulty non-zero-copy mechanism) correctly uses the mv_size that it gets from LMDB, so I'd say that any place where node-lmdb disregards mv_size is a bug. The solution is to just copy the string correctly. I'll fix this as soon as I can.

About whether or not we need zero-termination: this is not solely a decision that can be made by us, it depends on how it works in V8. I'll need to look this up, does V8 give us zero-terminated strings? Also, does it expect zero-terminated strings? I'll check and get back to you, but I certainly wouldn't want to overcomplicate the node-lmdb API. The primary objective here is to be a "good node citizen" and behave like the average user would expect from a JavaScript API, and the user should definitely not worry about whether the string is zero terminated or not.

Venemo commented 7 years ago

@da77a @ousado Also, the issue about zero-terminated strings has nothing to do with #84 which is about cursor-based keys - these are two different topics alltogether.

da77a commented 7 years ago

@Venemo - the issues are related in that now that there is support for "binary" there is at least some expectation/possibility that one might try (accidentally or otherwise) to read "binary" as a "string". Which is exactly what this issue reports.

But sure, I'll raise a separate bug report of the failure to check length issue on "safe" string get. It's still easier to discuss the related issues of "how to support binary/pass-through properly" and "what to do when attempt to access binary as if it were in "node-lmdb specific utf-16 zero terminated string format" occurs. Clearly one of the things that isn't right in that case is to run off the end of the buffer. What the exact right thing is is more debatable: @ousado vote for exception, @da77a on the fence between exception and returning a Buffer, @Venemo in favour of "just copy the string correctly".

I don't understand the last, I've just spend most of this thread explaining why it is bad/logically impossible. There is no "correct" because you can't copy 0s that aren't there... Yes you can return a valid string but it creates a logical inconsistency/nonsense (Eliding details, assume start with empty db, assume no dup keys, assume buffer overrun bug fixed by "copying correctly"):

var keyStr = 'Hello';
var keyBuffer = Buffer.from(keyStr,'utf16le');
txn.putString(dbi, keyBuffer , 'Hello binary world!');
txn.putString(dbi, keyString, 'Hello string world!');

...

// The following asserts should then, according to your expectations pass. Which is "surprising"...
var resultKey = cursor.goToFirst();
var resultData = cursor.getCurrentString();
assert (resultKey === keyStr);
assert (resultData === 'Hello binary world!');
resultKey = cursor.goToNext();
resultData = cursor.getCurrentString();
assert (resultKey === keyStr);
assert (resultData === 'Hello string world!');

@ousado would instead like the goToFirst() to throw (key not convertible to zero terminated string). I could be convinced to go with @ousado or with a change to the interface that, in this code example, would with a typeof check on resultKey, show that the first key was a Buffer and the second was a string... And regardless, would not consider the 2 keys equal.

And of course the above is also proof that the zero terminated UTF-16 string format is not "required" by V8 or lmdb or... It is entirely your ( @Venemo ) invention. I haven't asked why, it doesn't matter - it is now THE interface, and THE format of node-lmdb "string" keys and data. Changing "string" key format would be a breaking change. So the "legacy node-lmdb specific zero terminated UTF-16 string" data and key format has to live on for users that use it. I'd assumed that much wasn't up for debate?

Everyone else should, imho, use the binary/buffer interface ONLY. Which has to include cursor methods that consistently return keys as Buffers.

Venemo commented 7 years ago

Okay, so I investigated a bit more on the topic. Basically as far as I see, @russtee attempted to save something as a buffer and then read it back as a string. For starters, JavaScript strings are encoded in UTF-16, and @russtee stored a UTF-8 string (the default encoding of Buffer is 'utf8').

Obviously, it should NOT read garbage from the memory. This is now fixed with https://github.com/Venemo/node-lmdb/commit/f5b80e696eca3a4116a21bc76042d47a20e60ce6 - I think we can agree on at least this.
Obviously, reading a UTF-8 encoded string using functions that expect UTF-16 is also not supported. This has came up several times in previous issues, and was discussed a lot. The conclusion is: JavaScript strings and therefore V8 strings are encoded in UTF-16. If you use Buffer.from("hello", "utf8") or Buffer.from("hello") (which also uses UTF-8), and store that using node-lmdb, you cannot read that as a string because we expect strings to be UTF-16. If you really want to work with UTF-8 strings, you can do that with the Buffer-based API.

Those two points aside, we still have a valid use case:

When the user creates a Buffer like this: Buffer.from("hello", "utf16le"), it could be a sane expectation to read that back as a string. This currently doesn't work.

Venemo commented 7 years ago

I added a failing testcase for reading back buffers as strings (with the correct encoding): https://github.com/Venemo/node-lmdb/commit/c0d879993e3854b2727941016e509747e0a6d171

Now the only thing that remains up for consideration is backwards-compatibility with old databases. Right now node-lmdb automatically adds zero-termination for every string (and expects the presence of the zero terminator), so this is the root cause of the problem.

@ousado @da77a Possible solutions are:

(1) Add zero-termination to buffers as well - this would obviously break old databases.
(2) Remove zero-termination from strings - this, too, would obviously break old databases.
(3) Simply document the use case that you need to add the terminating zero manually, when you want to read buffers as strings ― this may not be the most elegant, but is the simplest solution.
(5) Alternatively, we could implement (1) or (2) and add a compatibility switch that keeps the old behaviour.

Venemo commented 7 years ago

To tie up one more loose end, I documented the UTF-8 use case in the readme.

ousado commented 7 years ago

My preferred solution would be (2) and (5), using an option to openDbi that selects the key type for the database, among them e.g. "LegacyString" for the compatibility mode.

Venemo commented 7 years ago

I'm fine with (2) and (5), let's hear what @da77a has to say about it. :)

da77a commented 7 years ago

@ousado and @Venemo.

None of the 5 (4) proposals actually addresses the issue we are commenting on! So I vote for 3.

Views on each:

1) My real use case is broken by this change. Interesting fix - but you are still thinking as though every access to a database is via same library/language/code. That in my experience is rarer than you appear to think. Asking every other bit of code to stick 0s on the end of data to make it work with node doesn't fly. 2) Breaks everything. Can't just update every deployed db never mind anything that accesses it. Would actually end up changing code to avoid using strings interface at all to avoid changing the db format... 3) My favorite. Nobody has a use case that is INTENTIONALLY reading one type and reading another and expecting some specific consistent result/mapping. 4) There is no 4? 5) Has to be a switch to enable the new behaviour (to avoid breaking existing). If so that is ok, but it doesn't fix anything I actually care about. The way I spell "please don't 0 terminate my string" is to simply not use a string type when reading/writing data (always use buffers) and I just need to remember to specify the encoding I want (which isn't always UTF-16 anyway). And that is now the documented way to do it.

da77a commented 7 years ago

The value (data) interface already makes it easy to chose which way you want your data cooked. Which explans my 3 vote.

The Cursor key interface is the only problem. Cursor tries to guess key representation from value representation which is rather odd, and it gives up and returns a string if there is no value involved. In all other cases (transactions) the key is an in parameter and you can pass a string or a buffer - no problems there.

I have a fix for the cursor interface that supports:

1) Exactly existing (default) 2) Always Buffer (yes even if keyIsUint32 - you can represent a uint32 in a buffer of course...) 3) Always UTF16 string (throws if the string doesn't have the expected terminator) 4) uint32 (as per existing)

As far as I can see for every non-cursor interface the key can already passed as either string or Buffer and "just works" (and can't be changed without breaking stuff).

I need to write more tests and most importantly test it fixes my actual use case.

It selects between 1, 4 based on dbi keyIsUint32 param and takes an optional 3rd ctor param to select 2 or 3. Selecting 3 when keyIsUint32 is ignored (just acts as for keyIsUint32 alone). So effectively this selection is "do the sane thing for the actual in-lmdb type".

Venemo commented 7 years ago

@da77a Thanks for your input, it is much appreciated!

Nobody has a use case that is INTENTIONALLY reading one type and reading another and expecting some specific consistent result/mapping.

Yes, that was my assumption as well! I'd vote to leave the string handling as it is (and maybe re-think it later along with other major redesign-related issues like #46 or #51).

The Cursor key interface is the only problem. Cursor tries to guess key representation from value representation which is rather odd

I'd say that the buffer keys feature is (as of now) conceptually broken. This is because I was too lazy to think about possible implications when I reviewed and merged #83.

The real question here is whether or not we want to support different key types for the same database. My original concept (before the buffer keys feature) was that only one key type is allowed per database, that is why keyIsUint32 is implemented as it is. I'm not sure what I was thinking when merging #83 as-is, but I now believe that the current buffer key behaviour is broken on a conceptual level and should be changed to something like a keyIsBuffer switch for openDbi. What do you think about this?

ousado commented 7 years ago

Different key types per database make no sense - where would one get the information which type of key a given entry has? From another database? Based on the value of the key itself? That strikes me as beyond esoteric. openDbi is the right place to determine the key type in a single place, and everything else should just use that information.

da77a commented 7 years ago

Sure, by all means put the parameter in the dbi. The only class that will actually honour/use it unless someone has plans for more (breaking) changes to the node-lmdb API will be Cursor. So if you don't mind I'll implement it in Cursor, and if anyone (you perhaps?) wants to project it through from the dbi feel free.

The comparison to the special case for integer is a red herring. MDB_INTEGERKEY is a genuine flag to LMDB that enables a "small, fixed size key" optimisation that depends, obviously, on all keys in the dbi being the same size. It has NOTHING to do with the actual "integerness" of the key. lmdb itself still passes the key as an MDB_val, it just complains if all key MDB_vals don't have the same mv_size. You could perfectly reasonably store integer keys without using this flag at all.

Note that a breaking change in the API would make all this much cleaner and simpler. If you want to throw a fork?

Venemo commented 7 years ago

@da77a I don't mind the breaking change, especially considering that the current behaviour is simply wrong. I agree with @ousado in that it only makes sense to use a single key type per database. I imagine that a compatibility switch to trigger the old (current, buggy) behaviour might come in handy for those people that use this API, but to be honest I don't believe this is necessary.

Venemo commented 7 years ago

@da77a @ousado The current string VS buffer behaviour (and the need for the terminating zero) has now been documented as of https://github.com/Venemo/node-lmdb/commit/cb2dd744af636d349113f2c9ae020e5569402c27 and a passing testcase has been added.

So, for now:

Let's leave the zero termination as-is (and maybe later add an option to change this behaviour?)
Let's fix the buffer keys feature to work more intuitively, like the integer keys feature

da77a commented 7 years ago

Fixed by #96

Or rather, the behaviour is now well defined. @rustee example now throws for second key (not 'b', and not garbage from buffer over-read) using the "keys as zero terminated strings" interface due to @Venemo cb2dd74 change. Note that #96 had to add a small tweak to avoid wandering off into the weeds even after this change, if someone writes a single BYTE buffer, as @rustee did (because the default char encoding in new Buffer(string) is utf8)....

More significantly, #96 also adds a "return keys as binary" option (third, boolean flag param) to the cursor constructor so that modifying @russtee example to use such a cursor gives sane (or at least educational) behaviour. Modified version of @rustee code illustrating the behaviour with #96 applied:

var lmdb = require('..');
var env = new lmdb.Env();
var fs = require('fs');

if (!fs.existsSync('test')) {
    fs.mkdirSync('test');
}
env.open({path: 'test', mapSize: 1024*1024});
var dbi = env.openDbi({name: "test", create: true});

function debugVar(v) {
    return "'" + v + "' (" + typeof v + ") JSON=" + JSON.stringify(v);
}

const kvs = [
    {k: "a", v: "hello"},
    {k:  new Buffer("b"), v: "goodbye"},
    {k: 'c', v: "sayonara"},
];

console.log("test putString");
var t1 = env.beginTxn();
kvs.forEach(o => {
    console.log("putString: k="+debugVar(o.k)+", v="+debugVar(o.v));
    t1.putString(dbi,o.k,o.v);
})
t1.commit();

console.log("\ntest getString");
var t2 = env.beginTxn();
kvs.forEach(o => {
    console.log("getString: k="+debugVar(o.k)+", v="+debugVar(t2.getString(dbi,o.k)));
})
t2.commit();

console.log("\ntest string cursor");
var t3 = env.beginTxn();
var cursor = new lmdb.Cursor(t3,dbi);

console.log("goToFirst returned "+debugVar( cursor.goToFirst() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
try {
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
} catch (ex) {
  console.log(ex);
}
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));

cursor.close();

console.log("\ntest binary cursor");
cursor = new lmdb.Cursor(t3,dbi, true);

console.log("goToFirst returned "+debugVar( cursor.goToFirst() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));
console.log("goToNext returned "+debugVar( cursor.goToNext() )+", getCurrentString returned "+debugVar( cursor.getCurrentString() ));

cursor.close();
t3.commit();
dbi.close();
env.close();

which produces the following output...


test putString
putString: k='a' (string) JSON="a", v='hello' (string) JSON="hello"
putString: k='b' (object) JSON={"type":"Buffer","data":[98]}, v='goodbye' (string) JSON="goodbye"
putString: k='c' (string) JSON="c", v='sayonara' (string) JSON="sayonara"

test getString
getString: k='a' (string) JSON="a", v='hello' (string) JSON="hello"
getString: k='b' (object) JSON={"type":"Buffer","data":[98]}, v='goodbye' (string) JSON="goodbye"
getString: k='c' (string) JSON="c", v='sayonara' (string) JSON="sayonara"

test string cursor
goToFirst returned 'a' (string) JSON="a", getCurrentString returned 'hello' (string) JSON="hello"
Error: Invalid UTF16 string
    at Error (native)....
goToNext returned 'c' (string) JSON="c", getCurrentString returned 'sayonara' (string) JSON="sayonara"

test binary cursor
goToFirst returned 'a' (object) JSON={"type":"Buffer","data":[97,0,0,0]}, getCurrentString returned 'hello' (string) JSON="hello"
goToNext returned 'b' (object) JSON={"type":"Buffer","data":[98]}, getCurrentString returned 'goodbye' (string) JSON="goodbye"
goToNext returned 'c' (object) JSON={"type":"Buffer","data":[99,0,0,0]}, getCurrentString returned 'sayonara' (string) JSON="sayonara"

Venemo commented 7 years ago

I believe this has been sufficiently discussed, so I'm closing this issue now.

Venemo commented 7 years ago

@da77a @russtee I've just pushed the proper solution. You can now specify the key type on the database: env.openDbi({ keyIsBuffer: true }), which you can alternatively override on txn.put, txn.get, txn.del etc., and in the 3rd parameter of the Cursor. So even if you opened the database wanting to work with a particular key type, you can still override that on a per-operation or per-cursor basis.

What do you guys think about this?

da77a commented 7 years ago

@Venemo , I agree that being able to set a default key type via dbi so you don't have to specify per cursor constructed is handy. I'm also happy you made the call to immediately remove the "to be deprecated" backward compatible behaviour of guessing the key return type for cursor fns from the data type - I only included that in my changes "just in case" you didn't want to do that... However, can I suggest you need to either bump the version number further, or back of the level of checking being done on key parameters in argToKey . While @russtee strongly expressed a view that the API should be "as if" we were in a strongly typed language, this does not mean we can't let the language do type-based overloads/deduction. The txn get/put methods acted correctly for whatever valid key type (string or Buffer) was passed. As did the cursor methods that took a key. It was possible to write entire correct programs (those that didn't use the key returned from a cursor) that correctly inferred the key type from parameters passed. This still works if the program used/uses string keys. It does not work if the program was using binary keys. Specifically, I tried to take your latest version rather than continue working off my fork and couldn't use it for this reason. I can't even freeze at a version I know works but has an "unbroken" API because the version wasn't bumped until after breaking changes to argToKey in 74c28d1.

Venemo commented 7 years ago

@da77a Regarding the version, what do you suggest? If I simply bump the version number to 0.5 now, would that help?

da77a commented 7 years ago

@Venemo - bumping the version certainly makes more sense than breaking changes in a supposedly patch release. Ideally there would be a pure fixes (no API changes 0.4.x with a move to 0.5.x for future dev (and with API changes). Which was the reason (including a fair bit of self interest - non-public code that depends on the API) I was at pains to preserve existing behaviour in my changes.

I realise until 1.0 semver is a bit hit and miss. Do you have any plans/point in time at which you expect the API to be stable?

Venemo commented 7 years ago

@da77a Trust me, I hate making breaking changes as much as receiving them, but I really did consider the buffer-based keys feature to be very broken. That being said, I re-added the type inference logic to txn.get, txn.put and cursor, with the exception of keyIsUint32 (in which case you are not allowed to override it).

Can you try 0.4.7 and see if it works with your code correctly?

Venemo / node-lmdb

cursor 'goto' functions return garbage for buffer based keys. #84