arangodb / arangodb

🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
https://www.arangodb.com
Other
13.59k stars 836 forks source link

[feature request] _key as number when collection keyType=autoincrement and allowUserKeys=false #4247

Open Mararok opened 6 years ago

Mararok commented 6 years ago

For document pagination I used query like that: FOR doc IN collection FILTER TO_NUMBER(doc._key) > @lastId SORT TO_NUMBER(doc._key) ASC LIMIT 100 RETURN doc

Its similar to 1. Paging without discarding records in http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql

Problem: For big collection its need too much memory and is slow

Simple solution: Add "virtual" attribute to document: "_keyAsNumber" and add skiplist index on it Update query after create document: LET doc = DOCUMENT(@id) update doc WITH { _keyAsNumber: TO_NUMBER(doc._key) } IN @@collection New pagination query: FOR doc IN collection FILTER doc._keyAsNumber > @lastId SORT doc._keyAsNumber ASC LIMIT 100 RETURN doc Its using skiplist index scan and its fast but needs extra write on document

Native Solution:
When collection keyType=autoincrement and allowUserKeys=false "_key" value can be number

Simran-B commented 6 years ago

Hm, actually any document key can be sorted because there is a defined lexicographical order (it depends on the server language though). It is not very useful for number strings though. TO_NUMBER() turns it into a number which can be compared numerically, but you lose index utilization (with RocksDB storage engine this could possibly be optimized). Instead of providing a virtual property which exists sometimes, a different comparator might do the job. If I'm not mistaken, it can be as simple as comparing string length first and character by character second if the lengths are equal (it stops e.g. "30" from being greater than "100").

Mararok commented 6 years ago

Checking character by character is more time cost(maybe with some SSE magic it will be faster) then comparing "native" numbers but you can compare numbers greater then 64bit. I found article about it https://www.arangodb.com/2017/09/sorting-number-strings-numerically :).

New comparator for case where Collection option allowUserKeys=false and skiplist index on _key will be very helpful.

jsteemann commented 6 years ago

_key is always a string in ArangoDB, and I think it has to stay like this.

Consider the HTTP REST API, for example for HTTP GET /_db/<dbname>/api/document/<collname>/<key>. If <key> can either be numeric or a string it would be totally ambiguous what type (number or string) the lookup should be for if you pass in a key such as 1. Should this be the numeric key 1 or the string key 1? There is simply no way to express this unambiguosly and downwards-compatible in the existing REST API. And there are other places like that.

There are a few future ways out of this:

Simran-B commented 6 years ago

The problem with zero-padding is, that one has to reserve a lot of digits making the numbers really long and hard to deal with for humans, or you reserve less but may run out of numbers at some point. Comparison would be string-based - but I doubt that this would be any performance issue if you consider how long it takes to actually fetch the document either from main memory or from SSD / spinning disk.

What would actually happen if documents stored their initial revision (_rev) and you would sort by this field? For a single instance, I assume it would be in creation order. Not sure about cluster though.

jsteemann commented 6 years ago

@Simran-B : agreed. Though actually with around 20 digits you should be safe enough for the foreseeable future (10 ^ 20 should provide a space big enough). Potentially one can also get away with several digits less. However, zero-padding would still be a workaround from my point of view. Having some trigger / transformation function on insertion would be my preferred way, because that would allow to plug more or less arbitrary rewriting functionality to the documents, which would allow zero-padding but also completely different things.