HouzuoGuo / tiedot

A rudimentary implementation of a basic document (NoSQL) database in Go
BSD 2-Clause "Simplified" License
2.72k stars 257 forks source link

Why not use UUIDs? #54

Open metakeule opened 10 years ago

metakeule commented 10 years ago

Is there any reason apart from space requirements that the generated ids are no UUIDs. That would make it easier to synchronized different tiedot databases.

HouzuoGuo commented 10 years ago

Hello!

Theoretically, the chance of getting duplicate document IDs is only slightly higher than getting duplicate UUIDs using common algorithms.

What do you think?

metakeule commented 10 years ago

I have no idea about the chance of getting duplicate document IDs. I did not calculate it. However UUIDs looks like some standard to me and PostgreSQL even has an own data type for it (via contrib), that prohibits homegrown ids. I think couchdb also uses uuids.

So I think it is an issue of interoperation and compatibility.

Uuids should allow to synchronize, migrate and combine data between databases that support uuids without have to fear duplicated ids. Think of "foreign keys" between different databases in different DBMS.

HouzuoGuo commented 10 years ago

Thanks.

My primary argument against using UUID as the compulsory (primary) document ID is that UUID generation and storage are a lot more expensive compare to long integer IDs.

Thinking about compatibility - to simplify management of UUIDs in documents, how about having these interfaces:

metakeule commented 10 years ago

The proposed interfaces would be fine for me.

HouzuoGuo commented 10 years ago

all right, shall do!

HouzuoGuo commented 10 years ago

these APIs will make their way into version 3.1 =)

metakeule commented 10 years ago

Great, thanks!

shmakes commented 10 years ago

I know I am a bit late to the discussion, but I just wanted to offer a word of warning about UUID's. They can be great for big, distributed, sharded databases (not sure that is tiedot's aspiration), but beware of exposing them in a text-based web interface (JSON, XML).

If a client written in Java or .NET expects to parse them and create native UUID/GUID objects, they may have trouble if the way you serialize them to a string does not match their platform's native scheme. Even worse, if they are allowed to create objects with UUID's they generate in client code and are expecting to link to them with URI's stored in other files/databases based on those same UUID's, they will likely be unable to find the documents they created because the string representations will be different.

While UUID is a "standard", there are several types and some types have additional sub-types. http://en.wikipedia.org/wiki/Universally_unique_identifier#Variants_and_versions It is a shame that there can be so much variation between a 128-bit integer and a string representation but (to paraphrase Tanenbaum), that's the nice thing about standards: there are so many to choose from. :-)

CouchDB's HTTP interface offers an option for clients to retrieve a batch list of server-generated UUID's in a JSON document to use for inserts. You may want to consider adding that functionality to tiedot's HTTP interface. (None of these concerns really apply to the embedded interface.)

MongoDB stores user-supplied UUID's as a particular type of byte array. It is a real mess when choosing how to display it in a web client. It can be displayed as a JSON sub-object with properties for type and a base64 representation of the UUID data, or an application (like Robomongo) can convert the value into any of three UUID string formats (Java, .NET, and Python).

Tiedot currently uses a 64-bit integer for ID's. (I believe that is the only type currently allowed.)

Some questions: Is that going to be expanded to 128-bit in version 3.1, whether UUID's are being used or not? What about the ability to mix current documents with UUID-based documents in the same database? Will existing applications break if a document with a 128-bit UUID is returned instead of a 64-bit Id?

Side note: If I were going to do a web API from scratch with a unique Id again, I would probably use a 96-bit integer (similar to MongoDB) but use a URL-safe variation on base64 to get a human-readable, 16-character string (with no padding character because 96 is divisible by 6). See: http://tools.ietf.org/html/rfc4648#page-7

Instead of UUID Id's like: 6ec80f37-fe6a-429a-99d8-d0f4500ea7b6 or 6ec80f37fe6a429a99d8d0f4500ea7b6 You get an Id like: Nw_Ibmr-mkKZ2ND0

This is not really standard either, but the encoding is more dense than base-16 or base-10 numbers as text and easier to read and re-type (depending on the font - 1 vs. I vs. l (L)).

Anyway, didn't mean to go this long - just wanted to give a heads-up to avoid some possible pitfalls with UUID's.

HouzuoGuo commented 10 years ago

Oops, I forgot to mention that, in order to get several important bug fixes out ASAP, I marked 3.1 release without the UUID feature =)

That's definitely very helpful shmakes. Document ID generation was a minor headache for quite some time. The original thought resulting in the usage of 64-bit ID was according to my to my rough calculation, that the chance of an ID collision is small enough to be ignored given the current capability of tiedot - being able to handle around 10 million documents.

Well, I certainly look forward to the day when tiedot handles 1bn+ docs, therefore it's definitely worth considering alternative doc ID formats...

omeid commented 10 years ago

I would like to add that including a device and process identifier, similar to MongoDB, will make managing multiple instances[1] of tiedot that painless and safer.

[1] whatever in a horizontal scaling, client-master or any other setup that involves aggregation at some stage.

geoah commented 9 years ago

First of all I'd like to say +1 for UUIDs (v4 preferably).


That been said; I would like to ask if this could be modified/hacked a bit to allow inserting, reading, updating, removing using custom String IDs.

I have some collections where the objects need to retain their original ID that come from an external system/api. The IDs are strings (SHA1 hex strings if anyone is keeping score) and must to be managed via this.

Is there any change to be able to supply our own String IDs at some point or any tips for adding this functionality without completely forking tiedot? :/

    id, _ = nodes.InsertWithStringIDOrSomething(map[string]interface{}{
        "ID": "a2",
    })
    // id should be "a2"

Thanks :)

HouzuoGuo commented 9 years ago

@geoah Thanks for your feedback! The reason of choosing integer as document ID is for better efficiency. Back in version 1.x, there was a version using generated UUID as an alternative to document ID (that was version 1.2 or something), but it used a flawed implementation that could occasionally lead to incorrect query results. Each document ID is associated with a physical location in document data file, the associations are stored in an ordinary hash table; the same hash table implementation is used by user-created indexes. Using string ID (whether UUID or not) will have an overall 10-15% performance penalty, though the implementation should be straight forward:

geoah commented 9 years ago

@HouzuoGuo Thank you for the clarification and help on this.

If I may hijack the thread for just one more second, I came up with the following simple implementation. https://github.com/geoah/ink-haiku/blob/104275bd6ba369923be807b26bb7b6eb3275e645/store.go Any suggestions/ideas on the basic concept? It is missing some proper error returns etc but I was just wondering it this is what you where talking about or I completely missed the point.

Thank you very much in advance!

HouzuoGuo commented 9 years ago

@geoah Your implementation is really good, in fact I wrote a very similar one back in version 1.2 :smiley: The flaw in both of our implementations is the lack of atomicity in document operations. You perhaps have noticed it in insert/upsert operation. A more correct implementation which guarantees operation atomicity would be more complicated: https://github.com/HouzuoGuo/tiedot/blob/master/db/doc.go#L101 Note the two mutexes and one pseudo-lock:

After all of them are taken care of, the custom-ID operations will be truly atomic.

Oh and one more suggestion - for even better performance, consider interacting with custom-ID hash table directly without calling query processor.

geoah commented 9 years ago

Since the internal Col.Insert(), Col.Update(), Col.Delete() methods already take care of the atomicity of the transactions why is it required to re-implement the locking mechanisms?

I'm trying to wrap my head around this but since the wrapper methods call the internal ones, when Store.Update() and Store.Delete() are called at the same time the second internal method that will be called Col.Update() or Col.Delete() will throw an error. I think I'm getting something wrong here hehe~ :)

Should we move this this discussion somewhere else so we don't pollute this thread? Thanks once again.

HouzuoGuo commented 9 years ago

I wouldn't mind having the discussion here. Let's take a look at the function Insert, it calls three database functions all of which guarantee atomicity:

Since the three operations do not all happen at once, consider the possibility of two Nodes having identical custom ID, being stored at the same time:

In this case, the custom ID is indexed with two documents.

The index mechanism will return both documents upon query, and if the document is to be deleted later on, it will have to be deleted twice - unless your Delete implementation works around it ;)

geoah commented 9 years ago

A very (very very) quick and dirty solution should solve most of these issues. https://github.com/geoah/ink-haiku/commit/b1d69e96d7b79e256ce6736fe04cdd5bfb893de3

I know that I'm shooting performance in the head (twice) as everything is being locked instead of just locking a single data partition as you are doing in tiedot. I'll try at some point to add a hash table for the CustomID/UID and export the rest of the methods. Till then performance isn't a great issue.

It doesn't seem possible to col.db.schemaLock.RLock() as it is not being exported so I'll just try to remember not to mess with collections names or indexes while doing other stuff! :P

ps. Out of curiosity... If I understand your implementation correctly one of the reasons for the high performance penalty if using strings IDs would be the data partitioning and the ability to only lock a single partition at a time. Couldn't this be solved by using a rainbow style partitioning? Or am I totally wrong about this being a main reason of the performance penalty?

Thanks once more. Tiedot seems more wonderful as I dive deeper into it :)

geoah commented 9 years ago

pps. It's kinda late so I might not be making much sense... by "rainbow tables" I actually was referring to the way rainbow tables are sometimes (I think?) stored. eg IDs "aaaa", "aaab" will be stored under the a.a.a partition path. as a and b while the aabb and aabc under a.a.b as b and c.

The idea would be to have multiple partitions again but this time split it using leading characters instead of ID % partition count. Or something like this.

HouzuoGuo commented 9 years ago

Thanks for the complement. Your locking approach is definitely on the right track. The user created indexes work in this way: they hash the value (in document), and store (one or more) associations between the hash and physical document location, for example "A" -> 9284 (hash) -> 0 (physical location).

And the hash table implementation certainly does not care whether the input key is a hashed result or not, it can store arbitrary integer -> integer associations. Therefore, the currently used integer ID is as efficient as one Get operation on the hash table. Also knowing that the ID is supposed to be unique, the result of hash table's Get operation (which is document physical location) can be used to fetch document without having to verify against hash collision of document ID.

By using a string ID, it will have to get hashed before being put into hash table. Next, to avoid hash collision, every GetDocByCustomID will have to JSON deserialise the document content and verify against hash collision. The hash table will also become less efficient as the ID lookup must return all results (if there are multiple) against an ID hash, because of potential collisions.

kylewolfe commented 9 years ago

I'd like to throw in my suggestion in case if for some reason tiedot moves away from integer ids to a string. We could use something with some more functionality in it. Being able to parse certain info right out of the ID could be useful IMO (such as created date). See ObjectId http://godoc.org/labix.org/v2/mgo/bson for how this can be accomplished.