PacktPublishing / Hands-On-Software-Engineering-with-Golang

Hands-On Software Engineering with Golang, published by Packt
MIT License
426 stars 127 forks source link

Why can you compare uuid strings ? #14

Closed iamleson98 closed 3 years ago

iamleson98 commented 3 years ago

Hello mr. Achilleas To be honest, I like your book and It brings light to my golang career pretty much than I could imagined at first. And still I found there is an issue I don't understand in your code yet.

Why do you compare uuid strings?

package main

import (
    "fmt"
    "time"

    "github.com/google/uuid"
)

func main() {
    u1 := uuid.NewString()

    time.Sleep(3 * time.Second)

    u2 := uuid.NewString()

    fmt.Println(u1 < u2)
}

I tried this code and the results varies, not all true And are you sure databases like mysql/cockroach can compare uuids depend on there creation time ?

Thank you.

atc0005 commented 3 years ago

@iamleson98: I tried this code and the results varies, not all true

There are several things happening here:

I'm not associated with the repo (other than a fellow learner like yourself), but responding to the one item you noted in case it is helpful. Apologies if everything I've noted was already clear to you.

achilleasa commented 3 years ago

Hi there.

Note that there are different kinds (versions) of UUIDs. In particular, as you can see in the docs for google.uuid, uuid.NewString creates a random (a.k.a V4 UUID) UUID.

For your example to work, you would need to generate V1 UUIDs which contain a time component using uuid.NewUUID().

achilleasa commented 3 years ago

To further clarify, you should think of UUIDs as 128bit numbers. So in the context of the book, the idea is to carve the 128bit space into N segments. Any valid UUID is just an 128bit number which must fall within one of those segments. In that case, we can correctly use a less-than comparison to check whether a value U lies in the range [segment start, segment end].

iamleson98 commented 3 years ago

Hello mr @achilleasa. Thank you for kindly replying me, instantly.

Could you take a look at this function in your code, link below ? As you provided above, if I want this function to work as expected, I have to provide UUID V1 instead of the default v4 ?

// this is V4
uuid.New()

https://github.com/PacktPublishing/Hands-On-Software-Engineering-with-Golang/blob/bcd9942a2b473e056c49a75d98f58ea18aa56191/Chapter06/linkgraph/store/memory/memory.go#L89

Thanks.

iamleson98 commented 3 years ago

Hello mr. @atc0005 Thank you for kindly answer me.

I was also wondering how unique are uuids, and I found this stackoverflow answer. That's odd. they are still can be duplicated in very low chance.

https://stackoverflow.com/questions/1155008/how-unique-is-uuid

achilleasa commented 3 years ago

@iamleson98 I am not entirely sure what you expect to achieve by switching to V1 UUIDs. Can you please elaborate on the particular use-case you have in mind?


As far as the linked method is concerned, what you actually want is V4 UUIDs. For the use-case described in the book, we want to be able to spin up N workers and (evenly) distribute the links that need to be crawled among them.

As V4 UUIDs are essentially a random number in the (0, 2^128 - 1) range (note: the 0 and 2^128 - 1 values are special we ignore them for now) we expect that the UUIDs assigned to the links we discover and add to the store to be evenly (more or less) distributed in this range. It might help if you visualize this range as a circle.

Let's say we have 2 workers. We split the range in 2 equal sub-ranges and assign:

Each worker invokes the Links method with the assigned range and should get back approximately 50% of the links in the database. Note that the same logic also applies to the Edges method (returns the edges that originate from a link in the specified UUID range).

(As a thought experiment try to come up with the sub-ranges when you have 4 workers)

Generalizing this for N workers: as long as we know the number of workers (N) and the index of each worker (a value from 0 to N-1) we can split the range in N equal segments and assign each worker approximately 1/N of the available links. This makes it quite easy to scale up/down the number of workers and efficiently rebalance the assigned work for the remaining available workers.


If you were to switch to V1 UUIDs (i.e. with a time component included) your link IDs would not be evenly distributed in the UUID space but you would most likely end up with clusters (hot spots) of link records based on the time they were added to the store. As a result, if you use the above method for assigning ranges to workers you will not using the available resources efficiently as some workers may be idling while others do most of the work.

Random UUID generation can certainly cause collisions. However, this is highly unlikely unless you have already generated a significant amount of UUIDs. In any case, the in-memory store implementation includes logic to check for duplicate UUIDs and generate a new one in case of a collision. The cockroachdb backend has the DB generate the UUIDs which are guaranteed to be unique as they are the PK of the row.


An interesting use-case for V1 UUIDs would be if you were adding records to a database and wanted to efficiently perform range queries based on the insertion time (e.g. select the records that were added in the last N minutes). In this case, if the UUIDs were the primary key for the row, the DB would use the PK index to efficiently answer that query.

Hope that answers your question.

iamleson98 commented 3 years ago

@achilleasa mr, Please take a look at this line https://github.com/PacktPublishing/Hands-On-Software-Engineering-with-Golang/blob/bcd9942a2b473e056c49a75d98f58ea18aa56191/Chapter06/linkgraph/store/memory/memory.go#L95

As you have explained, I can see the 3 variables from, to, id are string representations of uuid v4. As I provided above, We will be likely find arbitrary links, not in any order.

If we save links into database with links have structure like below, then we can only find a range of links by comparing their creation times,

CREATE TABLE Links (
          id uuid primary key, 
          url varchar(250),
          create_at timestampt default NOW(),
)

we can find links in a range if you we use int64 instead of uuids

achilleasa commented 3 years ago

@achilleasa mr, Please take a look at this line https://github.com/PacktPublishing/Hands-On-Software-Engineering-with-Golang/blob/bcd9942a2b473e056c49a75d98f58ea18aa56191/Chapter06/linkgraph/store/memory/memory.go#L95

As you have explained, I can see the 3 variables from, to, id are string representations of uuid v4. As I provided above, We will be likely find arbitrary links, not in any order.

Correct! The order is not important and we actually get some benefits of it being random: if the site we are crawling has a large number of links they will be distributed to different workers; so, we will curl them at different (i.e. not sequentially) times and from potentially different source IP addresses. This makes it less likely to get blacklisted/throttled by the sites we crawl.

If we save links into database with links have structure like below, then we can only find a range of links by comparing their creation times,

CREATE TABLE Links (
          id uuid primary key, 
          url varchar(250),
          create_at timestampt default NOW(),
)

we can find links in a range if you we use int64 instead of uuids

That would probably work. However, if you use int64 as the ID, you will need to first make a query to find the min and max IDs in the database so you can generate the ranges you need. However, as links gets deleted, you will end up with holes in the PK range so again (over time) the workers might be assigned uneven batches of links.

iamleson98 commented 3 years ago

One more question please, Mr. @achilleasa !

Thank to your explanation, I see the benefit of using uuids as primary keys in database tables, especially in distributed systems.

I am working on a golang web project. I set all primary keys of all tables to UUIDv4 Now it comes to pagination and we use cursor based style.

However, it seems like UUIDv4 is not suitable in this situation while int64 primary keys fit perfectly.

In graphql, to paginate rows we have to provide after/before base64 encoded cursors. These cursors simply have the form <record_id>:<Table_Name> but made opaque by base64 encoding.

After being decoded, these <record_id> intergers will be used to filter the table records. E.g:

SELECT * FROM "TableName" WHERE "Id" > ? AND "Id" < ?;
-- quotation marks are replaced with <record_id>s above.

But if these primary keys are UUIDv4, We can not do sql filter like that, since they are in arbitrary order. Am I lacking knowledge at this then please point out my ignorance, I appreciate. If that's really the problem, could you provide a solution for this?

Thank you!

achilleasa commented 3 years ago

Not sure if I can help out here as I don't really know the internals of the thing you are working on. However, in your case i's probably better to use ints (uint64 instead of int64 if you don't want any surprises if the ID overflows and rolls over) and either use offset queries for pagination or a range query on the ids as you propose.