Open robhogan opened 4 years ago
If indeed this package has no active maintainers, I'd like to volunteer to pick it up and restore the original short-and-unique functionality. There's a place for both this package and nanoid
since they serve different purposes.
Finally, I found an expert to talk about the unique ID topic! I like the topic, but it is hard to find a person to talk about my passion :).
TLDR; old shortid didn’t use a unique machine ID. As a result, old way had a bigger collision probability that the current one.
Yeap, you are right, we moved from mixing environment data to using hardware randomness. But we made it for the reason.
Uniqueness guarantee is a very complex problem. And it became even more complex when we start to generate random IDs across multiple computers in the cluster (at the same time). The problematic was studied very well in different UUID versions.
shortid
used the mix of “UUID v1” and “UUID v4” ways. It uses a few unique components like cluster-ID, time and hardware random bytes. But this way created a few big problems and we had a multiple issues in this repo:
shortid
on many machines at the same time you will get higher collision probability.In that commit, I removed the old bytes packing system to:
Yeap, I thought about the major releases. But here are my reasons why I didn’t do it:
shortid
are dead right now. Nobody will update dependencies.shortid
“public API” was to give random IDs. Implementation can be threaded as a private ID.Yeap, we can do it more properly. Let’s discussed after the collision topic.
I believe that the whole way of mixing “UUID v1” and “UUID v4” was wrong. “UUID v1” is very very hard to implement.
Right now “UUID v4” is a much safer way since it is easy to implement and using >= 122 bits of hardware random generator gives extremely low collision probability:
For there to be a one in a billion chance of duplication, 103 trillion version 4 IDs must be generated.
UUID v4 is an industry-standard right now. It is safe to use.
I made a new library based on good “UUID v4” way: Nano ID. It is faster, smaller and has less collision probability than old and new shortid
versions.
Thanks for the detailed response, appreciate it. It sounds like your main complaint about shortid @ 2.2.8 was that it wasn't clear that "worker" ID should be set differently on each process, and that in some environments (browser) it's barely even possible, so users would encounter collisions if they were trying to use it as a replacement for UUIDv4.
However, assuming a user did read the docs, did know what they were doing and did set worker ID correctly (or only ever had one process generating IDs), then shortid was cast-iron guaranteed unique within the scope of your cluster. I personally ran a 5-node backend service with it for years, and I never had to guess what my demand would be, never had to customise ID length, never had to check for collisions, and could've gone for years more without thinking about it, at least assuming I never updated to 2.2.9.
There's also the space efficiency consideration - for a random ID generator to be useful it has to have a very sparsely populated range of potential values, so that a new random ID has a very low probability of collision. UUIDv4 has a colossal range and no two will ever be generated the same, but it has to be very long to achieve that. It's entirely dependent on the size of my record set vs the size of my id space.
If my IDs are based on worker + timestamp + counter, then it doesn't matter how many records I have already. Every new second of time guarantees a big empty chunk of id space, and we know how to fill it efficiently. I could generate a shortid every 1ms across half a dozen workers from now for the next 10 years and I'd never need more than 7 characters, and there'd be zero probability of collision. My id space is relatively efficiently filled, because I'm filling it with an algorithm and I know where the gaps are.
So in the case of ID generation at a fairly constant rate over a very long time span, shortid
is shorter than nanoid
, and has a lower (zero) collision probability.
Of course UUIDv4 is also practically unique, but because it aims to be globally unique across all instances of everything, it's definitely not short, and this matters when you're conscious of things like index size or, device storage, document size, power drain...
nanoid
is a shorter UUIDv4 with a customisable collision probability. That's great for distributed processes where you either can't assign worker IDs or you have so many of them that encoding their IDs is a problem. It's also good for short-lived applications where bytes used to encode a timestamp are a waste.
shortid
is a shorter UUIDv1 for local uniqueness - mainly backend systems with a known, identifiable number of instances. It does/did something that can't be done with random ids.
Consider that shortid
describes itself as an alternative to mongodb's ObjectID, and look at how ObjectIDs are generated by the mongo process. It's the same process ID + timestamp + sequence approach, because over a controlled cluster that works extremely well.
I personally ran a 5-node backend service with it for years
How do you set worker ID for now?
On that multi-instance system (on AWS) I think I just used the last part of a private IP address since I had one node process per instance/IP on a small subnet.
On a manually scaled system it's obviously fairly easy to set an ID as an environment variable or similar as part of provisioning.
On an auto scaling GCP system I'm working on now I could probably set an integer instance ID as instance metadata, and when a new instance boots that instance could allocate itself the lowest unused integer (the cluster has access to cluster metadata via an http API).
Lots of ways to tackle that problem - more of a devops thing really. The point is that if you can determine an ID per worker, this library would give you unique IDs.
Hm. The current API doesn’t explain how cluster-ID is important. We can revert random-based generation and add new API which will be clear about node ID:
let shortid = require('shortid')
shortid.generate()
//=> Error, You didn’t set a unique worker ID. Read github.com/dylang/shortid#worker-id show to set it properly
shortid.uniqueWorkerId(getIPAddress())
shortid.generate() //=> "PPBqWA9"
uniqueWorkerId()
was already there, it's just called worker()
.
I guess it you could improve the function naming, but it's all in the readme.
Yeap, I forget to explain my reasons behind this idea:
shortid.uniqueWorkerId()
will explain that ID should be unique. It stops users from putting there 0
just to avoid an error message. But maybe we can keep worker()
and error with a link to docs will be enough.I don't really think that's necessary but it is subjective, if you want to make it idiot-proof at the expense of simplicity in the majority case I guess that's your call.
What's not subjective at all is the fact that https://github.com/dylang/shortid/commit/0e9c56059b6ff31e016c78ef13d647080a035fb4 should not have been approved as a patch release, or at all frankly - that was obviously a mistake. Hopefully it hasn't caused a massive problem to too many people.
What's not subjective at all is the fact that 0e9c560 should not have been approved as a patch release, or at all frankly - that was obviously a mistake
It is much more complicated.
2.2.9 worse for people with the correct worker ID. But most of the users use shortid
in a browser or do not set correct worker ID.
Users who don’t set worker ID in 2.2.9 have lower collision probability. But I agree that people who set worker ID now have bigger collision probability (but now very big). Because more people use shortid
wrong, 2.2.9 made world better.
This is why we can’t just revert 2.2.8. It will make better for you, but worse for most of the users.
This is why I suggest a better way. Change API to force everyone to set correct worker ID and revert old method.
If you think about how this is used, anyone who has been storing ids with >=2.2.9 now has random ids in their database. There’s no way to revert things so that we won’t collide with those random ids - the id space is of all current users of this package is permanently polluted. I’m not sure where you go from here.
Why we can’t change version
here? https://github.com/dylang/shortid/blob/master/lib/build.js#L13
Because we don’t encode the version any more - we now set the first character completely randomly because generate
doesn’t respect the data passed to it. All possible versions are therefore taken.
Hm, yeap, you are right. We uses version
to increase ID size.
You can only really publish a new major version 3, which restores 2.2.8 behaviour, and explain that v3 ids may collide with v2 ids, because unfortunately v2 ids were non-uniquely generated contrary to the docs.
There’s nothing that can be done for people who relied on the package behaving as documented after 2.2.9, unfortunately. They may now have an ID migration to have to go through.
We can’t just restore 2.2.8 without solving the problem of forgetting a set worker ID. But we can do both: restore 2.2.8 and fix API to force everyone to set worker ID.
You can make the API whatever you like in v3, obviously. The behaviour of 2.2.8 is the important part - ie zero-collision guarantee.
You mentioned that you want to become a maintainer. I can help you with 3.0 with reviewing PRs and then I will give you access.
But it is important for me to find a solution for everyone, not only for people who set worker ID correctly, since 2.2.9 was released to fix problem of worker ID bad practices.
Forgetting about v3 for the moment, I do really think you should make it very clear (shout it, big letters!) in the readme that 2.2.9 and onwards do not offer uniqueness, to the extent of publishing a warning/deprecation notice to npm. The readme at the moment is considerably out of step with what this package now does.
I don't mean to be dramatic but I really feel sorry for people who have used this package long-term thinking it was a good choice for time-based unique IDs - now assuming they've been patching their dependencies they have unknowingly introduced an insidious bug - an increasing possibility of collisions - and no way to fix/avoid collisions with the random IDs they have already accidentally generated.
In the short term I'll maintain a fork, https://www.npmjs.com/package/@rh389/shortid, which will always guarantee uniqueness.
CC @dylang, if he's still out there!
Sure, send PR to docs
literally went here because of this example of "trusting docs without checking": https://www.youtube.com/watch?v=SLpUKAGnm-g
i was curious and ended up here.
docs should probably point out that id's can in fact collide and uniquenes should always be checked.
Hi,
From https://github.com/dylang/shortid/blob/master/README.md#shortid---
I've been trying to figure out how
shortid
guarantees uniqueness, and I've come to the conclusion that in fact it doesn't at all (any more).The way this module worked as at https://github.com/dylang/shortid/tree/2.2.8/lib is that it encoded a number of variables (version, worker, seconds offset, counter), mixing in some randomness so that the ID wasn't guessable - however the original data was still hidden between those random bits, and it could be decoded by bit-masking. Given different input the ID was therefore guaranteed to be different (at least subject to https://github.com/dylang/shortid/issues/105).
Since https://github.com/dylang/shortid/commit/0e9c56059b6ff31e016c78ef13d647080a035fb4 that's all broken. The number to be "encoded" now only influences the string length, but doesn't actually make it into the string at all https://github.com/dylang/shortid/blob/14039444de4c920cd06bc242a7bc62d69804aaff/lib/generate.js#L14 The string is actually completely random, and has at least as much chance of collision as any random string of its length and alphabet. None of version, worker ID, or time have any impact whatsoever on what is generated.
Please tell me I'm wrong!
Edit: I notice @ai has been closing issues with the assertion that this package is deprecated, but in anticipation of that I'll just note -
nanoid
, though it seems good at what it does, is not a like-for-like replacement - that's a random generator, not a unique generator.It seems to me like when maintenance changed hands after 2.2.8 there might've been an honest misunderstanding about what this package did / was meant to do / how it worked.