Kotlin / KEEP

Kotlin Evolution and Enhancement Process
Apache License 2.0
3.42k stars 362 forks source link

Uuid #382

Open qurbonzoda opened 3 months ago

qurbonzoda commented 3 months ago

This issue is for discussion of the proposed Uuid API. The full text of the proposal is here.

The API is already available in Kotlin 2.0.20-Beta2 and later, and is marked with @ExperimentalStdlibApi.

axelfontaine commented 3 months ago

As Uuid really is a base building block of data models, the package dependencies really matter. For the JVM, instead of having uuid depend on nio, it should really be the other way around!

joffrey-bion commented 3 months ago

I do like that, by default, Uuid.toString() and Uuid.parse() rely on the RFC hex-and-dash format.

However, toString is initially meant for human readability. From the Javadoc of Object.toString(): "The result should be a concise but informative representation that is easy for a person to read". For this reason, I generally don't trust toString() as a reliable and stable way to serialize data (for any class), because it's not part of its contract.

I would find it nice to have Uuid.toRfc9562() or Uuid.toRfc9562String() in the API for the purpose of actually serializing a UUID for program consumption (and not human). toString() would of course use that behind the scenes because the RFC format is also nice for humans. As a side note: we do have Duration.toIsoString() for Kotlin durations. (Admittedly, durations have a different "natural" representation for humans, which means toString() was already "taken", but it kinda proves the point).

lukellmann commented 3 months ago

As Uuid really is a base building block of data models, the package dependencies really matter. For the JVM, instead of having uuid depend on nio, it should really be the other way around!

If I understood this correctly, only the convenience extension functions for reading from / writing to ByteBuffers depend on NIO, not the Uuid class itself.

joffrey-bion commented 3 months ago

@axelfontaine also, what exactly do you mean by "package dependencies"? It's not like there is a maven dependency (nio is in the Java runtime). Are you talking about Java 9 modules and the kotlin.uuid package requiring things? What exact problem are you worried about?

adam-enko commented 3 months ago

I would prefer the name to be UUID over Uuid.

It is consistent with existing names Base64.UrlSafe and Base64.Mime names.

Mime and UUID are both abbreviations, but the former is an acronym, and the later is an initialism.

I can see how UrlSafe could be a counterpoint (should it be URLSafe?), but it follows the official Kotlin naming conventions.

When using an acronym as part of a declaration name, capitalize it if it consists of two letters (IOStream); capitalize only the first letter if it is longer (XmlFormatter, HttpInputStream).

The visual difference between UUID/Uuid is confusing when looking at the Kotlin/Java converter helper:

public fun Uuid.toJavaUuid(): java.util.UUID

Shouldn't the extension fun Uuid.toJavaUuid() instead be named fun Uuid.toJavaUUID(), to match the Java naming scheme?

  • It helps disambiguate whether the Java or Kotlin UUID type is being used without looking at imports.

This same confusion happens with other types (I've encountered this Random/Random). But those cases are exceedingly rare, and Kotlin provides tools to easily deal with them (named imports, or typealiases, or just using the FQN.)

Given that the confusion only happens in Kotlin/JVM, should other Kotlin targets have an unusual UUID name because of some other Kotlin target?

Maybe the confusion be helped by introducing official type aliases?

// kotlin-stdlib/src/jvmMain/kotlin/uuid/uuid-helpers.kt

typealias UuidKt = kotlin.uuid.UUID
typealias UuidJdk = java.util.UUID
JakeWharton commented 3 months ago

Mime and UUID are both abbreviations, but the former is an acronym, and the later is an initialism.

  • 'Mime' is an acronym, and is pronounced as a single word. JSON is another acronym.

  • 'UUID' is an initialism, because each letter is pronounced individually. So is URL. (Although maybe I'm wrong and people do say 'oooo-eye-dee' and 'uhrrrll'(!)).

I am definitely opposed to using pronunciation as a guiding principle on casing for a few reasons:

  1. Initialisms are a subset of acronyms and not disjoint, so by eliminating their special casing (heh) we eliminate an extra rule to remember.
  2. Many acronyms do not have universal agreement or canonical definition as whether they're initialisms or not (e.g., SQL is "S-Q-L" and "Sequel"). This also can vary as to whether English is your first language or not.
  3. It is impossible to tool this differentiation for static analysis except by continually chasing an exhaustive list of initialisms.
  4. Sibling and related acronyms may disagree on initialism, and thus would require different naming conventions (e.g., GUID is typically "goo-id" and so by your rules would be required to be named Guid despite both being rooted in "unique identifier").
  5. Adjacent initialisms in a name offer no visual delineation between them creating confusion for those unfamiliar with one or both terms. Casing them like words eliminates this problem.
OliverO2 commented 3 months ago

I suggest to replace

This would result in

Consistency example:

val uuid2 = Uuid.fromULongs(0x550E8400E29B41D4uL, 0xA716446655440000uL)
val uuid3 = Uuid.fromHexDashString("550e8400-e29b-41d4-a716-446655440000")

I'd also support replacing

for the variant guaranteeing a specific format.

(Uuid.toString can still have the same implementation, just without the stability guarantees for machine readability.)

joffrey-bion commented 3 months ago

Admittedly I prefer toHexDashString() over my initial toRfc9562String() proposal 😆

Amejonah1200 commented 3 months ago

I'm against using from for parsing strings, because UUIDs are 126bit long numbers which have a representation as a string (serialized as) which can be parsed later on.

lppedd commented 3 months ago

I was looking at the compiled JS code, and I think using Long is a pretty bad choice for the JS target, especially when performing bit operations. This is the code for fromByteArray, for example:

fromByteArray_d3r0u1_k$(byteArray) {
  if (!(byteArray.length === 16)) {
    var message = 'Expected exactly 16 bytes';
    throw IllegalArgumentException.new_kotlin_IllegalArgumentException_f8t9r5_k$(toString_1(message));
  }
  return this.fromLongs_uu1aj7_k$(toLong_1(byteArray, 0), toLong_1(byteArray, 8));
}

The toLong_1(byteArray, index) function, which corresponds to:

private fun ByteArray.toLong(startIndex: Int): Long {
    return ((this[startIndex + 0].toLong() and 0xFF) shl 56) or
            ((this[startIndex + 1].toLong() and 0xFF) shl 48) or
            ((this[startIndex + 2].toLong() and 0xFF) shl 40) or
            ((this[startIndex + 3].toLong() and 0xFF) shl 32) or
            ((this[startIndex + 4].toLong() and 0xFF) shl 24) or
            ((this[startIndex + 5].toLong() and 0xFF) shl 16) or
            ((this[startIndex + 6].toLong() and 0xFF) shl 8) or
            (this[startIndex + 7].toLong() and 0xFF)
}

will create more than 50 Long instances to perform those bit operations.
Not to mention the amount of function calls and those 50+ undefined checks because of the lazy initialization of the companion object and its properties.

bcmedeiros commented 3 months ago

Curiosity question, is there anything preventing Uuid to be a value class?

CLOVIS-AI commented 3 months ago

Curiosity question, is there anything preventing Uuid to be a value class?

Currently, value classes can only have a single field (see #340). Based on, what would its field be?

Also, with a value class, the representation must be exactly the same on all platforms and is part of the public API, whereas with regular classes, it can be kept as an implementation detail of each platform, and change if performance problems are found.

JakeWharton commented 3 months ago

Using an array of longs will have higher overhead than a regular class with two fields

A small nit: on the JVM, at least, this is not true. Let's assume a 64-bit JVM running with compressed OOPs (the default).

(I included a byte array version because you wouldn't want to use a long array in general due to its abysmal representation on JS)

So really no matter what you do you're allocating 32 bytes.

A value class, however, will require boxing at certain points (such as when used in a generic context) which will be a 16-byte allocation each time (12-byte header + 4-byte object reference).

You can validate all of this yourself using JOL.

Although having written all this maybe you were referring to runtime bounds checking on the array? There's usually ways to convince the JVM compilers and ART on Android to eliminate/minimize them, but for that I'd need to break out JMH and other tools. I don't think a value class is really being considered here, so I'm not going to bother right now.

CLOVIS-AI commented 3 months ago

I did mean runtime memory usage, but I forgot about padding, so you're completely right that it doesn't make a difference here. Thanks for reminding me!

bcmedeiros commented 3 months ago

I don't think a value class is really being considered here, so I'm not going to bother right now.

I wonder why that is, besides having to settle on a common backing field for all platforms, which might take some effort, a value class makes a lot of sense to me.

JakeWharton commented 3 months ago

I think the backing property type is the largest part of that discussion. It would probably mean this waiting for ByteString to move from kotlinx.io to the stdlib and to be stable itself. But ultimately you likely wind up with an object whose memory footprint matches that of Uuid being a normal class with 16 bytes of data in some storage format. And having a full class means you never have deal with the boxing problem. If Kotlin had 128-bit primitive that had efficient representation on every platform I think the argument would be much more strong.

qurbonzoda commented 3 months ago

Shouldn't the extension fun Uuid.toJavaUuid() instead be named fun Uuid.toJavaUUID(), to match the Java naming scheme?

This does make sense to me, given that the emphasis is on conversion to the specific type. When considering the conversion to Java's representation of the Universally Unique Identifier (UUID), the name toJavaUuid() works as well. However, I think it is better to rename it to toJavaUUID().

bcmedeiros commented 3 months ago

Sorry I got a little bit late to this KEEP discussion, but another thing I just raised in https://youtrack.jetbrains.com/issue/KT-31880#focus=Comments-27-10312473.0-0 is the name choice for the random() function.

I found out about this new type looking for a UUID v7 Kotlin implementation, actually, and I'm willing to eventually even contribute with some code as I had to roll my own too many times already. If/when we do have a v7 generator function, how are we calling it? Why don't we just use the vX() notation as the spec has very clearly defined versions?

The KEEP mentions:

The popularity of each UUID version was also explored. Our findings indicate that in approximately 90 percent of cases users generate a UUID version 4 (random). Excluding time-based UUIDs, this figure rises to over 97 percent.

While this is true, I think we are getting things backwards here. In my experience with UUIDs, I think the lack of a standard trustworthy way of generating other versions of UUIDs other than v4 is the reason many people end up using v4 more than they "should".

For databases, for example, the impact of using v4 instead of v7 on index re-balancing is at least non-negligible (quick example, there are many more).

Not saying v7 or any other version should be a blocker for a initial release, but I think we should aim for excellence in the standard library (as the team has been doing, I think the Kotlin standard library is great) and that would include 2 things in regards to UUIDs:

qurbonzoda commented 3 months ago

From the Javadoc of Object.toString(): "The result should be a concise but informative representation that is easy for a person to read". For this reason, I generally don't trust toString() as a reliable and stable way to serialize data (for any class), because it's not part of its contract.

I would find it nice to have Uuid.toRfc9562() or Uuid.toRfc9562String() in the API for the purpose of actually serializing a UUID for program consumption (and not human). toString() would of course use that behind the scenes because the RFC format is also nice for humans. As a side note: we do have Duration.toIsoString() for Kotlin durations. (Admittedly, durations have a different "natural" representation for humans, which means toString() was already "taken", but it kinda proves the point).

For Uuid.toString(), the documentation explicitly specifies that the returned string is in the standard hex-and-dash format. Thus it becomes part of its contract. I am not sure if adding a new function that duplicates toString() functionality won't cause confusion, both in writing and reviewing code. Theoretically, it might cause development teams to have a guideline about whether to use Uuid.toHexDashString() or Uuid.toString() in their projects, which adds an additional complexity to development process.

qurbonzoda commented 3 months ago

If/when we do have a v7 generator function, how are we calling it? Why don't we just use the vX() notation as the spec has very clearly defined versions?

I understand the concern about adopting a consistent naming convention for the future generator functions. It's important to recognize, however, that not every developer using UUID has read the RFC or understands how each version differs from the others. Since v4 is the most popular UUID, it was decided to generate it with a name that is familiar to users.

While this is true, I think we are getting things backwards here. In my experience with UUIDs, I think the lack of a standard trustworthy way of generating other versions of UUIDs other than v4 is the reason many people end up using v4 more than they "should".

For databases, for example, the impact of using v4 instead of v7 on index re-balancing is at least non-negligible (quick example, there are many more).

Python, Go, and Rust can generate v1-v5 UUIDs, and we have observed a similar trend regarding UUID version popularity in these languages as well.

While I share your enthusiasm for v7 and believe it will gain wider adoption, as described in the KEEP, we see technical limitations in correctly generating it within the stdlib. I believe that database systems and frameworks are the appropriate places for generating v7, as they can handle clock rollbacks and guarantee the monotonicity and consistency of generated UUIDs within the system.

Could you please elaborate on how you would handle clock rollbacks and system restarts within the stdlib ? Or do you think ensuring monotonicity within a single program session would be enough?

OliverO2 commented 3 months ago

For Uuid.toString(), the documentation explicitly specifies that the returned string is in the standard hex-and-dash format.

In the KEEP, it does not specify in-place what "standard" means, but just: "Returns the standard string representation of this uuid."

Thus it becomes part of its contract.

This contract would deviate from learned expectations. It seems unusual to use toString() for what's basically a form of serialization.

I am not sure if adding a new function that duplicates toString() functionality won't cause confusion, both in writing and reviewing code.

Given:

Why would there be a higher risk compared to the level of possible confusion we could always presume?

I'd expect even more confusion with toString() as we might see code using "$uuid".

qurbonzoda commented 3 months ago

Please find the full documentation for Uuid on GitHub. The full documentation was not included in the KEEP document to keep it concise.

bcmedeiros commented 3 months ago

Python, Go, and Rust can generate v1-v5 UUIDs, and we have observed a similar trend regarding UUID version popularity in these languages as well.

I'm not trying to convince us to implement v7 right now, I'd like that but I understand we shouldn't spend time now if there's not enough demand. I just want to make sure we use an extensible and coherent naming convention and those links you posted are just another argument against random().

Python: uuid.uuid4() Rust: Uuid::new_v4()

Go created a NewRandom(), but then used NewV7() for the v7 which is inconsistent. I guess they wanted to avoid deprecation, which is exactly what I'm trying to avoid here by not picking random() to begin with.

If I recall correctly, JS and PostgreSQL also have v4 in their method names.

I understand the concern about adopting a consistent naming convention for the future generator functions. It's important to recognize, however, that not every developer using UUID has read the RFC or understands how each version differs from the others. Since v4 is the most popular UUID, it was decided to generate it with a name that is familiar to users.

Allow me to politely disagree on this, @qurbonzoda. If a developer decides to use UUIDs without basic knowledge about what that type is, there's very little we can do for them. Also, if anything, that's an argument against random(), as this may trick users into thinking that they know what they are doing instead of pushing them towards the documentation.

If you by any chance don't know how to generate a UUID and you do Uuid.[Ctrl + Space], you see a newV4() or anything similar, you can just read the docs or even do a quick google search and figure that out. That is better than seeing random() (v4) and timeWithRandom() (v7) and having no idea what they are.

While I share your enthusiasm for v7 and believe it will gain wider adoption, as described in the KEEP, we see technical limitations in correctly generating it within the stdlib. I believe that database systems and frameworks are the appropriate places for generating v7, as they can handle clock rollbacks and guarantee the monotonicity and consistency of generated UUIDs within the system.

The entire point of using UUIDs is to avoid (or at least having the option of not) generating them inside the database. I understand there are challenges, but as we just saw in the links above many major languages already provide them.

Could you please elaborate on how you would handle clock rollbacks and system restarts within the stdlib? Or do you think ensuring monotonicity within a single program session would be enough?

In my experience clock rollbacks are not critical because we still have a fair bit of the payload being random. Even without monotonicity and rollback avoidance guarantees, v7 is very useful to avoid major and frequent index re-balances in the database. In the future, we can provide multiple v7 factories that allow some kind of time source to be passed in so the user can sync if they want, but I'd pick having any implementation of v7 over holding on things to have a perfect one.

hfhbd commented 2 months ago

Is there any reason why there are no fun NSUUID.toKotlinUuid(): Uuid = Uuid.parse(UUIDString) and fun Uuid.toNsUUID(): NSUUID = NSUUID(toString()) matching the JVM converter functions on iOS/macOS etc.?

qurbonzoda commented 2 months ago

Is there any reason why there are no fun NSUUID.toKotlinUuid(): Uuid = Uuid.parse(UUIDString) and fun Uuid.toNsUUID(): NSUUID = NSUUID(toString()) matching the JVM converter functions on iOS/macOS etc.?

Currently, the stdlib project does not have separate source sets for different Kotlin/Native targets. This is due to how the Kotlin/Native stdlib is built and published. When it becomes possible to have an Apple-specific source set, we will include such conversion functions as well.

hfhbd commented 2 months ago

I just read the rfc9562 and I found these sections.

Regarding sorting:

UUID formats created by this specification are intended to be lexicographically sortable while in the textual representation.

https://www.rfc-editor.org/rfc/rfc9562.html#section-6.11

In Java/Kotlin, the interface to support sorting is Comparable, so implementing the interface sounds correct to me according to the rfc.

Regarding the function/factory names for different versions: https://www.rfc-editor.org/rfc/rfc9562.html#section-7.1 contains a table with names registered by the IANA, so random() may be the preferred name for Uuidv4, and unixTimebased() for Uuidv7. (On the other hand reorderedGregorianTimebased() for Uuidv6 sounds very "interesting".)

Amejonah1200 commented 2 months ago

I support @bcmedeiros argument of implementing various vX(...) methods (tbh, I would also go with a mix of name and ID of it (ex. v4Random(...) or v7UnixTimeBased(...))) to generate those UUIDs (randomly).

Where I do disagree tho, it is the removal of standard non-parametrized random() function, as it is wildly used as a way of getting a random UUID without much thought what exactly to use. But yes, underneath it could be any of those implementations and may or may not be changed over time.

hfhbd commented 2 months ago

Another thing after I switched a project to kotlin.uuid.Uuid is the parsing function in combination with kotlinx-serialization (here Json). Uuid provides two parsing functions: parse requires a 36 length hex and dash string, while parseHex requires a 32 hex string. The default UuidSerializer of kotlinx-serialization uses the parse method, so you can't use the default serializer and pass a 32 hex string. While both implementations are well documented, including its expected input, I would expect a generic parse method that is able to handle both inputs and specific parseHex, parseHexAndDash functions supporting only a subset. My current workaround is a custom parse function that checks the size first, if the size is 36, it uses the parse function, otherwise the parseHex function.

Also, the KEEP mentions possible bracket parsing support in the future. Do you plan to add another parseX function (including all the combinations of hex, hex and dashes and brackets), a UuidFormat and builder function or "just" one parse function to rule them all?

UnknownJoe796 commented 2 months ago

I don't want to switch to this new implementation of KMP UUID until JS efficiency is addressed. In addition, I second @hfhbd 's recommendation to implement Comparable as per the RFC. I think the convenience outweighs the Java transition risks, personally.

qurbonzoda commented 1 week ago

While both implementations are well documented, including its expected input, I would expect a generic parse method that is able to handle both inputs and specific parseHex, parseHexAndDash functions supporting only a subset. My current workaround is a custom parse function that checks the size first, if the size is 36, it uses the parse function, otherwise the parseHex function. Also, the KEEP mentions possible bracket parsing support in the future. Do you plan to add another parseX function (including all the combinations of hex, hex and dashes and brackets), a UuidFormat and builder function or "just" one parse function to rule them all?

I have added a paragraph about introducing parseHexDash()/toHexDashString() and a making parse() support multiple formats to the KEEP document. Please let me know your thoughts on this approach. We will consider options since the Uuid API is still experimental.

cc: @hfhbd, @OliverO2, @joffrey-bion

qurbonzoda commented 1 week ago

I have also added a paragraph about making Kotlin Uuid a mapped type to the Java UUID in Kotlin/JVM. This would help greatly in integrating with existing APIs that work with Java UUID. However, this approach has its own downsides. Please let me know your thoughts about this.

qurbonzoda commented 3 days ago

@UnknownJoe796, could you please share your use cases where a Comparable Uuid would enable a usage or improve the user experience?

@hfhbd has reported a YouTrack issue because Uuid does not implement Comparable. His use case is related to the Exposed library, which requires db table ids to be Comparable. The Exposed team is working on lifting this restriction, and soon Kotlin Uuid will become a first-class citizen in Exposed.

Anyone's input would be greatly appreciated.

axelfontaine commented 2 days ago

While it may seem counterintuitive for UUIDv4, Comparable is an absolute must for UUIDv7 which is inherently sortable and widely used as a database ID. Even if it won't be produced by Kotlin's UUID at first (a shame, but can be fixed later), it will be read into it from day 1!

qurbonzoda commented 2 days ago

@axelfontaine, it would be very helpful if you could share the use case you have in mind for a Comparable Uuid. How could the sortability of UUIDs in databases be utilized in Kotlin? Why do you think a separate comparator (e.g., Uuid.LEXICAL_ORDER) is not sufficient?

OliverO2 commented 2 days ago

Multiple formats

I have added a paragraph about introducing parseHexDash()/toHexDashString() and a making parse() support multiple formats to the KEEP document.

Quoting from the added paragraph:

The concern with supporting multiple formats in parse() is that supporting an additional format in the future could be a breaking change. This situation occurs when code that relies on the rejection of formats not currently supported is linked with a newer version of the Standard Library that accepts other formats.

If a conversion accepting multiple input formats is intended, I'd suggest to document it as being

Naming

On naming: As noted above, and previously, I'm still concerned about deteriorating API quality in the Kotlin stdlib with each additional introduction of an imperative name like parse for a function returning a value.

Example: Even though generally aware of a string's immutability, I found myself introducing an error being misled into assuming that String.replace would actually do an in-place replacement operation.

We should aim for less of this confusion and poor readability, not more.

Mapped type

I have also added a paragraph about making Kotlin Uuid a mapped type to the Java UUID in Kotlin/JVM.

I consider it a good idea to keep Kotlin Uuid independent (and not a Java-mapped type). Using the Java type would tie Kotlin to a limiting and buggy Java implementation. I'd favor better quality (avoiding footguns) and lower complexity over convenience with limited gains (I'd not expect many use cases requiring Java/Kotlin UUID conversions all over the place).

axelfontaine commented 1 day ago

How could the sortability of UUIDs in databases be utilized in Kotlin?

In-memory test data generation, subject to the same ordering as what a DB would return is an obvious one.

Why do you think a separate comparator (e.g., Uuid.LEXICAL_ORDER) is not sufficient?

A separate Comparator will absolutely work. It's just more boilerplate, not as easily discoverable and error prone (must be passed every single time). A Comparator is primarily useful when multiple sorting orders are common. This is not the case for a UUID. Either one doesn't care, or it's lexical. Nothing else.

@qurbonzoda Why all the push-back? Please help me understand your perspective here as I really don't get it. A UUID is very much a primitive-like low level building block and every single other one in this category (numbers, strings) is Comparable for good reason.