Closed seidewitz closed 1 year ago
Unfortunately, the nameUUIDFromBytes() method you used generates only version 3 UUIDs; it cannot specify a namespace and it uses the problematic MD5 hashing algorithm. We should use v5 and generate a namespace for SysMLv2 UUIDs to avoid conflicts. If you're OK with that, I think I can contribute. Cf. https://www.rfc-editor.org/rfc/rfc4122.html#section-4.3
@himi I was, in fact, planning on writing a note about version 3 vs. version 5 UUIDs this morning.
I had hoped to use Version 5, but I did quite a bit of looking for an accepted Java implementation of version 5 UUIDs and didn't find one. It's really rather surprising that this hasn't been added to the standard Java library. There is an Apache Commons implementation, but it is in the Sandbox and requires building from source (and the SVN repository link from their sandbox page is broken, so it is even hard to find the repository!). Maybe https://github.com/f4b6a3/uuid-creator would be another possibility.
I would rather not try to implement and test our own implementation of version 5 UUIDs in the short time that we have. Since the IDs we generate will become normative, I want to make sure we have a solid, accepted implementation for generating them. If you can find something that already exists like that, which we can slip in easily to the implementation, I would be willing to consider it.
As to nameUUIDFromBytes, its implementation of version 3 name-based UUIDs is really identical to version 5, except for the use of MD5 instead of SHA-1. You can add the namespace bytes on to the beginning of the bytes passed to the method, per the UUID specification (this is done in ElementUtil::constructNameUUID). And, for our purposes, I don't think it is likely that we will have problems generating unique UUIDs across the named elements of our libraries using MD5.
Still, version 5 and SHA-1 would be better (particularly since ITU-T Rec. 667, which is our normative reference, actually says in subclause 14.2 that "MD5 shall not be used for newly generated UUIDs"), so perhaps we can come up with a solution.
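To illustrate the point about prepending the namespace bytes, here is a minimal sketch of a namespaced version 3 UUID built on top of nameUUIDFromBytes. The class and method names are hypothetical and this is not the actual code of ElementUtil::constructNameUUID; the namespace value is simply the one that appears in the test output later in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class NamespacedV3Sketch {

    /** Version 3 (MD5) name-based UUID: hash of the 16 namespace bytes followed by the name bytes. */
    static UUID nameUuidV3(UUID namespace, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        // ByteBuffer is big-endian by default, i.e. network byte order.
        ByteBuffer buf = ByteBuffer.allocate(16 + nameBytes.length);
        buf.putLong(namespace.getMostSignificantBits());
        buf.putLong(namespace.getLeastSignificantBits());
        buf.put(nameBytes);
        // nameUUIDFromBytes applies MD5 and sets the version (3) and variant bits.
        return UUID.nameUUIDFromBytes(buf.array());
    }

    public static void main(String[] args) {
        UUID namespace = UUID.fromString("48875c49-9ef6-46bc-91a0-2513702fccb0");
        UUID id = nameUuidV3(namespace, "abcd");
        System.out.println(id + " (version " + id.version() + ")");
    }
}
```

The result is always reported as version 3, since nameUUIDFromBytes fixes both the hash algorithm and the version bits.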
Oh, I did not realize that a namespace UUID works with nameUUIDFromBytes() and that it is compatible with the standard UUID namespaces. That reduces the risk of conflicts. I think that's sufficient.
If you want to use SHA-1, it's easy to use MessageDigest.getInstance("SHA-1"). I could find an article on it.
https://stackoverflow.com/questions/29059530/is-there-any-way-to-generate-the-same-uuid-from-a-string
P.S. You are right. According to the RFC, the algorithm just concatenates the namespace UUID and the name in network byte order: "Compute the hash of the name space ID concatenated with the name."
Another point to consider is that this update only generates deterministic UUIDs for elements with non-null qualifiedNames. It would certainly be possible to come up with an approach for generating UUIDs for other elements, but they all seem to have problems.
For example, one could generate name-based UUIDs for owned relationships by, say, using the owning element as the namespace and their order index as the name. However, this would mean that reordering the relationships, or even inserting a new one in the middle of the list (e.g., inserting a new member in a namespace other than at the end), would cause previously generated UUIDs to be reallocated to different relationships. This would be confusing, but it would be very hard to avoid when maintaining the library models in textual form in the future.
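The reordering problem can be made concrete with a small sketch. It assumes a hypothetical scheme in which the owner's UUID is the namespace and the decimal order index is the name; the owner UUID and member names below are stand-ins, not values from the library models.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical scheme, for illustration only.
class IndexBasedIdSketch {

    /** Name-based UUID using the owner's UUID as namespace and the order index as name. */
    static UUID indexUuid(UUID owner, int index) {
        byte[] name = Integer.toString(index).getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(16 + name.length);
        buf.putLong(owner.getMostSignificantBits());
        buf.putLong(owner.getLeastSignificantBits());
        buf.put(name);
        return UUID.nameUUIDFromBytes(buf.array());
    }

    public static void main(String[] args) {
        UUID owner = UUID.fromString("48875c49-9ef6-46bc-91a0-2513702fccb0"); // stand-in owner id
        List<String> members = new ArrayList<>(List.of("a", "b"));
        UUID oldIdOfB = indexUuid(owner, members.indexOf("b"));

        members.add(1, "x"); // insert a new member in the middle

        // The new member inherits the id that "b" used to have ...
        System.out.println(oldIdOfB.equals(indexUuid(owner, members.indexOf("x")))); // true
        // ... and "b" itself now gets a different id.
        System.out.println(oldIdOfB.equals(indexUuid(owner, members.indexOf("b")))); // false
    }
}
```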
One could also base the UUID of a relationship on the kind of the relationship and the UUIDs of its related elements, but this would only work easily if it could be assured that each instance of a relationship of each kind had a different set of related elements. This can be ensured for, say, ownership relationships, but not in general.
In any case, for standard library models, one would expect that it will be the named-elements that will be referenced from other models, not the unnamed elements. Since the names for these elements are normative, it is important that their UUIDs are normative, too, for interchange and persistence of models in representations other than the textual notation. So it is necessary to do at least this.
> If you want to use SHA-1, it's easy to use MessageDigest.getInstance("SHA-1"). I could find an article on it.
I am not comfortable with just copying code from Stack Overflow. However, the article you referenced also links to the uuid-creator that I mentioned in my previous comment. I might be willing to use that (and it can be used with a Maven dependency), if it is really worth it. (One would have hoped that the Java library UUID class would be written to make it easy to extend and override nameUUIDFromBytes to change the use of MD5 to SHA-1, but it's not. Humph.)
For me, it's quite simple to do so without copying the code from the article. The core logic is just 5 lines of code.
MessageDigest hasher = MessageDigest.getInstance("SHA-1");
hasher.update(name.getBytes(StandardCharsets.UTF_8));
ByteBuffer hash = ByteBuffer.wrap(hasher.digest());
final long msb = (hash.getLong() & 0xffffffffffff0fffL) | ((version & 0x0f) << 12);
final long lsb = (hash.getLong() & 0x3fffffffffffffffL) | 0x8000000000000000L;
Hash the input, take the leading 128 bits, and apply the fixed bit masks for the version and variant.
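For reference, the five lines above can be fleshed out into a self-contained sketch. The class and method names are hypothetical, and this is not the UUIDDigest class added later in the thread. Unlike the excerpt, it feeds the 16 namespace bytes into the digest before the name, as the RFC describes, and it hard-codes version 5.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class NameUuidV5Sketch {

    /** Version 5 (SHA-1) name-based UUID for the given namespace and name. */
    static UUID nameUuidV5(UUID namespace, String name) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        // Namespace first, as 16 bytes in network (big-endian) byte order, then the name.
        sha1.update(ByteBuffer.allocate(16)
                .putLong(namespace.getMostSignificantBits())
                .putLong(namespace.getLeastSignificantBits())
                .array());
        sha1.update(name.getBytes(StandardCharsets.UTF_8));

        // SHA-1 yields 20 bytes; only the leading 16 are used.
        ByteBuffer hash = ByteBuffer.wrap(sha1.digest());
        long msb = (hash.getLong() & 0xffffffffffff0fffL) | (5L << 12);          // version = 5
        long lsb = (hash.getLong() & 0x3fffffffffffffffL) | 0x8000000000000000L; // variant = 10
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID namespace = UUID.fromString("48875c49-9ef6-46bc-91a0-2513702fccb0");
        UUID id = nameUuidV5(namespace, "abcd");
        System.out.println(id + " (version " + id.version() + ")");
    }
}
```

The UUID(msb, lsb) constructor stores the bits as given, so the version and variant have to be set by hand here.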
Oh, I found that nameUUIDFromBytes() is not fully compatible with v5, because the generated UUID version is still 3.
> Oh, I found that nameUUIDFromBytes() is not fully compatible with v5, because the generated UUID version is still 3.
Well, yes, of course it sets the version to 3, because it is using MD5. But the UUID(msb, lsb) constructor doesn't.
The question is, how confident are we that the five lines of code you quote are correct? They are different from how it is implemented in nameUUIDFromBytes, so I want to look at it more closely.
This article checked the consistency with nameUUIDFromBytes() and the algorithm is simple and transparent. But yes, before using it I would like to check the results.
> This article checked the consistency with nameUUIDFromBytes() and the algorithm is simple and transparent. But yes, before using it I would like to check the results.
If you want to give it a try, you should be able to just update the implementation of ElementUtil.constructNameUUID. But we need to get it done in the next couple of days.
OK. I'll work on it at midnight tonight ;). The implementation will provide options to choose the version and the algorithm, as well as the original nameUUIDFromBytes().
I added UUIDDigest and made constructNameUUID() use it. I tested it by checking its consistency with nameUUIDFromBytes().
Namespace UUID: 48875c49-9ef6-46bc-91a0-2513702fccb0
nameUUIDFromBytes: abcd -> f2c5d3ea-fe44-3cc2-9664-95be2dc85b7e
MD5 V3: abcd -> f2c5d3ea-fe44-3cc2-9664-95be2dc85b7e
SHA1 V3: abcd -> 75083d56-21ff-3a8b-a653-d342e451050a
SHA1 V5: abcd -> 75083d56-21ff-5a8b-a653-d342e451050a
nameUUIDFromBytes: lmda384d -> 2350a46f-f420-3d81-b617-979e14a1c9dc
MD5 V3: lmda384d -> 2350a46f-f420-3d81-b617-979e14a1c9dc
SHA1 V3: lmda384d -> 7d0611f4-869a-38eb-bbc0-6af92ac9268d
SHA1 V5: lmda384d -> 7d0611f4-869a-58eb-bbc0-6af92ac9268d
nameUUIDFromBytes: あいうえ -> 79ee9435-68b6-3a21-91ed-0670b2367d85
MD5 V3: あいうえ -> 79ee9435-68b6-3a21-91ed-0670b2367d85
SHA1 V3: あいうえ -> cdd1a52f-449c-3995-bcbf-ef9ad429f187
SHA1 V5: あいうえ -> cdd1a52f-449c-5995-bcbf-ef9ad429f187
I also checked the SHA1 V5 results against this website: https://www.uuidtools.com/v5. I confirmed that they match exactly.
@himi This looks very nice and seems to work well. I regenerated XMI for all the sysml.library models and get v5 UUIDs for all the named elements - which will now become normative.
Thanks!
This PR revises ElementImpl::getElementId and LibraryPackage::getElementId so that the UUIDs for standard library elements with a non-null qualifiedName are generated deterministically as name-based UUIDs. Each such UUID is constructed from a name space identifier and a name (as defined in the UUID specification), which are determined as follows:
For the top-level standard library package:
- The name space identifier is the NameSpace_URL UUID given in the UUID specification (which is 6ba7b812-9dad-11d1-80b4-00c04fd430c8).
- The name is constructed by prepending https://www.omg.org/KerML/ (or https://www.omg.org/SysML/ for SysML packages) to the name of the package, converted to bytes using a UTF-8 encoding.
For any element directly or indirectly contained in the top-level standard library package (for which that package will be the libraryNamespace):
- The name is the qualifiedName of the element, converted to bytes using a UTF-8 encoding.
Other changes:
- Corrected the derivation of Element::qualifiedName so that it is always null if the Element name is null.
- Corrected the implementation of Element::escapedName so that it uses the effective shortName (not declaredShortName) if the name is null.
- Added a -o option for an output directory to org.omg.kerml.xtext.util.KerML2XMI (and hence to org.omg.sysml.xtext.util.SysML2XMI, too).
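The two-level naming scheme described above can be sketched as follows. This is a rough illustration, not the code in ElementImpl or LibraryPackage. It assumes that version 5 (SHA-1) UUIDs are used and, as a labeled assumption that the description above does not spell out, that the UUID of the top-level package serves as the name space identifier for its contained elements. All class and method names are hypothetical, and the package and element names in main are only sample inputs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class LibraryIdSketch {

    // NameSpace_URL from the UUID specification.
    static final UUID NAMESPACE_URL = UUID.fromString("6ba7b812-9dad-11d1-80b4-00c04fd430c8");

    /** Version 5 (SHA-1) name-based UUID; the name is encoded as UTF-8. */
    static UUID nameUuidV5(UUID namespace, String name) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(ByteBuffer.allocate(16)
                    .putLong(namespace.getMostSignificantBits())
                    .putLong(namespace.getLeastSignificantBits())
                    .array());
            sha1.update(name.getBytes(StandardCharsets.UTF_8));
            ByteBuffer hash = ByteBuffer.wrap(sha1.digest());
            long msb = (hash.getLong() & 0xffffffffffff0fffL) | (5L << 12);
            long lsb = (hash.getLong() & 0x3fffffffffffffffL) | 0x8000000000000000L;
            return new UUID(msb, lsb);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Top-level package: namespace is NameSpace_URL, name is base URL + package name. */
    static UUID packageId(String baseUrl, String packageName) {
        return nameUuidV5(NAMESPACE_URL, baseUrl + packageName);
    }

    /** Contained element: ASSUMES the package UUID is the namespace; name is the qualifiedName. */
    static UUID elementId(UUID libraryPackageId, String qualifiedName) {
        return nameUuidV5(libraryPackageId, qualifiedName);
    }

    public static void main(String[] args) {
        UUID pkg = packageId("https://www.omg.org/KerML/", "Base");  // sample package name
        System.out.println(pkg);
        System.out.println(elementId(pkg, "Base::Anything"));         // sample qualified name
    }
}
```

Because both levels are pure functions of normative names, regenerating the XMI would reproduce the same identifiers on every run.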