Systems-Modeling / SysML-v2-Pilot-Implementation

Proof-of-concept pilot implementation of the SysML v2 textual notation and visualization
GNU Lesser General Public License v3.0

ST6RI-640 Deterministic UUIDs for standard library elements #457

Closed seidewitz closed 1 year ago

seidewitz commented 1 year ago

This PR revises ElementImpl::getElementId and LibraryPackage::getElementId so that the UUIDs for standard library elements with a non-null qualifiedName are generated deterministically as name-based UUIDs. Each such UUID is constructed from a name space identifier and a name (as defined in the UUID specification), which are determined as follows:

Other changes:

  1. Corrected the derivation of Element::qualifiedName so that it is always null if the Element name is null.

  2. Corrected the implementation of Element::escapedName so that it uses the effective shortName (not declaredShortName) if the name is null.

  3. Added a -o option for an output directory to org.omg.kerml.xtext.util.KerML2XMI (and hence to org.omg.sysml.xtext.util.SysML2XMI, too).

himi commented 1 year ago

Unfortunately, the nameUUIDFromBytes() method you used generates only version 3 UUIDs, does not let you specify a namespace, and uses the problematic MD5 hashing algorithm. We should use v5 and generate a namespace for SysML v2 UUIDs to avoid conflicts. If you're OK with that, I think I can contribute. Cf. https://www.rfc-editor.org/rfc/rfc4122.html#section-4.3

seidewitz commented 1 year ago

@himi

I was, in fact, planning on writing a note about version 3 vs. version 5 UUIDs this morning.

I had hoped to use Version 5, but I did quite a bit of looking for an accepted Java implementation of version 5 UUIDs and didn't find one. It's really rather surprising that this hasn't been added to the standard Java library. There is an Apache Commons implementation, but it is in the Sandbox and requires building from source (and the SVN repository link from their sandbox page is broken, so it is even hard to find the repository!). Maybe https://github.com/f4b6a3/uuid-creator would be another possibility.

I would rather not try to implement and test our own implementation of version 5 UUIDs in the short time that we have. Since the IDs we generate will become normative, I want to make sure we have a solid, accepted implementation for generating them. If you can find something that already exists like that, which we can slip in easily to the implementation, I would be willing to consider it.

As to nameUUIDFromBytes, its implementation of version 3 name-based UUIDs is really identical to version 5, except for the use of MD5 instead of SHA-1. You can add the namespace bytes on to the beginning of the bytes passed to the method, per the UUID specification (this is done in ElementUtil::constructNameUUID). And, for our purposes, I don't think it is likely that we will have problems generating unique UUIDs across the named elements of our libraries using MD5.
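The namespace-prefixing approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's ElementUtil::constructNameUUID: the class name NameUuidV3Sketch and the method nameUuidV3 are hypothetical, and UTF-8 encoding of the name is an assumption. Only java.util.UUID.nameUUIDFromBytes is taken from the discussion.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch: version 3 (MD5) name-based UUID with a namespace.
final class NameUuidV3Sketch {

    static UUID nameUuidV3(UUID namespace, String name) {
        // Assumption: the name is encoded as UTF-8.
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);

        // ByteBuffer is big-endian by default, which gives the namespace
        // UUID in network byte order, followed by the name bytes.
        ByteBuffer buffer = ByteBuffer.allocate(16 + nameBytes.length);
        buffer.putLong(namespace.getMostSignificantBits());
        buffer.putLong(namespace.getLeastSignificantBits());
        buffer.put(nameBytes);

        // nameUUIDFromBytes hashes the bytes with MD5 and sets the
        // version field to 3 and the variant bits to binary 10.
        return UUID.nameUUIDFromBytes(buffer.array());
    }
}
```

Calling nameUuidV3 twice with the same namespace and name returns the same UUID, which is the deterministic behavior the PR is after.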

Still, version 5 and SHA-1 would be better (particularly since ITU-T Rec. X.667, which is our normative reference, actually says in subclause 14.2 that "MD5 shall not be used for newly generated UUIDs"), so perhaps we can come up with a solution.

himi commented 1 year ago

Oh, I had not realized that a namespace UUID works with nameUUIDFromBytes() and that the result is compatible with the standard UUID namespace scheme. That reduces the risk of conflicts. I think that's sufficient.

If you want to use SHA-1, it's easy to use MessageDigest.getInstance("SHA-1"). I found an article on it.
https://stackoverflow.com/questions/29059530/is-there-any-way-to-generate-the-same-uuid-from-a-string

P.S.

You are right. According to the RFC, the algorithm just concatenates the namespace UUID and the name in network byte order: "Compute the hash of the name space ID concatenated with the name."

seidewitz commented 1 year ago

Another point to consider is that this update only generates deterministic UUIDs for elements with non-null qualifiedNames. It would certainly be possible to come up with approaches for generating UUIDs for other elements, but they all seem to have problems.

For example, one could generate name-based UUIDs for owned relationships by, say, using the owning element as the namespace and their order index as the name. However, this would mean that reordering the relationships, or even inserting a new one in the middle of the list (e.g., inserting a new member in a namespace other than at the end), would cause previously generated UUIDs to be reallocated to different relationships. This would be confusing, but it would be very hard to avoid when maintaining the library models in textual form in the future.
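The reallocation problem can be made concrete with a small sketch. Everything here is hypothetical: the class IndexBasedIdSketch, the "owner#index" key format, and the use of plain strings to stand in for relationships are illustrative assumptions, and UUID.nameUUIDFromBytes (version 3) is used only for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: IDs derived from the owner's UUID and a list position.
final class IndexBasedIdSketch {

    // Assumed key format "owner#index"; any position-based key behaves the same way.
    static UUID idForIndex(UUID owner, int index) {
        String key = owner + "#" + index;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    }

    // Assigns an ID to each relationship label according to its position.
    static Map<String, UUID> assignIds(UUID owner, List<String> relationships) {
        Map<String, UUID> ids = new LinkedHashMap<>();
        for (int i = 0; i < relationships.size(); i++) {
            ids.put(relationships.get(i), idForIndex(owner, i));
        }
        return ids;
    }
}
```

If the list [a, b, c] becomes [a, x, b, c], the new relationship x receives the ID that b had before, and b and c both receive different IDs, even though neither of them changed.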

One could also base the UUID of a relationship on the kind of the relationship and the UUIDs of its related elements, but this would only work easily if it could be assured that each instance of a relationship of each kind had a different set of related elements. This can be ensured for, say, ownership relationships, but not in general.

In any case, for standard library models, one would expect that it will be the named elements that will be referenced from other models, not the unnamed elements. Since the names for these elements are normative, it is important that their UUIDs are normative, too, for interchange and persistence of models in representations other than the textual notation. So it is necessary to do at least this.

seidewitz commented 1 year ago

If you want to use SHA-1, it's easy to use MessageDigest.getInstance("SHA-1"). I found an article on it.

I am not comfortable with just copying code from Stack Overflow. However, the article you referenced also links to the uuid-creator that I mentioned in my previous comment. I might be willing to use that (and it can be used with a Maven dependency), if it is really worth it. (One would have hoped that the Java library UUID class would be written to make it easy to extend and override nameUUIDFromBytes to change the use of MD5 to SHA-1, but it's not. Humph.)

himi commented 1 year ago

For me, it's quite simple to do so without copying the code from the article. The core logic is just 5 lines of code.

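        // Needs java.security.MessageDigest, java.nio.ByteBuffer, and java.nio.charset.StandardCharsets.
        MessageDigest hasher = MessageDigest.getInstance("SHA-1");
        hasher.update(name.getBytes(StandardCharsets.UTF_8));
        ByteBuffer hash = ByteBuffer.wrap(hasher.digest());

        // Take the first 128 bits of the hash and overwrite the version and variant fields.
        final long msb = (hash.getLong() & 0xffffffffffff0fffL) | ((version & 0x0fL) << 12);
        final long lsb = (hash.getLong() & 0x3fffffffffffffffL) | 0x8000000000000000L;

Hash the input, take the leading 128 bits, and apply the fixed bit masks for the version and variant.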
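For reference, a self-contained version of this idea that also feeds in a namespace UUID might look like the sketch below. It is not the UUIDDigest class added later in this thread; the class name NameUuidV5Sketch, the method nameUuidV5, and the choice of UTF-8 for the name are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical sketch: version 5 (SHA-1) name-based UUID with a namespace.
final class NameUuidV5Sketch {

    static UUID nameUuidV5(UUID namespace, String name) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("SHA-1 not available", e);
        }

        // Namespace UUID in network (big-endian) byte order, then the name.
        ByteBuffer ns = ByteBuffer.allocate(16);
        ns.putLong(namespace.getMostSignificantBits());
        ns.putLong(namespace.getLeastSignificantBits());
        sha1.update(ns.array());
        sha1.update(name.getBytes(StandardCharsets.UTF_8)); // assumption: UTF-8

        // The digest is 20 bytes; only the first 16 are used.
        ByteBuffer hash = ByteBuffer.wrap(sha1.digest());

        // Clear the version nibble and set it to 5.
        long msb = (hash.getLong() & 0xffffffffffff0fffL) | 0x5000L;
        // Clear the top two bits and set the variant to binary 10.
        long lsb = (hash.getLong() & 0x3fffffffffffffffL) | 0x8000000000000000L;

        // The UUID(long, long) constructor stores the bits as given.
        return new UUID(msb, lsb);
    }
}
```

Because the UUID(long, long) constructor does not touch the version or variant bits, the masks above are what make the result report version 5.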

himi commented 1 year ago

Oh, I found that nameUUIDFromBytes() is not fully compatible with v5, because the generated UUID version is still 3.

seidewitz commented 1 year ago

Oh, I found that nameUUIDFromBytes() is not fully compatible with v5, because the generated UUID version is still 3.

Well, yes, of course it sets the version to 3, because it is using MD5. But the UUID(msb, lsb) constructor doesn't.

The question is, how confident are we that the five lines of code you quote are correct? They differ from how it is implemented in nameUUIDFromBytes, so I want to look at them more closely.

himi commented 1 year ago

This article checked the consistency with nameUUIDFromBytes() and the algorithm is simple and transparent. But yes, before using it I would like to check the results.

seidewitz commented 1 year ago

This article checked the consistency with nameUUIDFromBytes() and the algorithm is simple and transparent. But yes, before using it I would like to check the results.

If you want to give it a try, you should be able to just update the implementation of ElementUtil.constructNameUUID. But we need to get it done in the next couple of days.

himi commented 1 year ago

OK. I'll work on it tonight ;). The implementation will provide options to choose the version and the hash algorithm, as well as the original nameUUIDFromBytes().

himi commented 1 year ago

I added UUIDDigest and made constructNameUUID() use it. I tested it by checking its consistency with nameUUIDFromBytes().

Namespace UUID: 48875c49-9ef6-46bc-91a0-2513702fccb0

nameUUIDFromBytes: abcd -> f2c5d3ea-fe44-3cc2-9664-95be2dc85b7e
MD5 V3: abcd -> f2c5d3ea-fe44-3cc2-9664-95be2dc85b7e
SHA1 V3: abcd -> 75083d56-21ff-3a8b-a653-d342e451050a
SHA1 V5: abcd -> 75083d56-21ff-5a8b-a653-d342e451050a

nameUUIDFromBytes: lmda384d -> 2350a46f-f420-3d81-b617-979e14a1c9dc
MD5 V3: lmda384d -> 2350a46f-f420-3d81-b617-979e14a1c9dc
SHA1 V3: lmda384d -> 7d0611f4-869a-38eb-bbc0-6af92ac9268d
SHA1 V5: lmda384d -> 7d0611f4-869a-58eb-bbc0-6af92ac9268d

nameUUIDFromBytes: あいうえ -> 79ee9435-68b6-3a21-91ed-0670b2367d85
MD5 V3: あいうえ -> 79ee9435-68b6-3a21-91ed-0670b2367d85
SHA1 V3: あいうえ -> cdd1a52f-449c-3995-bcbf-ef9ad429f187
SHA1 V5: あいうえ -> cdd1a52f-449c-5995-bcbf-ef9ad429f187

I also checked the SHA1 V5 results against https://www.uuidtools.com/v5 and confirmed that they match exactly.
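One more sanity check that can be run on output like the above: the "SHA1 V3" and "SHA1 V5" results for the same input should differ only in the version nibble. A minimal sketch, with the class and method names (VersionFieldSketch, versionOf, sameExceptVersion) being hypothetical helpers rather than anything in the PR:

```java
import java.util.UUID;

// Hypothetical sketch: inspect the version field of printed UUIDs.
final class VersionFieldSketch {

    // Parses the textual form and returns the 4-bit version number.
    static int versionOf(String text) {
        return UUID.fromString(text).version();
    }

    // True if the two UUIDs are equal once the version nibble is ignored.
    static boolean sameExceptVersion(UUID a, UUID b) {
        long mask = 0xffffffffffff0fffL;
        return (a.getMostSignificantBits() & mask) == (b.getMostSignificantBits() & mask)
                && a.getLeastSignificantBits() == b.getLeastSignificantBits();
    }
}
```

For example, the two SHA1 results listed for "abcd" can be passed through UUID.fromString and compared with sameExceptVersion, and versionOf can be used to confirm which version each one reports.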

seidewitz commented 1 year ago

@himi This looks very nice and seems to work well. I regenerated XMI for all the sysml.library models and got v5 UUIDs for all the named elements, which will now become normative.

Thanks!