GUID get changed anytime we re-create identical objects, which can create inconsistencies with the graph data

BHoM / RDF_Prototypes

Research project of the Cluster of Excellence "Integrative Computational Design and Construction for Architecture" (IntCDC) https://www.intcdc.uni-stuttgart.de/ **Project Name**: Knowledge Representation for Multi-Disciplinary Co-Design of Buildings. https://www.intcdc.uni-stuttgart.de/research/research-projects/rp-20/

GNU Lesser General Public License v3.0


GUID get changed anytime we re-create identical objects, which can create inconsistencies with the graph data #41

Closed DiellzaElshani closed 1 year ago

DiellzaElshani commented 2 years ago

Description:

In order to compare design options (or parts of the graph that changed), it is important to be consistent with the GUIDs and URIs in the graph. However, following BHoM conventions, any time we recompute the Grasshopper definition, new GUIDs are assigned to the objects and, consequently, new URIs to the nodes. Is there a way to keep object GUIDs consistent? I wonder how you deal with this in general when working with BHoM objects. Or do you query the object once it is created, and not instantiate it again?

[GIF: SensorData]

alelom commented 2 years ago

We can use a static Hash instead of a Guid, which is directly dependent on the values of the properties of the object. This is how we deal with that in BHoM.

The reason why I did not use a static Hash in the first place is because you said that a design requirement was this (elaborated from issue #40):

when two identical objects are input to the graph, they should not be recognized as the same object, but as two separate ones.

If you introduce a way to "not vary an object's identifier if the object did not change", then you end up in the same situation as per #40. In other words, the mechanism that you want here directly conflicts with https://github.com/BHoM/RDF_Prototypes/issues/40.

However, I think we should prefer the usage of a static Hash over a Guid. In other words, I think we should not implement #40, but this issue here instead, by using a static Hash.

The benefits of using a static Hash over a Guid for objects are:
B1) The graph stays the same if the objects did not change; it follows that graph-level versioning is simpler.
B2) The graph is smaller if identical objects are input.
B3) A hash is always computable for any object (so it is also available for Geometry objects, the subject of #40).

The downsides (?) are:
D1) Two identical objects will be recognised as the same one.
D2) Object-level versioning becomes trickier (but this is an advanced concept which is outside the scope of this work).

I think we can live with D1, and actually I cannot see what the issue would be with accepting D1. D2 should not be considered in our conversations as we do not aim to use the graph's objects for direct versioning, but rather we may want to consider the graph as a whole. In other words, I am not sure that D1 and D2 really qualify as downsides for our application.
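To make B1, B2 and D1 concrete, here is a minimal Python sketch of the two identifier strategies. It is not the BHoM hashing implementation: the `BASE_URI` namespace, the helper names, and the choice of SHA-256 over a canonical JSON dump are all assumptions made for illustration.

```python
import hashlib
import json
import uuid

BASE_URI = "http://example.org/bhom/"  # hypothetical namespace, not the project's real one


def content_hash(properties: dict) -> str:
    """Hash that depends only on the property values (key order is irrelevant)."""
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def uri_from_hash(properties: dict) -> str:
    """URI built from the static hash: identical objects share one URI."""
    return BASE_URI + content_hash(properties)


def uri_from_guid() -> str:
    """URI built from a random GUID: every re-creation yields a new URI."""
    return BASE_URI + str(uuid.uuid4())


# The same point, created twice (e.g. after recomputing a definition).
point_a = {"type": "Point", "X": 1.0, "Y": 2.0, "Z": 0.0}
point_b = {"Z": 0.0, "Y": 2.0, "X": 1.0, "type": "Point"}

same_hash_uri = uri_from_hash(point_a) == uri_from_hash(point_b)  # B1: stable
same_guid_uri = uri_from_guid() == uri_from_guid()                # GUIDs differ
graph_nodes = {uri_from_hash(p) for p in (point_a, point_b)}      # B2 / D1: one node
```

With the hash-based URI, the two identical points collapse into a single node, which is exactly the D1 trade-off discussed above.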

DiellzaElshani commented 2 years ago

I also think we should prefer the usage of a static Hash over a Guid. If a Point has the same coordinates, it is okay for it to be considered the same (it is like in math: 2 is 2). The only problem I am not sure about is this: we have named individuals, so in this case a point does not represent only "a point in three-dimensional Euclidean space" but also an object (so it is a node). But I still agree that using a static Hash is better.

alelom commented 2 years ago

Closed in f1aff4ac201bcd346e37dd31bf7694a0b9dfc4f9.

That commit also changes other stuff - I should've made 2 separate commits but I forgot. Anyway, the changes relevant for this issue are the following:

DiellzaElshani commented 1 year ago

I think we should reopen this issue.

While writing my paper, I realized that using the hash is also very risky, because the objects or individuals should have a static URI, but the hash changes every time we change the inputs of the object.

For example: in a co-design process, an architect creates some objects and converts them to a graph. A structural engineer pulls the graph, converts it to BHoM objects, adds new properties and pushes these changes back to the graph. This means that the hash of every modified object will now change.

We might have two results:
- The structural engineer deletes the former graph entirely and pushes the whole building again (overwrite).
- The structural engineer sends their objects to the same graph, but they end up in the wrong places or duplicated because of the new identifiers.

I am wondering whether in such a situation it is more logical to keep the GUID as the constructor of the URI and keep the hash just as a property.

@alelom please let me know what you think. I see point D2 here already points that out, https://github.com/BHoM/RDF_Prototypes/issues/41#issuecomment-1176311237

@danielhz please share your opinion on this matter.

alelom commented 1 year ago

Let's clear this up:

The hash changes if the object changes.
The GUID is auto-generated randomly the first time the object is created.

> For example: in a co-design process, an architect creates some objects and converts them to a graph. A structural engineer pulls the graph, converts it to BHoM objects, adds new properties and pushes these changes back to the graph. This means that the hash of every modified object will now change.

Yes, the hash of every modified object will change now, as it should. In fact, the objects that the engineer modified are no longer the same objects that the architect created. Why is that a risk?

The workflow you are describing in your example is an "Update" workflow, which involves versioning. We haven't even started covering versioning. In other words, let's not confuse:

"object finding" = one URI per object. Each object must have a unique URI.
"object versioning" = being able to answer the question "has the object changed?"

If we want to achieve a versioning mechanism when pushing the objects, we need a dedicated Identifier that is retained when the object is modified (it can be a GUID, but also a simple number or anything else). For "object finding", i.e. for the structure of URIs, using a GUID is actually a risk, because you may end up with thousands of different URIs that all point to an identical object on a server (look at the gif you posted at the start of this issue).

The problem you are posing is about creating an Update mechanism for existing graphs where you can trace back modifications of a "same" object. You can achieve this by assigning to the objects an identifier when you push them to TTL. This identifier must then be retained whilst the object is modified, and then also when the object is pushed to TTL the next time. How we deal with this in any BHoM_Adapter is by storing an AdapterIdFragment in the object's fragments. We can do the same or a similar solution here.

alelom commented 1 year ago

How it could work is:

1. You create some objects.
2. You push the objects to TTL. The TTL contains the objects, and every object is assigned an AdapterId (that can be a GUID).
3. The next engineer receives the TTL.
4. The TTL is converted to objects. Each object retains the AdapterId assigned previously.
5. The engineer picks some objects and modifies them. Their AdapterId is retained (but only if they use "Modify" methods which retain the existing properties -- if the engineer creates new objects and intends to replace other objects in the ontology, they must copy the target AdapterId manually!)
6. Two choices here (both are always available):
   6.1 The engineer pushes the modified objects to TTL together with the rest of the non-modified objects, OR
   6.2 The engineer pushes only the modified objects, specifying a path to the original TTL, which implies that only the modified objects (the ones with the corresponding AdapterId) must be overwritten. Which object is overwritten is determined by the equivalence in their AdapterId.
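Steps 2 to 5 can be sketched as follows. This is a hedged illustration, not the BHoM `AdapterIdFragment` API: `TrackedObject`, `push` and `modify` are hypothetical names, and the object is reduced to a plain dictionary of properties.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrackedObject:
    """Hypothetical stand-in for a BHoM object carrying an AdapterId."""
    properties: dict
    adapter_id: str = ""  # empty until the object is first pushed

    def content_hash(self) -> str:
        canonical = json.dumps(self.properties, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def push(objects):
    """Step 2: assign an AdapterId only to objects that do not have one yet."""
    return [o if o.adapter_id else replace(o, adapter_id=str(uuid.uuid4()))
            for o in objects]


def modify(obj, **changes):
    """Step 5: a 'Modify' method updates properties and retains the AdapterId."""
    return replace(obj, properties={**obj.properties, **changes})


# The architect creates and pushes a column; the engineer later adds a load.
column = push([TrackedObject({"type": "Column", "height": 3.0})])[0]
engineered = modify(column, load_kN=120.0)

id_retained = engineered.adapter_id == column.adapter_id          # identity kept
hash_changed = engineered.content_hash() != column.content_hash()  # content differs
```

The AdapterId answers "is this the same object?", while the hash answers "has the object changed?", so the two identifiers serve different purposes and can coexist.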

alelom commented 1 year ago

@DiellzaElshani If this makes sense, I would close this issue and create a new one called "Develop Update workflow for versioning of TTL objects" which we can tackle separately. This issue was more about the right choice for the URI identifier, for which I believe the hash is the best choice as I explained above.

DiellzaElshani commented 1 year ago

> How it could work is:
>
> 1. You create some objects.
> 2. You push the objects to TTL. The TTL contains the objects, and every object is assigned an AdapterId (that can be a GUID).
> 3. The next engineer receives the TTL.
> 4. The TTL is converted to objects. Each object retains the AdapterId assigned previously.
> 5. The engineer picks some objects and modifies them. Their AdapterId is retained (but only if they use "Modify" methods which retain the existing properties -- if the engineer creates new objects and intends to replace other objects in the ontology, they must copy the target AdapterId manually!)
> 6. Two choices here (both are always available):
>    6.1 The engineer pushes the modified objects to TTL together with the rest of the non-modified objects, OR
>    6.2 The engineer pushes only the modified objects, specifying a path to the original TTL, which implies that only the modified objects (the ones with the corresponding AdapterId) must be overwritten. Which object is overwritten is determined by the equivalence in their AdapterId.

Having adapterIDs that remain the same sounds like the best solution.

This gives us what we need: a specific ID for each class and property in the graph that does not change, even if we read the graph and add new properties to existing individuals.

Also, the approach of having static IDs fits well with how IFC conversion works. In the IFC to ifcOWL conversion, the IFC Global Unique Identifier attribute (GlobalId) is used to construct the URI in the graph. The GlobalId attributes of elements (assigned automatically by the design software) are retained in the exported IFC model (regardless of whether it is a graph or another IFC serialization format).

regarding point 6: 6.2 sounds more reasonable, and it requires less computing time.

alelom commented 1 year ago

@DiellzaElshani sounds good.

Regarding this:

> regarding point 6: 6.2 sounds more reasonable, and it requires less computing time.

It's actually the other way round. 6.2 is more user friendly (users do not need to pick and choose the objects to be modified, and they do not need to push the non-modified objects together with the modified ones), but it requires some computation. 6.1 needs almost no computation, but it requires users to always push all objects, modified and non-modified, which is less user friendly.
In fact, 6.1 simply writes out a new TTL with all the non-modified and modified objects (this new TTL can simply overwrite the existing TTL). Instead, 6.2 requires reading the entire existing TTL again just to check which AdapterIds are present in it; then a "Venn diagram" of the existing and the user-modified objects must be computed, whose intersection contains the objects with the same AdapterId ("modified, to be updated"). Only then can the Push to TTL happen. It still overwrites the entire existing TTL, but the result is the same: only the modified objects will appear as modified.
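The difference between the two options can be sketched as below, under the simplifying assumption that a TTL file is modelled as a dictionary from AdapterId to object properties. The function names `push_all` and `push_modified` are hypothetical; no TTL parsing is shown.

```python
def push_all(all_objects: dict) -> dict:
    """Option 6.1: the user supplies every object; the TTL is simply rewritten."""
    return dict(all_objects)


def push_modified(existing_ttl: dict, modified: dict):
    """Option 6.2: read the existing TTL, intersect AdapterIds, then rewrite."""
    # The "Venn diagram" intersection: AdapterIds present on both sides.
    to_update = existing_ttl.keys() & modified.keys()
    merged = dict(existing_ttl)
    merged.update(modified)  # matching AdapterIds are overwritten, new ones added
    return merged, to_update


existing = {
    "id-1": {"type": "Column", "height": 3.0},
    "id-2": {"type": "Beam", "span": 5.0},
}
modified = {"id-1": {"type": "Column", "height": 3.0, "load_kN": 120.0}}

# 6.1: the user assembles modified and non-modified objects before pushing.
result_61 = push_all({**existing, **modified})
# 6.2: the user pushes only the modified objects; the merge is computed.
result_62, updated_ids = push_modified(existing, modified)
```

Both calls end with the same content, which matches the point made above: 6.2 moves the bookkeeping from the user to the adapter at the cost of reading the existing TTL first.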

DiellzaElshani commented 1 year ago

- Since the end result of both 6.1 and 6.2 is the same, I would suggest implementing the one which takes less time to compute. To my understanding, overwriting only the objects that changed should take less computational time, but both are fine from my side. If we go with 6.2, I assume there should be a mechanism that checks which objects were modified and replaces only them. However, we should keep in mind that while querying the graph, we sometimes read an object with selected properties (not the whole object). When overwriting the object, we need to be able to add new properties only to the selected properties (edges/nodes). So we do not replace the object; we add, remove or edit its properties.

- Thanks for pointing out the distinction between finding an object and versioning it. However, I think these are interrelated, since we need to be able to check the changes that are happening in a certain object. In a static database, or when comparing two static databases, it would be easier to keep track of both; in a dynamic environment such as a co-design process, I assume the two (hash and ID) become interrelated.

I found a paper on the IFC approach, "Managing interrelated project information in AEC Knowledge Graphs", that tackles these questions:

CQ 1.1 How to semantically describe a property such that its value is changeable while its historical record is maintained?
CQ 1.2 How to revise a property value?
CQ 1.3 How to delete a property while still being able to retrieve the history of it and not break all the links to derived properties that depend on it?
CQ 1.4 How to restore a deleted property?
CQ 1.5 How to retrieve the full history of how the value of a property has evolved over time?
CQ 1.6 How to retrieve only the latest value of a property?
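To illustrate what these questions ask for, here is a small append-only sketch in Python. It is not the method of the cited paper and not part of BHoM; the `PropertyHistory` class and its method names are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

_DELETED = object()  # sentinel marking a "deleted" state in the history


@dataclass
class PropertyHistory:
    """Hypothetical append-only record of the states of one property."""
    states: list = field(default_factory=list)

    def revise(self, value):          # CQ 1.1 / 1.2: new value, old ones kept
        self.states.append(value)

    def delete(self):                 # CQ 1.3: mark as deleted, keep the history
        self.states.append(_DELETED)

    def restore(self):                # CQ 1.4: re-append the last real value
        previous = [s for s in self.states if s is not _DELETED]
        if not previous:
            raise ValueError("nothing to restore")
        self.states.append(previous[-1])

    def latest(self):                 # CQ 1.6: None if empty or deleted
        if not self.states or self.states[-1] is _DELETED:
            return None
        return self.states[-1]

    def history(self):                # CQ 1.5: full evolution over time
        return ["<deleted>" if s is _DELETED else s for s in self.states]


height = PropertyHistory()
height.revise(3.0)
height.revise(3.5)
height.delete()
value_after_delete = height.latest()
height.restore()
value_after_restore = height.latest()
```

Because states are only ever appended, the identifier of the property itself never has to change, which is compatible with the static ID discussed in this issue.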

Still, our current issue is having a static ID for each object in the graph, and the AdapterId sounds like a good approach.