Closed carpentermp closed 11 years ago
I think the bulk of this conversation is implementation-specific. Granted, the questions you raise here are issues that every provider is going to need to deal with in some way or another and I think we need to address these questions in the form of some documentation that defines best practices and suggestions. I'm going to leave this issue open until that documentation is provided.
I personally don't think it is sufficient to say that these issues are implementation-specific because of the interoperability problems that arise when implementors make different choices.
Here is my personal take on the specific questions you deal with here.
The “QName role” in RelationshipReference is “denormalized” information, right? The role is defined in the Relationship, so storing it in the reference is redundant and opens the possibility of it not being in agreement with the relationship. If we are putting denormalized information in the RelationshipReference then we ought to be explicit about why we are doing it. Is it to help the client know which relationships to dereference? For example, suppose we want to identify all the children of a person, we would only need to dereference the RelationshipReferences where the role is “Parent”. Is this the reason for the role on the reference?
Yes, that's the intent of having the "role" on the relationship reference, and we need to better document that it's there for convenience to know which relationships to dereference and that the possibility of inconsistent data exists if a provider has a buggy implementation.
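To make the convenience concrete, here is a minimal sketch of a client using the role on each reference to decide which relationships to dereference. The class shape and the `href` field are assumptions made for illustration; only the idea of a role stored on the reference comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipReference:
    role: str  # denormalized copy of the role defined on the Relationship
    href: str  # hypothetical location of the full Relationship


def references_to_dereference(references, wanted_role):
    """Return only the references whose role matches, so the rest can be skipped."""
    return [ref for ref in references if ref.role == wanted_role]


refs = [
    RelationshipReference(role="Parent", href="/relationships/1"),
    RelationshipReference(role="Spouse", href="/relationships/2"),
    RelationshipReference(role="Parent", href="/relationships/3"),
]

# To find the children of a person, only the "Parent" references need fetching.
child_links = [ref.href for ref in references_to_dereference(refs, "Parent")]
print(child_links)
```

If the role on the reference ever disagrees with the role on the Relationship itself, a filter like this silently skips or includes the wrong relationships, which is the inconsistency risk mentioned above.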
My suggestion would be that we don’t allow Relationships to be created without two Persons and a “type”, and that these three pieces of information are immutable.
I agree with that implementation, but (again) this is implementation-specific and there's no way for the model to actually enforce these rules.
I believe that the current state of the model implies the following: When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model ...
The git-like model for conclusion data is obviously not proven out yet. But I don't believe that relationship references preclude someone who wants to implement a git-like editing model from doing so. In their implementation, they would simply have persons that don't have any relationship references and work out for themselves all of the implications of that.
When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of...)
I absolutely agree that the notion of "watching" a person should include notification of an event/characteristic being added to a relevant relationship. But (again) this is implementation-specific. It's all about how a "watch" is implemented. The model doesn't impose that a watch on a person can't include notification of marriage events.
When a Person is deleted, the Relationships the Person is involved in are also deleted.
I would hope so, but (again) implementation-specific.
When a Person is merged with another Person, what happens to the Relationships?
Implementation-specific. Maybe some implementations don't even have the concept of "merge". FamilySearch certainly does, and they have to figure out what that means.
I have some follow-up comments on your comments...on my comments :)
that the current state of the model implies the following: When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model ...
The git-like model for conclusion data is obviously not proven out yet. But I don't believe that relationship references preclude someone who wants to implement a git-like editing model from doing so. In their implementation, they would simply have persons that don't have any relationship references and work out for themselves all of the implications of that.
I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships.
If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.
When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of...)
I absolutely agree that the notion of "watching" a person should include notification of an event/characteristic being added to a relevant relationship. But (again) this is implementation-specific. It's all about how a "watch" is implemented. The model doesn't impose that a watch on a person can't include notification of marriage events.
Again, I believe this has to be spelled out for proper interoperability. If one system considers the Person modified and another doesn't, users will see very unpredictable behavior. Imagine a client displaying a given person making a HEAD request or some such thing to see if the person's data needs to be updated. The server says the Person is not modified, but some of the data being shown is now different. That's not going to be a good experience.
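A rough sketch of the caching problem, under assumed names: the `etag` helper, the dictionary layout, and the status codes returned by `conditional_get` are all hypothetical and only stand in for whatever versioning a real server uses. The point is that two servers holding identical data can give different answers to the same conditional request, depending on whether relationship references count toward "modified".

```python
import hashlib
import json


def etag(person, include_relationship_refs):
    """Hash the parts of the Person that this server treats as its state."""
    visible = {"name": person["name"]}
    if include_relationship_refs:
        visible["relationship_refs"] = sorted(person["relationship_refs"])
    payload = json.dumps(visible, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def conditional_get(person, client_etag, include_relationship_refs):
    """Return 304 when the client's cached copy is considered current, else 200."""
    current = etag(person, include_relationship_refs)
    return 304 if current == client_etag else 200


person = {"name": "Bob", "relationship_refs": ["/relationships/1"]}
cached_a = etag(person, include_relationship_refs=True)   # server A's policy
cached_b = etag(person, include_relationship_refs=False)  # server B's policy

# A new Relationship involving Bob is created.
person["relationship_refs"].append("/relationships/2")

status_a = conditional_get(person, cached_a, include_relationship_refs=True)
status_b = conditional_get(person, cached_b, include_relationship_refs=False)
print(status_a, status_b)
```

Server A tells the client to refresh; server B reports "not modified" even though the list of relationships the client is displaying is now stale.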
When a Person is deleted, the Relationships the Person is involved in are also deleted.
I would hope so, but (again) implementation-specific.
You are suggesting that one system may delete the relationship, where another may just delete a person from the relationship. This implies that one system may allow relationships to be created with a single person, and another may not. A generic client, operating against both systems, will have a difficult time knowing what is going to happen when a given write operation is chosen. This difficulty will likely be passed on to users of the client. Again, if you don't specify the correct system behavior, it will impair interoperability of services with clients and it will be a bad experience for users.
When a Person is merged with another Person, what happens to the Relationships?
Implementation-specific. Maybe some implementations don't even have the concept of "merge". FamilySearch certainly does, and they have to figure out what that means.
It appears that you aren't really trying to achieve "write interoperability", but are content with "read interoperability". Is this so?
I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships. If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.
You're assuming the git model has a REST interface, but the git-like interface is much different from a REST interface. You don't just "get" a person like you would in a REST interface. You "get" a repository of data.
I believe this has to be spelled out for proper interoperability.
Agreed. Spelled out in the definition of the watch endpoint.
It appears that you aren't really trying to achieve "write interoperability", but are content with "read interoperability". Is this so?
Umm... I'm not sure what you mean.
My question came about because of the large number of things that you are saying are "implementation specific."
"Read interoperability" is achieved with the existence of a standard model and protocol which allows clients reading the data to all understand the data in a common way and to predictably navigate the connected resources. It also allows a single client to read data equally well from different providers.
"Write interoperability" is achieved with the existence of a standard model and protocol for modifying the data, specifying all the different ways that the data may be modified and what is the expected result. This allows generic clients to be developed that can let users do things like "add a conclusion to a Person" or "add an interpreted value to a field on a Record".
Most of my comments on this thread have been with the underlying assumption that we would be attempting to achieve both read and write interoperability. You wrote that some systems may not support "merge". That limits write interoperability since merge is a modify operation. Realizing this, it finally occurred to me that you may not have write interoperability as one of the goals of GedcomX.
While I can see that we may want to go for read interoperability as a first step, if we are ever hoping for write interoperability I believe we need to keep it in mind now, because it will affect modeling decisions. Let's take an example. For read interoperability, you have to decide whether or not multiple Relationships of a given type are allowed between the same two persons (what I commonly call the "uniqueness constraint" for Relationships). It has to be specified one way or the other so that clients can know whether or not they have to deal with it when reading relationships. Considering only the read implications it may seem like an o.k. thing to dispense with the uniqueness constraint. Now suppose "System A" does not enforce the constraint and "System B" does. Later, we decide to go for write interoperability. System A sends data to System B and System B is either forced to reject some relationships or merge them because of its uniqueness constraint. Compatibility suffers.
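The System A / System B scenario can be sketched roughly as follows. The dictionary layout and the `on_duplicate` switch are assumptions for illustration; the sketch only shows that a receiving system that enforces the uniqueness constraint has to either reject or merge what a looser system sends it.

```python
def import_with_uniqueness(incoming, on_duplicate):
    """Import relationships into a system that allows one per (person1, person2, type)."""
    stored = {}
    rejected = []
    for rel in incoming:
        key = (rel["person1"], rel["person2"], rel["type"])
        if key not in stored:
            stored[key] = {"facts": list(rel["facts"])}
        elif on_duplicate == "merge":
            # Fold the duplicate's facts into the relationship already stored.
            stored[key]["facts"].extend(rel["facts"])
        else:
            rejected.append(rel)
    return stored, rejected


# System A does not enforce the constraint, so it can export two "Couple"
# relationships between the same two persons.
from_system_a = [
    {"person1": "Joe", "person2": "Sue", "type": "Couple", "facts": ["Marriage 1799"]},
    {"person1": "Joe", "person2": "Sue", "type": "Couple", "facts": ["Marriage 1805"]},
]

stored_reject, rejected = import_with_uniqueness(from_system_a, "reject")
stored_merge, none_rejected = import_with_uniqueness(from_system_a, "merge")
print(len(stored_reject), len(rejected))
print(stored_merge[("Joe", "Sue", "Couple")]["facts"], len(none_rejected))
```

Either way, what System B ends up holding is not what System A sent, so a round trip between the two cannot be lossless.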
I can appreciate your desire to let some things be implementation specific, rather than specifying everything. On the other hand, I'm on the lookout for those things that are going to adversely affect interoperability and I'm trying to get them tied down. I would much rather come out with something that is too constrained and let the community urge us to relax the constraints than come up with something that is too loose so that the community suffers from poor interoperability ever after.
I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships. If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.
You're assuming the git model has a REST interface, but the git-like interface is much different from a REST interface. You don't just "get" a person like you would in a REST interface. You "get" a repository of data.
Yes, but I assume we are still talking about resources that have REST endpoints (as well as possibly GIT-like service endpoints)? For example, they want this GIT-style API to work for CP data (soon to be called CT). CP data is still hosted online, and still has REST endpoints for Persons. Do those REST endpoints have RelationshipReferences? When a RelationshipReference is added to a Person, is the Person "modified"? (E.g., if I GET it with a conditional GET, does it come back?) If the answer to these questions is "yes", then I don't see how that is compatible with the GIT-like service that is coexisting on the same data.
I believe this has to be spelled out for proper interoperability.
Agreed. Spelled out in the definition of the watch endpoint.
Watch endpoint? I'm talking about basic REST behavior: versioning, modified date, caching, conditional GET, stuff like that.
In order to revive this thread, and to illustrate unresolved issues with regard to Relationships in the Conclusion model, I thought it would be helpful to contrast two approaches at opposite sides of the spectrum. As I do this, the reader will note that positions in the middle between the two approaches are possible.
I will characterize the two approaches thus:
Also please be careful to note, as you go along, the following distinctions:
For clarity, let me begin with a definition. I'm defining "independent entity" as an object where the identity of the object is independent of the object's attributes. Our usual approach with independent entities is to let the system assign a unique identifier. So what is the distinction between "independent entity" and "plain old entity"? Domain Driven Design by Eric Evans has a chapter on entity modeling. He defines "entity" this way:
Some objects are not defined primarily by their attributes. They represent a thread of identity that runs through time and often across distinct representations. Sometimes such an object must be matched with another object even though attributes differ. An object must be distinguished from other objects even though they might have the same attributes. Mistaken identity can lead to data corruption (p91).
In this definition Eric seems to leave open the door for some entities to be "partly" defined by their attributes. In that case, a Relationship that is defined by the people involved and the relationship type (position 2) might be called an "entity", but would still not be considered an "independent entity" as I have defined it.
In the Conclusion profile of GedcomX, a Person is meant to represent a real person who lived on the earth. Real people have identity independent of their name, or anything else we know about them. Regardless of what we come to believe about the person (as represented by the Person), the identity of the person (little p) remains constant. Two systems may each have a representation of the Person (with different attributes) and they may want to match them up (perhaps to let a user do a comparison, or perhaps for synchronization). For these reasons (and several more that could be given), Person makes an excellent "independent entity". (Of course, a Person's identity can always be "hijacked", meaning that the data belonging to one real person can be replaced with data belonging to another real person. I'll get into how hijacking relates to relationships a little later on.)
Now a few words about how Persons behave in the system will be useful in preparation for considering the two approaches to Relationships.
Person | |
identified by: | |
meant to represent: | |
has: | |
participates in: | |
merge: | |
Now let's contrast the two approaches to Relationship.
Relationship as independent entity (RAIE) | Relationship defined by persons and type (RDBPAT) |
identified by: | |
meant to represent: | |
has: | |
mutability of Person participation | |
Relationship uniqueness constraint | |
merging | |
Source attachment | |
Person "awareness" of participation in Relationships | |
From this table it is easy to see that there are a significant number of fundamental differences between the two approaches. Let's consider each of these differences:
A genealogically sound system provides a clear, unambiguous way for users to make conclusions about things of genealogical significance. The system tracks what they say, who says so, when they said so, why they believe it, and where they got their information. We have already mentioned that the "identity" of a Person may be "hijacked" by replacing the data belonging to one "real person" with data belonging to another "real person". Person hijacking tends to be fairly rare, but is pernicious when it occurs. Why? Primarily, it is because it modifies the meaning of conclusions, making it appear that contributors believed something that they never believed.
Let's consider an example that illustrates this. Suppose, through my research, I determine that Bob, son of Joe and Sue, was born in 1800 and I conclude this by putting a "Birth Fact" on Bob. Suppose someone else hijacks Bob and uses him to represent Tim, Bob's little brother, who was born in 1802. (Suppose he does this by noticing that Tim has Joe and Sue as parents and so he changes the name from "Bob" to "Tim" and voila! job done.) Tim (from when he was Bob) still has a Birth Fact where I state that Tim was born in 1800! Of course, I never believed this, but there it is in black and white that I do. This example illustrates that, when a Person is hijacked, everything left over from before the hijacking no longer applies and, instead, lies. This includes Names, Gender, Facts, Sources, Relationships, everything. Some of it might still be true, but none of it still says what it was intended to say. (As in the example, Joe and Sue were Tim's parents, but when the Relationships were created they stated that Joe and Sue were Bob's parents.)
Fortunately as I said, Person hijacking is rare, and now that we will be tracking the who, when, why, and where of every conclusion (the 5 W's), is likely to become even more rare. In the past, Persons would gradually accumulate information from more than 1 real person (because of a lack of the 5 W's, it has often been hard to accurately determine which real person the Person is meant to represent) until the "real person" eventually migrated from one person to someone else.
So how does all this relate to "mutability of Person participation" in Relationships? When you allow users to change who participates in a Relationship, you have done nothing more than give them a way to hijack the Relationship! Another example illustrates this. Suppose there exists a "couple" Relationship between Joe and Sue. From my research, I determine that Joe and Sue were married in 1799 and I add a "Marriage Fact" to the existing Relationship. Suppose someone else decides that it wasn't Joe and Sue that were a couple, but Joe and Ann, so he changes the "woman" in the couple Relationship to point to Ann. When I come back, I see that Joe and Ann were married in 1799--and I'm the one who says so!
The existence of a Relationship is, in and of itself, a genealogical conclusion. If I say "Joe and Sue were a couple", that carries genealogical information. I don't need to know when (or if) they were married for the information to be worthy of capturing the 5 W's. Similarly with parent-child relationships. If I say, "Bob was Joe's son," that's a significant genealogical statement even without knowing for sure if the relationship was biological or adoptive.
What of "Relationship Facts"? How do they relate to the "Relationship conclusion"? When I put a "Marriage Fact" on a Relationship I am "fleshing out" the original conclusion e.g.
From this we can see that the marriage conclusion is subordinate to the original conclusion. If the original conclusion (Joe and Sue were a couple) is no longer believed, it is impossible to continue to believe the marriage conclusion (Joe was married to Sue...). Thus, if you change the original conclusion, you destroy all subordinate conclusions. (The astute reader will notice that this is just another way of describing "Relationship hijacking.")
Because of the problems cited with mutable Person participation, immutability is the only genealogically sound option. Relationships should be created specifying the Persons involved and the type. These three pieces of information form the original conclusion of the Relationship. Relationships may be fleshed out with subordinate conclusions that are in agreement with the original conclusion. When the original conclusion is no longer believed, the only genealogically sound option is to delete the Relationship and create a new one according to what is now believed to be true.
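One way to picture immutable participation is a small sketch like the one below. The class and field names are assumptions for illustration, not the GedcomX model itself. The two Persons and the type are fixed at creation, while subordinate conclusions (Facts) can still be added.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class Relationship:
    person1: str
    person2: str
    type: str
    # Subordinate conclusions; the list can grow, but the identity fields cannot change.
    facts: list = field(default_factory=list, compare=False)


couple = Relationship("Joe", "Sue", "Couple")
couple.facts.append(("Marriage", 1799, "contributor-1"))

# Attempting to repoint the Relationship at Ann (the "hijack") is refused.
try:
    couple.person2 = "Ann"
    hijacked = True
except FrozenInstanceError:
    hijacked = False

# The sound alternative: leave (or delete) the old conclusion and create a new one.
replacement = Relationship("Joe", "Ann", "Couple")
print(hijacked, couple.person2, replacement.facts)
```

The new Joe and Ann Relationship starts with no Facts, so nobody's 1799 marriage conclusion about Joe and Sue is silently carried over to Joe and Ann.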
The Relationship uniqueness constraint (RUC) goes hand-in-hand with the definition of what Relationships are meant to represent. Without the RUC you are forced to define each relationship as one of possibly many "chapters" in the actual human relationship. As a concept, this is sort of defensible with couple relationships where people can get married, divorced, and remarried. It's a real stretch, however, for parent-child relationships. Also, the concept seems to serve no purpose in the model. All that can be understood from multiple "relationship chapters" can be understood equally well by multiple marriage and divorce events on a single Relationship object.
Furthermore, the concept brings additional complexity for the system and for users:
I really can see no value to the "relationship chapter" definition of Relationship and a huge downside. Truly, the Relationship uniqueness constraint is our friend.
A question that might be asked is, "why is it better to identify Relationships by the Persons involved and the type instead of an arbitrary system-assigned unique identifier?". The simple answer is, "it's the longest-lived, most robust identifier that can be used." An example will illustrate this.
Suppose I add a "marriage Fact" to the Couple Relationship between Joe and Sue. After a week I come back and see that the marriage Fact I added is not there. I check the change history and there is no record of the marriage Fact I added. I think, "the system is broken." What happened? Well, someone came along and decided that Joe and Sue were not a couple after all so they deleted the Relationship (the one that had my marriage Fact on it). Someone else came along and decided that, yes, they were a couple after all, so they re-created the Relationship--only this is a "new never seen before" Relationship, with a new change history.
This is one of the fundamental problems with "Relationships as independent entities." Without the Relationship uniqueness constraint, there is no way around this problem. Even with RUC, the problem can happen as I described. However, it must be admitted that, with RUC, there is a way around the problem. Whenever a Relationship is created, the system checks for the existence of a "tombstoned" Relationship and "resurrects" the Relationship, when found (thus preserving the change history). Once you are doing this, however, you are just using the system-assigned identifier as a pseudonym for the RDBPAT identifier. You've accepted that Relationships are defined by the Persons involved and the type. Unfortunately, you haven't solved all the problems that come with the system-assigned identifier. Here are a couple more:
The net effect of these two differences is that in RDBPAT, Relationships are much simpler to deal with.
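The "tombstone and resurrect" idea discussed above can be sketched as follows, assuming a hypothetical in-memory store keyed by the Persons involved and the type. The class and method names are made up for illustration. Because the key is the identity, deleting and re-creating the Relationship keeps a single change history.

```python
class RelationshipStore:
    """Hypothetical store where (person1, person2, type) is the Relationship's identity."""

    def __init__(self):
        self._records = {}

    def create(self, person1, person2, rel_type, who):
        key = (person1, person2, rel_type)
        record = self._records.get(key)
        if record is None:
            record = {"deleted": False, "history": []}
            self._records[key] = record
        elif not record["deleted"]:
            raise ValueError("relationship already exists")
        else:
            record["deleted"] = False  # resurrect the tombstoned Relationship
        record["history"].append(("created", who))
        return key

    def add_fact(self, key, fact, who):
        self._records[key]["history"].append(("added " + fact, who))

    def delete(self, key, who):
        record = self._records[key]
        record["deleted"] = True  # tombstone instead of erasing
        record["history"].append(("deleted", who))

    def history(self, key):
        return list(self._records[key]["history"])


store = RelationshipStore()
key = store.create("Joe", "Sue", "Couple", who="alice")
store.add_fact(key, "Marriage 1799", who="me")
store.delete(key, who="bob")
same_key = store.create("Joe", "Sue", "Couple", who="carol")
print(store.history(same_key))
```

After the re-creation, the history still shows the marriage Fact I added, who deleted the Relationship, and who brought it back.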
Another area of difference between the RAIE and RDBPAT models is in attachment of sources. In RAIE, sources are attached explicitly by users, in RDBPAT attachment is done automatically by the system. To ask users to attach sources to Relationships is to needlessly complicate their life.
To illustrate my meaning, let me give an example. Suppose I find a birth record that shows that Joe and Bob are father and son. Suppose I also find Joe and Bob in my conclusion tree and decide to attach the sources. I attach source-Bob to conclusion-Bob and source-Joe to conclusion-Joe. Having done this, the system is able to infer that source-parent-child-relationship supports the conclusion-parent-child-relationship. It was not necessary to ask the user to do this.
In fact, asking the user to do it will probably create chaos. Suppose I attach the source-relationship to a different conclusion-relationship--what does that even mean if the source Persons don't agree? When they don't agree, it's just a mess--and one that we will have to ask our users to clean up. When they do agree, it's redundant, and so pointless.
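A minimal sketch of the system-side inference, assuming hypothetical identifiers and a simple mapping from source persons to conclusion persons: once both ends of a source relationship have been attached, the supported conclusion relationship follows without any further user action.

```python
def inferred_relationship_support(source_relationships, attachments):
    """Derive which conclusion relationships are supported by a source.

    source_relationships: iterable of (source_person1, source_person2, type)
    attachments: maps a source person id to the conclusion person it was attached to
    """
    supported = []
    for source1, source2, rel_type in source_relationships:
        if source1 in attachments and source2 in attachments:
            supported.append((attachments[source1], attachments[source2], rel_type))
    return supported


# The birth record shows source-Joe as the parent of source-Bob.
birth_record = [("source-Joe", "source-Bob", "ParentChild")]

# The user attaches only the persons.
attachments = {"source-Bob": "conclusion-Bob"}
before = inferred_relationship_support(birth_record, attachments)

attachments["source-Joe"] = "conclusion-Joe"
after = inferred_relationship_support(birth_record, attachments)
print(before, after)
```

With only one person attached nothing is inferred; with both attached, the relationship support appears automatically and can never disagree with the person attachments.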
From the outset it must be noted that, where up to now most of the discussion has been primarily GedcomX model focused, this is primarily a GedcomX webservice discussion.
To begin this discussion I will describe a use case that I feel the web service should support without difficulty: suppose a client wishes to download a portion of the tree (set of Persons) and thereafter "stay in sync" with changes made to that portion. To support this, the webservice will, at minimum, need to be able to answer these two questions:
Now one question that must be answered is, "when fetching a Person, what is returned?"
Consider the following in the abstract: for any given person (little p), if I say, "tell me what you know about that person" you would certainly tell me the person's name, gender, birth date, and such. If the person had been married, would you tell me when and where and to whom, if it was known? Of course you would. Obituaries are a great example of this. They are all about a single person--recently deceased--and are filled with relationship information. Suppose yesterday I asked you, "tell me what you know about Bob" and you told me. Suppose later that day you discovered that Bob had another child that you didn't tell me about when I asked. Suppose I ask you today, "has anything you know about Bob changed since I asked you yesterday?" Would you tell me about the new child? Of course.
So, with respect to question 1 above, "everything you know about a given person" includes the Relationships he participates in, and the Relationship Facts of any of those Relationships.
With this as a backdrop we are ready to define "Person awareness of participation in Relationships", then we can consider what "awareness" means for our synchronization use case. To say that a Person is aware of involvement in Relationships is to say that, whenever a Person is added to, or removed from, a Relationship:
There is an additional level of awareness where the Person is modified, and a change is recorded, whenever a change is made to the contents of any Relationship the Person participates in e.g. the Person is logically modified when a Marriage Fact is added to a couple Relationship he participates in.
From our table on the two approaches to Relationship above, we remember that RAIE has neither type of "awareness", and RDBPAT has both. Now let's consider the implications of this for our synchronization use case and the 2 questions that the webservice has to be able to answer:
Relationship as independent entity (RAIE) | Relationship defined by persons and type (RDBPAT) |
1. Tell me everything you know about a given person: | |
2. Tell me what's changed about a given person since last I asked: | |
Person awareness of participation in Relationships would seem to be an important requirement. Not only does it save clients a great deal of trouble, but having a "deleted Relationship" change in the change history also gives the client a handle to deleted Relationships so the delete can be inspected and/or undone. Without such a change entry, that functionality is very awkward, if not impossible.
As for 2nd-level awareness (where change entries are put into the Person change history for Relationship Fact changes) it's probably not as hard a requirement as basic Relationship awareness, but it sure simplifies the client's life. Instead of fetching a Person and then doing a multithreaded fetch of all the Relationships he participates in, a single Person fetch suffices. Also, when asking for "changes since", in RDBPAT a single Person-change fetch suffices.
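Both levels of awareness can be sketched with a hypothetical change log kept per Person. The `Tree` class, its method names, and the integer version counter are assumptions for illustration. Every Relationship change is recorded against both participants, so "what's changed since I last asked" becomes a single lookup.

```python
from collections import defaultdict


class Tree:
    """Hypothetical tree that records Relationship changes on the Persons involved."""

    def __init__(self):
        self._version = 0
        self.person_changes = defaultdict(list)

    def _record(self, persons, description):
        self._version += 1
        for person in persons:
            self.person_changes[person].append((self._version, description))
        return self._version

    def add_relationship(self, p1, p2, rel_type):
        # Basic awareness: participation changes modify both Persons.
        return self._record((p1, p2), f"added {rel_type} relationship {p1}-{p2}")

    def delete_relationship(self, p1, p2, rel_type):
        return self._record((p1, p2), f"deleted {rel_type} relationship {p1}-{p2}")

    def add_relationship_fact(self, p1, p2, rel_type, fact):
        # Second-level awareness: Relationship Fact changes also modify both Persons.
        return self._record((p1, p2), f"added {fact} to {rel_type} relationship {p1}-{p2}")

    def changes_since(self, person, version):
        return [text for v, text in self.person_changes[person] if v > version]


tree = Tree()
synced_at = tree.add_relationship("Joe", "Sue", "Couple")
tree.add_relationship_fact("Joe", "Sue", "Couple", "Marriage 1799")
tree.delete_relationship("Joe", "Sue", "Couple")
sue_changes = tree.changes_since("Sue", synced_at)
print(sue_changes)
```

A client that synced right after the Relationship was created learns about both the new marriage Fact and the later deletion from Sue's history alone, without fetching any Relationship.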
Another idea that has been proffered is that many of these issues can be left as details for each system implementer to choose. I have been operating under the assumption that we want to define GedcomX such that two systems, built by different parties, could potentially synchronize changes with each other in both directions. Interoperability of this kind is not possible without these aspects of the model being part of the specification. For this kind of interoperability, it is necessary to specify all the ways that the data may be modified, and what the results of those operations are expected to be. Also, to be genealogically sound, each change has to record the 5 W's.
For example, it is not possible to leave the question of mutable Person participation as implementation-specific. If these changes are to be allowed, then both systems have to support them in order to synchronize, and there has to be a place in the model to record the 5 W's for these changes.
As another example, consider RUC. As we have already explained, if one system enforces the RUC and another does not, the two systems actually have a different model for Relationships. Or stated another way, Relationship means something different in each system. With such different Relationship models, it is impossible to round-trip Relationships between the two systems.
I have been uneasy for a long time that there are issues with the Relationship model that have needed addressing. As a catalyst to addressing them, I have described two very different models and I have tried to describe the consequences of these differences to users and to system implementers. I hope this will foster the kind of dialogue that will ultimately result in a resolution of the issues.
Wow. That's an impressive essay. Thanks for all that work.
I see some holes, though. It's going to take me some time to sift through and pull together a worthy response.
I think it would be helpful to summarize all the changes that would be applied to the model if your argument were wholly accepted without dispute. I think it looks like this (correct me if I'm wrong):
Relationship would not extend GenealogicalEntity.
Person would contain a list of Relationships.
Relationship would just refer to one Person since the "source" Person is identified by the context of the Relationship.
Anything else?
Thanks for reading it!
I was deliberately vague about specific model changes because I wanted to get conceptual understanding first, and I figured that if I started with the changes implied by my arguments, that might produce an emotional response in some people that would cause them to reject the arguments without first hearing them out.
Your summary of the model changes I would propose is pretty accurate, but it should be noted that I made arguments in favor of several semi-independent things, and some of them don't require model changes, per se. Here are the main points and the model changes that would be inferred by them:
The last main point had to do with "system-attached" vs. "user-attached" Relationship sources. As I considered the model changes implied by my arguments on this topic I realized that the light treatment I gave the topic will really not suffice to explain the changes I would like to see. I would like to correct that now with a general discussion of source-linking.
Let's start with the list of source links on Person. With any model it is important to be able to talk about the meaning of each of the model's parts. What, then, does it mean when a Person links to a source? My answer would depend slightly on what is being linked to. I would characterize the meaning this way:
The last two meanings in this list are the essence of "source-centric" genealogy that @ranbo has been promoting lo these many years.
Now I noticed that, in the model today, "Conclusion" also has a list of sources, so we have to answer: what does it mean when a Fact, for example, links to a source? I suppose:
Is it ever the case that a source has "something to do with" a Fact without also having "something to do with" the Person where the Fact resides? No, that makes no sense. For example, anything having to do with Bob's birth fact must, by definition, also have something to do with Bob. Thus, it makes no sense for a Conclusion to list a source that is not also listed in the Person. The current model, however, with its independent lists of sources on Person and on Conclusion, would seem to be perfectly comfortable with this logical contradiction. So this is the first thing about the source model that I would like to refine.
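To make the constraint concrete, here is a minimal sketch of the check it implies: every source cited by a Conclusion should also appear in the source list of the Person that holds it. The `Person` and `Conclusion` classes and the `orphaned_sources` helper are hypothetical stand-ins for illustration only, not types from the actual model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Conclusion:
    """Hypothetical stand-in for a model Conclusion (e.g. a Fact)."""
    name: str
    sources: Set[str] = field(default_factory=set)


@dataclass
class Person:
    """Hypothetical stand-in for a model Person."""
    name: str
    sources: Set[str] = field(default_factory=set)
    conclusions: List[Conclusion] = field(default_factory=list)


def orphaned_sources(person: Person) -> Dict[str, Set[str]]:
    """Map each conclusion name to the sources it cites that the Person does not list."""
    problems: Dict[str, Set[str]] = {}
    for conclusion in person.conclusions:
        missing = conclusion.sources - person.sources
        if missing:
            problems[conclusion.name] = missing
    return problems


bob = Person(
    name="Bob",
    sources={"census-1900"},
    conclusions=[
        Conclusion("birth", {"census-1900"}),
        Conclusion("death", {"death-record"}),  # cited on the Fact but not on Bob
    ],
)
print(orphaned_sources(bob))
```

A model that only lets a Conclusion point at source references already held by its Person would make this check unnecessary, because the contradiction could not be expressed in the first place.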
Secondly, I view attaching a source as a "genealogical statement" akin to a "Conclusion" in the model. If I say, "this resource has something to do with Bob," or if I say, "the person mentioned in this source is the same real person as the real person represented by my tree Person Bob", I am saying something of genealogical significance, something that could be disagreed with, and something that could take some justification. Thus, it is important to capture the 5 W's of that statement. The current model's simple list of ResourceReferences gives me no way to do this.
Thirdly, you may have noticed a lack of symmetry in my definitions of what it means to attach sources-to-Persons vs. what it means to attach sources-to-Facts (or other Conclusions). For Persons, we attached special significance to attaching Personas and other-tree-Persons. Is there not an analogue with Facts? Would it not be logical to give the same kind of status to attaching source-Facts and other-tree-Facts to Facts? Perhaps I want to say, "the Birth Fact in this source represents the same real-life-birth-event as the birth Fact in this Person?" Is that not logical? Yes, but it turns out that by attaching Person-Bob to Persona-Bob you have already made this statement. If Person-Bob and Persona-Bob represent the same real person, then any birth Facts on either object must be talking about the same real-life-birth-event--Bob's birth.
Thus, we see that, when attaching Personas or other-tree-Persons to Persons, the system is able to understand everything the source believes to be true about the real person, and this information can easily be compared with anything the tree-Person says about the real person. For the most part, it is not necessary or helpful for users to explicitly attach source-Facts to conclusion-Facts (or source-Relationships to conclusion-Relationships, which was my original point in my earlier post).
Having said this, however, it must be noted that we still need a kind of "genealogical statement" that we have not yet provided for. It is often the case that, while we believe the Persona and Person represent the same real person, we disagree with a piece of information in the source. For example, it's fairly common for a death record to give a date of birth and occasionally the birth date is wrong. We may have other information about what the correct birth day is, and so we disbelieve what the death record says about the birth date. In order to avoid countless re-evaluations of the discrepancy between the conclusion-birth-Fact and the source-birth-Fact (implied by attaching the Persona to the Person) it will be important to provide a way for users to acknowledge such discrepancies when making conclusions so that they can be "put to rest."
So this brings us at last to how I would propose changing the model for source-linking. To solve my second concern (Relationship as conclusion), I would:
Then, to solve my first concern (independent lists of sources in Person and Conclusion), I would:
I would probably stop here, and leave my third concern (accounting for source data that disagrees with my conclusions) to another version of the standard. However, to give an idea of what I have in mind, let me say that, if I had to try to solve my third concern today, I would:
There are some excellent points in the above proposals. One that I want to throw my entire weight behind is this:
A Relationship should represent everything we know about the relationship of that type between those two people.
This implies the relationship uniqueness constraint, which in turn implies that relationships must be automatically merged when the people on both ends are merged. This is what has been done in new FamilySearch for years, and is one thing that has been working just fine. Yes, merging and relationships are both complex areas, but having multiple relationships of the same type between the same people makes it more complex without adding value. I also agree that the model may not need to change to support RUC, but the documentation should explain clearly that this is what a Relationship means. The concept of having a separate relationship for each "chapter" (time span), or each event, or each source is a horrible complication for clients and users.
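As a rough illustration of what the uniqueness constraint means for merging, the sketch below keys each relationship by its type and the two person ids, so that merging two persons automatically folds colliding relationships together. The key layout, the list-of-strings facts, and the `merge_person` helper are all assumptions made for this example and do not describe how new FamilySearch implements it.

```python
from typing import Dict, List, Tuple

# Hypothetical identity of a relationship: (type, person1 id, person2 id).
RelKey = Tuple[str, str, str]


def merge_person(
    relationships: Dict[RelKey, List[str]], survivor: str, duplicate: str
) -> Dict[RelKey, List[str]]:
    """Re-point relationships from `duplicate` to `survivor`, merging collisions."""
    merged: Dict[RelKey, List[str]] = {}
    for (rel_type, p1, p2), facts in relationships.items():
        p1 = survivor if p1 == duplicate else p1
        p2 = survivor if p2 == duplicate else p2
        if p1 == p2:
            # Assumption: a relationship between the two merged persons is dropped.
            continue
        bucket = merged.setdefault((rel_type, p1, p2), [])
        for fact in facts:
            if fact not in bucket:  # keep one copy of each fact
                bucket.append(fact)
    return merged


rels = {
    ("couple", "bob1", "alice"): ["marriage 1900"],
    ("couple", "bob2", "alice"): ["marriage 1900", "divorce 1910"],
    ("parent-child", "bob2", "carol"): [],
}
print(merge_person(rels, survivor="bob1", duplicate="bob2"))
```

Because the key is the identity, there is never more than one relationship of a given type between the same two people after the merge, which is the property the uniqueness constraint asks for.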
I continue to seek a big block of time to formulate a point-by-point response to @carpentermp's essay, but I just can't find it. So I'm afraid my response here is going to lack the detail that is deserved by such a great essay. But the essay deserves some timeliness, too, so here we go...
The whole "discussion on person awareness" seems to be based on an erroneous assumption that I think needs to be clarified. Entity boundaries as defined by the model are NOT the same as the boundaries for web service resources.
Your use case about a client needing to stay in sync with changes made to person simply argues for the need for defining appropriate web service resource boundaries to make sure you can do that conveniently. Great. I totally agree. For the case that you're most interested in, we'd want to define a "person with relationships" resource that will provide a resource that includes a person and all relevant relationships, and we'd make sure that the cache validation (e.g. Last-Modified) was appropriate for that resource. Just because the entity boundaries are defined as they are today doesn't mean that we can't define a "resource" with wider boundaries.
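One way to picture the cache validation for such a wider resource: its Last-Modified value would be the most recent modification time among the person and every relationship it includes. The following is only a sketch under that assumption; the `aggregate_last_modified` function is hypothetical and not part of any existing API.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Iterable


def aggregate_last_modified(
    person_modified: datetime, relationship_modified: Iterable[datetime]
) -> str:
    """Return an HTTP-date for the newest change among the person and its relationships.

    All datetimes are assumed to be timezone-aware and in UTC.
    """
    latest = max([person_modified, *relationship_modified])
    return format_datetime(latest, usegmt=True)


person_changed = datetime(2012, 1, 15, 8, 30, tzinfo=timezone.utc)
relationships_changed = [
    datetime(2011, 12, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2012, 3, 1, 12, 0, tzinfo=timezone.utc),  # e.g. a new marriage date
]
header_value = aggregate_last_modified(person_changed, relationships_changed)
print("Last-Modified:", header_value)
```

With this rule, a client that revalidates the wider resource sees a change whenever any included relationship changes, even if the person itself was not touched.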
- Relationship would extend Conclusion or GenealogicalResource instead of GenealogicalEntity.
- Person would contain a list of Relationships.
- Relationship would just refer to one Person since the "source" Person is identified by the context of the Relationship (and there must be a way of distinguishing parent-child Relationships where the Person is a Parent from those where he is a Child).
This is totally baffling to me. I can't see how this would be an improvement to the model, even given all your arguments for RDBPAT, which I generally agree with. I see nothing but trouble with this suggestion. We'd have to first write up all the extra rules that say that relationships on the left-side person match the relationships on the right-side person, and then we'd have to write up all the rules for what to do when the relationships on the left-side person don't match the relationships on the right-side person. And in many cases, there's no good way to resolve the differences, such as when a relationship fact exists on one side but not the other (do I remove it? do I add it to the other?). Yuck!
We'd be imposing that developers implement relationships as entities in their back-end, but the model would require denormalization. This is just bad modeling practice.
The current model, however, with its independent lists of sources on Person and on Conclusion, would seem to be perfectly comfortable with this logical contradiction. So this is the first thing about the source model that I would like to refine.
Actually, the existing model intends that conclusion source references point to source references on the person. That needs to be more explicitly clarified, though.
Thus, it is important to capture the 5 W's of that statement. The current model's simple list of ResourceReferences gives me no way to do this.
You're absolutely right. There needs to be a SourceReference object that extends ResourceReference that contains all the "5 W's" stuff. Thanks for pointing that out. It was always intended to be there, and I'm pretty sure it was there at one point, but I don't know where it went. Disturbing...
allow negative genealogical statements.
I agree. Let's open up an issue to track that.
SourceReference again.
New issues have been spawned as children of this issue at #120, #125, #126, and #127 so their progress can be tracked independently.
Obviously, this issue is still active. We'll use it to track the formal definition of the RDBPAT concept.
I have been working on this response for some time, and it's not really done, but after yesterday's meeting I thought it would be a good idea to post it, as is, for the benefit of the community since it contains much of what was discussed. I don't expect it will be very influential (it certainly wasn't yesterday), but here goes nothing...
As I consider my efforts to persuade on this issue I realize that some of our differences of opinion may come from different design objectives, or from a different emphasis on them. Perhaps making those design objectives explicit will help.
The following were my top design objectives:
I did not intend the order of these objectives to be significant--the desire is to achieve them all. Of course, there are always tradeoffs. Achieving the optimal balance of all the objectives is the ultimate goal. This is admittedly difficult, but it should be possible to compare and contrast any set of options with respect to these design objectives.
Now that I have listed my key design objectives, I am ready to respond to your comments. Let me begin with this from your post:
Entity Boundaries != Resource Boundaries
The whole "discussion on person awareness" seems to be based on an erroneous assumption that I think needs to be clarified. Entity boundaries as defined by the model are NOT the same as the boundaries for web service resources.
I'm really glad you brought up this point because there may be a difference in our thinking here that has been leading us down different paths. By exploring it perhaps our paths can be brought closer together.
The client of a web service experiences the model through the web service. The resources of the web service, and the methods of manipulating them are the only model the client knows. When you define a resource in a RESTful web service, from the point of view of REST (and the client), you have created an entity. Why do I say this? You gave it a unique identifier (URI). You may like to consider these REST entities as a different kind of entity than model entities, but they are the only kind of entity that REST naturally understands. Whenever you assign a URI to something, you are forced to answer all the usual REST entity questions:
So, why have I taken such pains to characterize resources as "REST entities"? Bear with me. I am trying to lay the groundwork that will allow issues to be brought into focus.
Having defined these two kinds of entities, we could rephrase your initial statement this way:
Model entity boundaries != REST entity boundaries
Given that we have agreed that there needn't be a one-to-one correspondence between model entities and REST entities, it would seem that we differ only on how to deviate from this to best advantage.
For example, with RDBPAT I really wasn't trying to abolish Relationship as a model entity--I just didn't want it expressed as a REST entity. I wanted to avoid creating a system-assigned unique identifier (which, when expressing Relationship as a REST entity, is the URI). Instead, I wanted to be sure that, whenever a Relationship is identified, it is identified by the two people involved and the type. The advantages I cited for identifying Relationships this way were:
I proposed a REST entity like this (pseudo-xml):
<person persistentId="...">
<facts>...</facts>
<sources>...</sources>
<relationships>
<parent ref="...">
<facts>...</facts>
</parent>
...
</relationships>
</person>
As I said before, with this proposal I wasn't trying to change the model entities. I just wanted to change how they are expressed through the web service. (For example, when a client consumes an entire tree at once, even I can see that having Relationship outside of Person is better, since the data is then fully normalized.)
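The distinction between the model entities and the way they are expressed can be sketched as a projection: the relationships stay normalized in the model, and the person-centric view is derived from them on the way out. This is only an illustration; the tuple layout, the convention that person1 is the parent, the `child` element, and the `person_view` function are all assumptions for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical normalized relationship: (type, person1 id, person2 id).
# Assumption: for "parent-child", person1 is the parent and person2 the child.
Relationship = Tuple[str, str, str]


def person_view(person_id: str, relationships: List[Relationship]) -> ET.Element:
    """Project normalized relationships into a person-centric element."""
    person = ET.Element("person", persistentId=person_id)
    rels = ET.SubElement(person, "relationships")
    for rel_type, p1, p2 in relationships:
        if rel_type != "parent-child":
            continue  # other relationship types are left out of this sketch
        if p2 == person_id:
            ET.SubElement(rels, "parent", ref=p1)
        elif p1 == person_id:
            ET.SubElement(rels, "child", ref=p2)
    return person


model = [
    ("parent-child", "dad", "bob"),
    ("parent-child", "bob", "kid"),
    ("parent-child", "dad", "sis"),  # does not involve bob, so it is skipped
]
view = person_view("bob", model)
print(ET.tostring(view, encoding="unicode"))
```

Note that no relationship identifier appears in the view: each embedded relationship is identified by the enclosing person, the referenced person, and the element name.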
You countered by suggesting a "person with relationships" resource. I suppose it would be logically something like this:
<person-with-relationships>
<person persistentId="...">
<facts>...</facts>
<sources>...</sources>
</person>
<relationship type="parent-child">
<person1 ref="..."/>
<person2 ref="..."/>
<facts>...</facts>
</relationship>
...
</person-with-relationships>
In the above example, note that the Relationship has "person1" and "person2", which have URIs to their respective Persons. If I do an HTTP GET on these URIs, what comes back: the simple <person> or <person-with-relationships>? If the latter, and if the web service doesn't provide a URI to individual Relationships, then we are free to have Relationship extend GenealogicalResource instead of GenealogicalEntity and it wasn't necessary to embed Relationships inside of Person. Are you more comfortable with this approach?
For the purposes of illustration, let's explore for a minute the other option, where Person URIs return simple <person>. In that case, to be fully connected I suppose we'll need a link to the <person-with-relationships> in <person>, e.g.:
<person persistentId="...">
<facts>...</facts>
<sources>...</sources>
<links>
<link rel='person-with-rels' href="..."/>
</links>
</person>
With this approach, to get to the Relationships of a Person, I'll end up fetching the <person>, then fetching the <person-with-relationships>. Once I have done this I have fetched the contents of <person> twice. This affects all three aspects of cacheability:
- [...] <person> information twice.
- <person> information is sent twice.
- <person> has to be cached twice. This will cause the cache to reach its size limits sooner, forcing more objects out of the cache without going stale, which will further impact server and over-the-wire costs.

This is a small illustration of a problem with what I call "bags". I define "bag" this way:
bag: a REST entity that aggregates other REST entities.
<person-with-relationships> is a small bag because it aggregates <person>.
The small bag example cited above is only one of a variety of bags that I have seen discussed and some of which are in the CT API that was presented at RootsTech this year. For example, in the CT API each "conclusion" has its own URI and can be operated upon (HTTP GET, DELETE) independently. By giving each conclusion its own URI, even simple <person> becomes a bag.
A couple of other common bags I have seen contemplated are the "bowtie" bag, (give me the person and his 1-hop relatives) and the "n-generations" bag (give me the ancestors/descendants of a given person to "n" generations). These bags may contain whole persons, or just "person summaries" but either way, they have huge implications for caching.
Let's consider the caching implications of the "bowtie bag". Suppose a man has three children, two parents, and a wife (who also has two parents). In this simple example, the "bowtie bag" person information for that man (whether a complete Person, or a SummaryPerson) is available in 9 different bowtie bags! Over time as a user navigates this tree he is likely to cache each of these bowties, as well as the individual
Let's review the problems associated with "bags" and "summary objects":
Problems with bags
Problems with "summary" objects (in or outside of bags)
This leads us to the following rule of thumb:
In general, each piece of information should have a single way of being retrieved/modified. Where a compelling need can be clearly demonstrated, this can be relaxed, but there is a strong burden of proof placed upon anyone wanting to recommend relaxing this rule.
Bug scrub bump.
A lot of water has gone under the bridge since this issue was opened. These concerns are valid, but this project is no longer the place to get them addressed. They are, generally, concerns of the application, so we need to pick up these issues over at FamilySearch/gedcomx-rs and hash out many of them within our FamilySearch implementation teams.
The “QName role” in RelationshipReference is “denormalized” information, right? The role is defined in the Relationship, so storing it in the reference is redundant and opens the possibility of it not being in agreement with the relationship. If we are putting denormalized information in the RelationshipReference then we ought to be explicit about why we are doing it. Is it to help the client know which relationships to dereference? For example, suppose we want to identify all the children of a person, we would only need to dereference the RelationshipReferences where the role is “Parent”. Is this the reason for the role on the reference?
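If the role is indeed there for that convenience, a client could use it roughly as sketched below, choosing which references to dereference without fetching every Relationship first. The `RelationshipReference` shape, the `href` values, and the `refs_to_dereference` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RelationshipReference:
    """Hypothetical reference held by a Person."""
    href: str  # where the Relationship can be fetched
    role: str  # role the owning Person plays in that Relationship


def refs_to_dereference(refs: List[RelationshipReference], role: str) -> List[str]:
    """Return only the hrefs where the owning Person plays the given role."""
    return [ref.href for ref in refs if ref.role == role]


bobs_refs = [
    RelationshipReference(href="/relationships/1", role="Parent"),
    RelationshipReference(href="/relationships/2", role="Child"),
    RelationshipReference(href="/relationships/3", role="Parent"),
]
# To find Bob's children, only the relationships where Bob is the Parent matter.
print(refs_to_dereference(bobs_refs, "Parent"))
```

The saving is real only if the role on the reference always agrees with the Relationship itself, which is exactly the consistency question raised here.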
If we are putting this kind of denormalized information on a person we need to be clear how we plan to keep everything self-consistent. My suggestion would be that we don’t allow Relationships to be created without two Persons and a “type”, and that these three pieces of information are immutable. (I know this flies in the face of Rontel’s opinions on the subject). This gets into all the life-cycle questions between Person and Relationship that need to be clearly defined and documented.
Though we have not yet been very explicit about this, I believe that the current state of the model implies the following:
• When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model that John Sumsion et al. are wanting, since their model requires a “directed-acyclic-graph”. It may be worth exploring their ideas some more before we shut the door on them, I don’t know. We would also need to explore what it would mean to the model if Persons didn’t have any explicit knowledge of the Relationships they participate in, or in other words, what if Persons were NOT modified when a Relationship is created that refers to them?
• When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of. I believe this is one of the side-effects of having Relationships as entities and now opens up the need for users to explicitly “watch” Relationships as well as Persons. To me, it would seem much more simple and natural for a user to “watch” a Person and be notified of any change to any Relationship that the Person participates in.)
• When a Person is deleted, the Relationships the Person is involved in are also deleted.
• When a Person is merged with another Person, what happens to the Relationships? (Is the “uniqueness constraint” part of the model? If so, then some merging of Relationships would seem to be implied. If not, then systems are free to do this differently. Unfortunately, this becomes an impediment to interoperability since one system may allow multiple relationships of the same type between the same two people, and another may insist upon uniqueness. My opinion is that this is the type of thing that a well-defined model is supposed to guard against and ought to be clearly specified.)