FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

add links to base models and get rid of 'www' extension models #4

Closed. carpentermp closed this issue 13 years ago.

carpentermp commented 13 years ago

This approach of having a “www” flavor of an object that has “links” seems to be getting more and more problematic. It is exploding the number of classes (because lots of objects now have a “www” counterpart). Also, the WWW flavor of “Person” is hard to consume because it extends “ordinary Person”, which has lists of “ordinary Name”, “ordinary Event”, etc. You commented that you will change these to be lists of “? extends {object}”. This helps, but won’t entirely solve the problem, because anyone working with a WWW Person, when iterating the Names, for example, will have to check each Name to see if it is a “WWW Name” or an “ordinary Name”. If it is a “WWW Name” then the links are available, but if it is an “ordinary Name” they are not. Yuck.
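
A minimal Java sketch of the problem (class names here are hypothetical, not the actual gedcomx classes):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the current "www" extension approach.
class Link {
  String rel;
  String href;
}

class Name { /* ordinary name fields */ }

class WwwName extends Name {
  List<Link> links = new ArrayList<>();
}

class Person {
  List<Name> names = new ArrayList<>(); // or List<? extends Name>
}

class WwwPerson extends Person {
  List<Link> links = new ArrayList<>();
}

class Consumer {
  void printNameLinks(WwwPerson person) {
    for (Name name : person.names) {
      // Even on a WwwPerson, every Name must be inspected individually.
      if (name instanceof WwwName) {
        for (Link link : ((WwwName) name).links) {
          System.out.println(link.rel + " -> " + link.href);
        }
      }
    }
  }
}
```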

I propose we just have one set of objects, with Links, and that we ask the Data-Framework guys to just swallow the pill that their Records have a place for “Links” that they won’t be populating. I know they have pushed back hard on this, but sometimes concessions need to be made for the overall good.
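
And a sketch of the proposal, again with hypothetical names: one set of classes, where the link list is simply empty for bulk-exchanged data:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unified model: one set of classes, links always present
// but possibly empty.
class Link {
  String rel;
  String href;
}

class Name {
  List<Link> links = new ArrayList<>(); // empty in bulk-exchanged records
}

class Person {
  List<Name> names = new ArrayList<>();
  List<Link> links = new ArrayList<>(); // populated only by web-hosted APIs
}
```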

stoicflame commented 13 years ago

+1

dkohlert commented 13 years ago

-1. It would be much cleaner and simpler if the two models did not share classes but simply mapped between each other.

jeffph commented 13 years ago

-1

We also have attributes that are specific to our context (e.g. databaseId, version, timestamps, etc.) that require us to extend/map the shared model. However, we understand it doesn't make sense to force you guys to have those attributes.

carpentermp commented 13 years ago

databaseId? What is that? As for version and timestamp, they belong in both contexts.

carpentermp commented 13 years ago

After seeing the backflips we are having to do in order to extend the "base" profiles, I agree with Doug that, if we must keep the separation, then the best thing to do would be to not extend at all, but just create two separate models and provide a mapping between them.
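
If we kept the two models fully separate, the mapping Doug describes might look something like this sketch (all class and method names are illustrative only, not actual gedcomx types):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative mapper between two independent models that share no classes.
class RecordPerson {              // bulk-exchange ("record") model
  String id;
  List<String> names = new ArrayList<>();
}

class WwwPerson {                 // web-hosted ("www") model
  String id;
  List<String> names = new ArrayList<>();
  List<String> links = new ArrayList<>();
}

class PersonMapper {
  static WwwPerson toWww(RecordPerson record, String baseUri) {
    WwwPerson www = new WwwPerson();
    www.id = record.id;
    www.names.addAll(record.names);
    www.links.add(baseUri + "/persons/" + record.id); // self link
    return www;
  }

  static RecordPerson toRecord(WwwPerson www) {
    RecordPerson record = new RecordPerson();
    record.id = www.id;
    record.names.addAll(www.names);
    // links are dropped; they have no meaning in the record model
    return record;
  }
}
```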

jeffph commented 13 years ago

Sorry, I should've explained those examples in more detail. "DatabaseId" is simply a numeric value that serves as our primary key in the database. There are other id values (foreign keys) in our extended class that we wouldn't want in the shared model either.
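
In other words, fields like these would stay in an implementation-specific subclass (a hypothetical sketch; none of these are real gedcomx classes):

```java
// Hypothetical persistence-layer subclass of a shared-model class.
class Person {
  String id; // shared-model field
}

class PersistedPerson extends Person {
  long databaseId;          // numeric primary key, internal to our database
  long containingRecordFk;  // foreign key, likewise internal
  long version;             // Merlin argues this belongs in both contexts
  java.util.Date lastModified;
}
```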

carpentermp commented 13 years ago

Jeff, it seems you may be referring to implementation-specific needs that virtually every implementation will have. Of course it would not make sense to add such things to a standard model. I believe we are weighing the pros and cons of these two options:

  1. having different-but-strongly-related standard models for genealogical data that is web-hosted vs. bulk-exchanged.
  2. having a single unified model for all genealogical data, whether web-hosted or bulk-exchanged.

So far, we have been trying to make the "bulk-exchanged" model a pure subset of the "web-hosted" model, but it is making a mess of our classes and our inheritance model. If the superset-subset approach isn't worth it, then we are forced to either unify the models, or map between them. In order to choose wisely, we all need a clear understanding of the pros and cons of the different approaches. Let me provide a cursory list of what I perceive to be the tradeoffs that I hope everyone will fill in with more detail:

Separate-but-related-models pros and cons

The primary advantage of having separate-but-related models would appear to be that, for those whose purpose is the bulk-exchange of genealogical data, the model they deal with is simpler and tailored to their purpose. In general, it contains only what they need. Disadvantages are either:

Unified Model pros and cons

The advantages of a unified model are the disadvantages of the other approach and vice-versa. The primary advantage is a single model to develop, describe, and maintain. The primary disadvantage is experienced by applications where bulk-exchange is the only usage of the data. These applications experience a model that has elements in it that may have no applicability for their bulk-exchange operations.

The weight we assign to each of these pros and cons tends to depend upon where we spend most of our time. If you are developing an application that primarily does bulk data exchange, then having a model tailored to that purpose seems really important, and having to maintain two models doesn't seem very bad. If you are maintaining the model, or if you spend your time on web-hosted genealogical data, then having a unified model seems really important, and not having a trimmed model for bulk-exchange doesn't seem like much of a problem at all.

With the confession that I am in the latter camp, my personal assessment of the importance of the tradeoffs is as follows:

stoicflame commented 13 years ago

It's great to hear some new voices!

I don't agree with the "either-or" scenario. We know that there are going to be at least some top-level elements that are specific to the WWW/REST interface. A root-level persona, for example, that includes its record. Or a root-level representation of a record that includes the record's metadata. I'm thinking that there are going to be lots of others. Are you saying that all these things need to be in the same namespace as the "base" record? Why? That seems like a clear and unnecessary violation of the interface segregation principle.

I thought this thread was just about applying links on the "base" objects as needed so we don't have to end up extending every single object in the base model.

dkohlert commented 13 years ago

Merlin, if we go with the two-model approach, I see no need for GedcomX.org to provide the mapping utilities. If you ask my opinion, I think the GedcomX specification should be nothing more than XSD schema definitions and documentation. In addition to that, GedcomX.org could (and probably should) provide any number of reference implementations of that spec. The first reference implementation would be Java-based. As other organizations join the effort, they could provide other implementations.

If I were to do this all from scratch, I would take the following approach:

  1. Create a model to represent source data, without multiple views (such as a person-centric view of a record), in XML schema. There would be only one way to create/interpret the data. This model would be entirely agnostic of what applications, APIs, etc. use it.
  2. Create an implementation of that spec as a proof of concept in Java.
  3. Create an entirely different specification defining an API for each application that needs one. Only the implementations of such an API would ever deal with the source data model, and they would be responsible for mapping the source data model into the model exposed by that API/application. Hopefully only a small handful of organizations would ever have to do this.

Why would I do this?

  1. Applications and API models do not/can not influence the source data model.
  2. Many different API/application models can be created without affecting each other. What if in the future the notion of links becomes irrelevant? It would be unfortunate to have to modify the source data model to remove or replace links when they were never needed to represent the source data anyway. Also, each of these APIs can provide as many different views onto the data as they want without having to modify the basic source data model.

Once you open the door to allow API/application specific information to be stored in the source data model you have to 1) maintain it forever, 2) fight the battle every time some party thinks that an API/application specific feature should be part of the source model just because it makes the implementation easier for them, even though the rest of the world would ignore it.

As a side note, IMHO, there are not going to be very many organizations/individuals that will ever have to deal with both the source data model and any other API provided by GedcomX.org or any other vendor, and thus they will not have to deal with the mapping. Most applications will use some sort of API to access records one at a time, as our API does.

carpentermp commented 13 years ago

I thought this thread was just about applying links on the "base" objects as needed so we don't have to end up extending every single object in the base model.

Yes, that's mostly what I meant. I never meant to imply that all of GedcomX would be in a single namespace, or that new representations would never be developed that have primarily "web-hosted" utility. But I think we need to consider with each representation, that "web-hosted" may not be the only use. For example, the "root-level representation of a record that includes the record's metadata" may also be useful in a bulk-exchange scenario.

carpentermp commented 13 years ago

I'm thinking that there are going to be lots of others.

I'm having some heartburn with the idea of lots of representations of the same data. It would seem to me that this is going to complicate the model, complicate the life of clients who now have to understand these different representations, and the versioning/caching model for representations that are a composite of multiple "entities" is still very unclear to me.

stoicflame commented 13 years ago

Doug, I appreciate your viewpoint, but I totally disagree with it. I just can't accept your reasons for doing it that way.

Applications and API models do not/can not influence the source data model.

The model has no purpose for being unless it's being consumed/provided by an application. And anyway, nobody's talking about influencing the model; we're just talking about adding some totally optional elements to the serialization format to support applications and APIs. And Merlin and I think you're going to need links in the bulk data exchange just as much as you're going to need them in the REST API. For example, somehow, you've got to link an entity with its metadata, right? Is it just that you don't want to do it that way?
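
For example (a hedged sketch, not the actual gedcomx serialization), the same link mechanism could serve both contexts:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one Link type serving both bulk exchange and the REST API.
// Names are hypothetical, not the actual gedcomx API.
class Link {
  String rel;
  String href;

  Link(String rel, String href) {
    this.rel = rel;
    this.href = href;
  }
}

class Record {
  List<Link> links = new ArrayList<>();
}

class Example {
  public static void main(String[] args) {
    Record record = new Record();
    // In a bulk export, the metadata can travel in the same file and be
    // referenced by a fragment identifier instead of an absolute URL.
    record.links.add(new Link("metadata", "#metadata-1"));
    // In the REST API, the same rel resolves to a dereferenceable URI.
    record.links.add(new Link("metadata", "https://api.example.com/records/1/metadata"));
  }
}
```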

Many different API/application models can be created without affecting each other.

That's what spec versioning is for. By the time something becomes obsolete and irrelevant, we'll have a new version to move to, if it's done right.

Once you open the door to allow API/application specific information to be stored in the source data model you have to 1) maintain it forever, 2) fight the battle every time some party thinks that an API/application specific feature should be part of the source model just because it makes the implementation easier for them, even though the rest of the world would ignore it.

Yep. And I, for one, think those kinds of discussions and that tension is a good thing.

As a side note, IMHO, there are not going to be very many organizations/individuals that will ever have to deal with both the source data model and any other API provided by GedcomX.org or any other vendor, and thus they will not have to deal with the mapping. Most applications will use some sort of API to access records one at a time, as our API does.

That's a big assumption to make, I think. I also think it's kinda flawed.

stoicflame commented 13 years ago

I'm having some heartburn with the idea of lots of representations of the same data. It would seem to me that this is going to complicate the model, complicate the life of clients who now have to understand these different representations, and the versioning/caching model for representations that are a composite of multiple "entities" is still very unclear to me.

Yes, we definitely need to be careful.

By the way, I think I've been misusing the word "representation". Sometimes I say "representation" when I really mean "resource". The difference is that "representation" implies a different media type of the same "thing", when all I really mean is a new "thing" altogether. I.e. a new resource.

I'll try to be more careful in using my terms.

dkohlert commented 13 years ago

I think we need to have a meeting to discuss all of this.

As a reminder, Ryan, Tom, and I met on the question of links being included in the record model. It was decided that they would not be part of the record model. If you would like to open that back up, please schedule a meeting with Tom, Randy S, Steve C, and myself, and let's rehash this. Second, an entity should not be aware of its metadata; if it is aware of its metadata, it is no longer metadata.

Another assumption that we are making is that the exchange between CDS and BOT is based on the record model and not the www model. If this is not the case, we need to discuss that as well.

carpentermp commented 13 years ago

By the way, I think I've been misusing the word "representation". Sometimes I say "representation" when I really mean "resource". The difference is that "representation" implies a different media type of the same "thing", when all I really mean is a new "thing" altogether. I.e. a new resource.

I'll try to be more careful in using my terms.

There is somewhat of a blurring between "representation" and "resource" that happens when we create "resources" that represent a "composite" of model entities. Strictly speaking, a composite would have to be considered a new "resource" because it has a different URL, but it is comprised of objects that each have their own URL, and so are resources in their own right. This is one of the things that makes me uncomfortable, because I don't yet understand the versioning/caching model for these "resources".

In REST, it seems to me that the natural way of looking at it would be that every "resource" is an "entity" with its own "identifier" (the URL). Versioning and caching are based upon this. When we create a resource (entity) that is actually a "composite" of our actual underlying genealogical entities, I am still wondering how we give a version to it or report on its cacheability. Does anyone have any thoughts on this? (Examples of composite resources might be a "family" in the conclusion profile, or a "GedcomX bag" that holds a Person and all his 1-hop relatives.)
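
One possible sketch of an answer, assuming each constituent entity already carries its own ETag (this is only an illustration, not a settled design): derive the composite's ETag from its members' ETags.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Sketch: derive an ETag for a composite resource from its members' ETags.
// The composite stays cacheable, and it changes whenever any member changes.
class CompositeEtag {
  static String of(List<String> memberEtags) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      for (String etag : memberEtags) {
        digest.update(etag.getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator so concatenations stay unambiguous
      }
      StringBuilder hex = new StringBuilder("\"");
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.append('"').toString(); // HTTP ETags are quoted strings
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is guaranteed on the JVM
    }
  }
}
```

Under a scheme like this, a "family" resource would change (and caches would invalidate) whenever any member person changes; how a client discovers the member versions would remain an open question.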

carpentermp commented 13 years ago

As a reminder, Ryan, Tom, and I met on the question of links being included in the record model. It was decided that they would not be part of the record model. If you would like to open that back up, please schedule a meeting with Tom, Randy S, Steve C, and myself, and let's rehash this.

My understanding was that this site is now the forum for discussing the model. Once we open it up to the community, it would seem that that would necessarily be the case. Given that, I am still unclear how it is intended that decisions be made now and in the future. Right now it would seem that they are made by Ryan as the "benevolent dictator" after an attempt to reach consensus.

carpentermp commented 13 years ago

an entity should not be aware of its metadata; if it is aware of its metadata, it is no longer metadata

This seems like a semantic argument that doesn't address the appropriateness of metadata being available in a bundle with the data it describes.

dkohlert commented 13 years ago

We have to discuss what metadata you are talking about. I am not aware of any as yet for the record model.

As far as decisions go, my understanding is that Ryan is the moderator; if a consensus can be gathered, he has the authorization to make it so. If a consensus cannot be gathered, we still need to discuss how that will be handled. If internal consensus cannot be achieved, I think we have to defer to the architects as we have in the past.

carpentermp commented 13 years ago

Do you suggest this model even after we go to the community with the model?

dkohlert commented 13 years ago

Do you have an issue with the architects being involved?

carpentermp commented 13 years ago

No, no issue. I'm just wondering what the governance model for GedcomX will be. Will it be wholly controlled by FamilySearch? That appears to be what you are describing. It is a valid model, but it has to be spelled out explicitly. Your statement that "if consensus cannot be achieved" we have to defer to "the architects" is still a little fuzzy. What is the process for attempting to achieve consensus? What prompts the escalation to "the architects"? Which architects need to be involved? How do they make the decision? I believe all of this needs to be formalized when we take GedcomX to the community. If we are keeping total control, we may not need to explain our internal decision-making processes to the community, but the community needs to understand what avenues are available to them to influence the model.

"Deciding how to decide" is one of the first orders of business for any OpenSource project.

dkohlert commented 13 years ago

I totally agree that we need to determine all of this before we go to the community. I am sorry that I misunderstood where you were going.

I don't think how we handle internal disputes has any bearing on whether we keep total control or not. If for some reason we need to resolve an internal disagreement, we go to our internal board, which I assume will include some number of the architects, come to a decision, and then Ryan can make a post stating either the final decision (if we maintain control) or our official FS stance (if the community has control).

stoicflame commented 13 years ago

Hi guys.

Please comment on the pull request at issue #40 which is the current proposal to address this issue.

https://github.com/FamilySearch/gedcomx/pull/40

stoicflame commented 13 years ago

Fix applied at 159338c, along with fix for issue #62.