FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0
356 stars 67 forks source link

Bad Object Oriented Design #129

Closed jralls closed 12 years ago

jralls commented 12 years ago

The design of GedcomX reveals several bad characteristics:

There's more, and I'll add them (or others can) as I continue to analyze the code.

EssyGreen commented 12 years ago

I think that's a bit harsh and I think that GEDCOMX is attempting to go further than just "writing a data interchange format". Whether that is right or wrong is another matter but personally I welcome the broadening of GEDCOM from what was previously just an XML schema with little guidance on what the data really meant or how it should be interpreted. In order to get to a good data interchange format we have to have a good data model to exchange in the first place and I think that is what we're trying to do here.

However, I think you have a touched on a valid point in that the specific format/syntax of the data (XML, RDF, FOAF, DC etc) is maybe shaping the data model rather than the other way around.

jralls commented 12 years ago

This issue is not about the data model, nor about the file format. It's about the program design. Yes, we're here discussing the data model, but Ryan has written a computer program. I'm an experienced programmer (40 years or so), and the design of the program does not reflect good design practice. Harsh, perhaps, but engineers are like that.

GEDCOM is not an XML schema. Except for the never-released GEDCOM-6, it's not XML at all. As you point out, its lack of a well-expressed data model has limited its usefulness as an interchange format.

An XML schema is a fine way of expressing a data model -- though not if one is planning to use that data model for a relational database. A major problem with GedcomX at present is that a computer program is not a good way to express a data model, because there is too much detail getting in the way.

EssyGreen commented 12 years ago

An argument about the program design doesn't seem to me to be particularly helpful. We would have as many different views of right and wrong as we have people participating. I believe that the important thing for us to get right is the data model.

GEDCOM 5 was the equivalent of an XML schema in the days before XML became common-place (possibly even before it existed I'm not sure when XML was 'invented').

I agree that an XML schema is not optimum in defining a relational data model and that was sort of my point - that the data model seems to me to be being shaped by the decision to use XML rather than vice versa. I think that's my perception of your point re level of detail.

EssyGreen commented 12 years ago

... however, I think we may (hopefully!) be in agreement that it would be useful to have separation of the data model from the details of the interchange format.

stoicflame commented 12 years ago

So I'm trying to figure out what exactly is being requested; I'm going to need some clarification. I think there are two issues here (correct me if I'm wrong).

  1. You feel like there needs to be a separation between the data model and the serialization format.
  2. There are some bad practices that need to be removed in the code.

I totally agree with number 1. We knew that the interchange format is the primary thing that is to be delivered. The code has always been secondary to it. When deciding what tools to use to define the serialization format, we chose to use Java code to do so. It had some nice benefits: we were really familiar with it, Java's a pretty easy language for developers to understand, and Java has some really rich tools and specs for defining serialization formats (e.g. JAXB, JAX-RS).

Defining the serialization format in code also had the great benefit that the code was already there, ready to be used and consumed. Personally, I've been really happy with the choice. When we adjust the definition of the serialization format, we automatically get the code updated.

So help me understand: was this issue opened to challenge that decision? Is there a better development process being suggested?

jralls commented 12 years ago

Nope, no problem with Java per se, though as I've said elsewhere I think that designing in code is a poor practice, and you really need to get the specs into English so that non-programmers can join in the fun. But that's not the issue here, it's design.

The biggest and deepest problem is that almost all of the references inside the model are URIs. That completely conceals their purpose and prevents them from being constrained. I don't see any type-checking code in the setters, either. There's a few cases where a reference is of type TypeReference and one (Record::Attribution) where it's a ResourceReference -- neither of which reveals what sort of object it's a reference to. The "core" classes, the first-class classes in Common, Conclusion, and Record, should have attributes which are other classes in those packages. Conversion to URIs for interchange should be done in a serialization function in the referred to class. For example, GenealogicalEntity has an attribute persistentId of class URI. Instead, the attribute should be of class persistentId and GenelogicalEntity::serialize() should call PersistentId::serializeAsURI() to get the URI. This reveals the intent of the persistentId attribute and ensures that it can't be any of the other classes that also serialize as a URI.

The next bullet, Type Member Variables, arose because neither I nor Argo understood how Java does generics (Argo still doesn't). I see now that it's a convoluted effort to constrain the values of the "part" classes. Wouldn't it be simpler to just declare the enum types in the part class definitions and be done with it? Has there already been a discussion on the trade-offs of hard-coding the part type names? It makes sense for something like Marital Status, but there's no way you can reasonably foresee all of the possible values of Event type.

I thought the last point, Getters and Setters, is pretty obvious. You only need setters if you're going to change an attribute after construction, but for the most part these classes are going to be invariant: You'll construct them from the application's classes as appropriate and iterate or recurse over them calling their serialization methods to construct a file or stream for output or you'll call their deserialization constructors to build them from a file and then extract the values from their attributes to populate the application's objects. That last step can be done with getters, of course, but in this case data hiding is rather pointless, so you might as well just make the attributes public and let the application have its wicked way with them.

Clearer now? It's not that you defined the (first two) serialization formats in code, it's the way that code is plunked into the middle of the model instead of being out on the edge.

To go a bit further, you might consider separating the format code from the actual serialization code so that it's easier to write the specification in English and so that if for some reason you become disenchanted with enunciate you can easily change to something else.

EssyGreen commented 12 years ago

The biggest and deepest problem is that almost all of the references inside the model are URIs. That completely conceals their purpose and prevents them from being constrained.

+1

Getters and Setters

This isn't an issue for me since to make all the properties fixed fields instead is just adding constraints which won't add anything.

EssyGreen commented 12 years ago

Also (a slight deviation from the original point but nonetheless related) can we ensure that the Data Model can easily be implemented as either a hierarchical model and/or a relational model?

The tendency to explode small structures into embedded objects and to inherit everything from one or two base entities (see #135) is easy to implement in say XML but a nightmare for say SQL. We can of course embed the hierarchy in the relational database but this weakens the power of the relational aspects.

If not, then I would agree with @jralls in that GEDCOMX should focus only on the data exchange syntax and leave the logical data model alone. In which case much of the discussion in this forum is irrelevant.

jralls commented 12 years ago

Getters and Setters

This isn't an issue for me since to make all the properties fixed fields instead is just adding constraints which won't add anything.

Remember, we're not talking here about objects in your application, we're talking about objects that are transferred between applications. Your application will convert its representation into these objects, call their serialization methods, and send the results somewhere, or get a serialized stream, feed it to the factory object, get a bunch of these objects, and convert them to its representation.

stoicflame commented 12 years ago

Clearer now?

A bit clearer, but not totally. You do make some good points.

I think what you're saying is that you're missing the back-end object-oriented code that conforms to those "good" design principles you're citing. Is that accurate?

The Java code you're looking at in the repo was never designed to be more than the raw serialization code. There are some helper methods, etc, in there that make that serialization easier to do, but it's not designed to be used as strong business objects. All three of the OO design violations you make above can be explained by constraints on how the types need to be understood by JAXB so the serialization format is accurately defined.

jralls commented 12 years ago

Also (a slight deviation from the original point but nonetheless related) can we ensure that the Data Model can easily be implemented as either a hierarchical model and/or a relational model?

The tendency to explode small structures into embedded objects and to inherit everything from one or two base entities (see #135) is easy to implement in say XML but a nightmare for say SQL. We can of course embed the hierarchy in the relational database but this weakens the power of the relational aspects.

For what value of "easily"? Object orientation is pretty much the standard paradigm for software design these days, in large part because it facilitates mapping between the program and problem domains. RDB architecture maps well to a limited set of problem domains (it was invented for accounting and inventory tracking). For everything else, there's Object Relational Mapping. See Fowler, Patterns of Enterprise Application Architecture.

If not, then I would agree with @jralls in that GEDCOMX should focus only on the data exchange syntax and leave the logical data model alone. In which case much of the discussion in this forum is irrelevant.

Look, this is a software project. If you don't want to review the code, that's fine, but code reviews aren't irrelevant, they're the reason Github provides this facility: It isn't a "forum", it's a bug tracker. If you want a forum, vote for one at #42. You might add that you want two, one for code and one for "higher level" discussions.

However, until Ryan addresses #114, the code is canonical. It is the model. Everything else is an illustration and not necessarily up to date or even correct.

jralls commented 12 years ago

I think what you're saying is that you're missing the back-end object-oriented code that conforms to those "good" design principles you're citing. Is that accurate?

I'm only missing it if it's there and I don't see it. ;-)

I'm mostly looking at UML imported from the code, and I see a muddled design that doesn't properly expose the intent of its elements. Remember, one of my interests in this is implementing it in Python, and if the intent isn't clear, alternate implementations won't be, either.

The Java code you're looking at in the repo was never designed to be more than the raw serialization code.

That's not how you billed it at RootsTech. You said there that it was the reference implementation. In #114 you acknowledged that it's also the only complete representation of the proposed standard, which would include the data model.

There are some helper methods, etc, in there that make that serialization easier to do, but it's not designed to be used as strong business objects. All three of the OO design violations you make above can be explained by constraints on how the types need to be understood by JAXB so the serialization format is accurately defined.

No, the design errors can be resolved so that you have a stronger representation of the model and implementation neutral (i.e. not dependent upon enunciate) (de)serialization that clearly reveals the serial format.

I've got some work to do in other projects first, but I'll get something more concrete worked up for you in a week or two. In the meantime let's focus on #138.

stoicflame commented 12 years ago

You said there that it was the reference implementation.

If I said that, I was wrong. There's a lot more needed to be done to create a proper reference implementation.

In #114 you acknowledged that it's also the only complete representation of the proposed standard, which would include the data model.

And I still agree with that.

No, the design errors can be resolved so that you have a stronger representation of the model and implementation neutral (i.e. not dependent upon enunciate) (de)serialization that clearly reveals the serial format.

I don't disagree that we need a stronger representation of the model. The only representation of the model that we have today is derived by enunciate from the serialization format classes that are constrained by JAXB. I'm still not convinced that having a new (different) set of classes that are not constrained by JAXB that conform to the OO design principles you reference is the best way to do that. I guess I was kinda hoping that we could annotate the serialization classes in such a way that enunciate can do a better job of articulating the model.

EssyGreen commented 12 years ago

@jralls

Getters and Setters Remember, we're not talking here about objects in your application, we're talking about objects that are transferred between applications.

Yes I am aware - that's sort of my point - I can easily ignore the getters and setters in the java code we have at the moment. I don't give a fig - if Ryan wants them then fair enough. I'm not here to set the programming standards of the API but to ensure I understand (and hopefully can shape) the data and structure it will cater for.

Object orientation is pretty much the standard paradigm for software design these days

I was never objecting to OOD just to the specific way that the model is making it difficult to implement it in anything but a hierarchical form.

It isn't a "forum", it's a bug tracker

Really? If this is the case then I'm definitely in the wrong place. I was under the impression we were being directed here to participate in the development of the standard. I'm reviewing the code in that I want to understand the detail of what is evolving - I'm certainly not doing technical debugging and testing!

stoicflame commented 12 years ago

If this is the case then I'm definitely in the wrong place.

You're not in the wrong place. This is an issue tracker and we're currently using it as a forum, too. If you think a mailing list is a better tool for these kinds of things, add your comment to #42

jralls commented 12 years ago

If this is the case then I'm definitely in the wrong place.

You're not in the wrong place.

Except that you're probably not interested in this issue, which is about reviewing the code.

EssyGreen commented 12 years ago

@stoicflame - phew thanks! and no I'd rather not go via a mailing list thanks. This format suits me fine :)

@jralls - I was interested (and agreeing with) the need for separation of object model and serialisation code. Just not agreeing with you on some of the other points (e.g. Getters and Setters).

jralls commented 12 years ago

Sorry for not getting back to this sooner, I had a couple of higher-priority issues to deal with.

The only representation of the model that we have today is derived by enunciate from the serialization format classes that are constrained by JAXB. I'm still not convinced that having a new (different) set of classes that are not constrained by JAXB that conform to the OO design principles you reference is the best way to do that. I guess I was kinda hoping that we could annotate the serialization classes in such a way that enunciate can do a better job of articulating the model.

It appears from my cursory look that JAXB is a code generator for making Java serialization classes from an XML Schema. What does "constrained by JAXB" mean? Why would JAXB be so important that it would be permitted to constrain the design, when traditional XML tools (SAX and DOM) impose no such constraints?

stoicflame commented 12 years ago

It appears from my cursory look that JAXB is a code generator for making Java serialization classes from an XML Schema.

Correct.

What does "constrained by JAXB" mean? Why would JAXB be so important that it would be permitted to constrain the design, when traditional XML tools (SAX and DOM) impose no such constraints?

The design of the model isn't constrained by JAXB. It's the nature of the source code that's constrained by JAXB. Setters, for example, are used by JAXB to populate the properties of the object.

jralls commented 12 years ago

It appears from my cursory look that JAXB is a code generator for making Java serialization classes from an XML Schema.

Correct.

Then you'd be ahead by specifying the XML Schema and letting JAXB do its thing rather than trying express your XML Schema in JAXB-like Java code.

The design of the model isn't constrained by JAXB.

But the design of the model exists only in your head. The rest of us have to try to figure it out from the code, and that's made much more difficult by your "JAXB constraints" and RDF dependencies inside the model classes. Sarah has opened #144 and #146 which are specific examples of my first complaint here.

Now, backing up a bit:

I don't disagree that we need a stronger representation of the model. The only representation of the model that we have today is derived by enunciate from the serialization format classes that are constrained by JAXB. I'm still not convinced that having a new (different) set of classes that are not constrained by JAXB that conform to the OO design principles you reference is the best way to do that. I guess I was kinda hoping that we could annotate the serialization classes in such a way that enunciate can do a better job of articulating the model.

That doesn't seem to be working out, but I think that this particular issue isn't going to resolve the problem, so I'm closing it in favor of #114.