merge RecordField and Characteristic

FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.

http://www.gedcomx.org

Apache License 2.0

merge RecordField and Characteristic #69

Closed jeffph closed 12 years ago

jeffph commented 12 years ago

In the record model, RecordField and Characteristic are structurally identical. Can we choose one of these to use throughout the record model and remove the other? A rename of the selected class may also be necessary for this change.

dkohlert commented 12 years ago

stoicflame commented 12 years ago

So I'm not totally clear as to what you're proposing. What exactly would you like to name this RecordField/Characteristic?

Just to be clear, it's not totally accurate to say that the two structures are identical because they're designed to take a different set of types, right? So you'd also have to merge the two types, too, right?

jeffph commented 12 years ago

The naming is always the hard part. :)

A while back, we just had Characteristic and all the types were in CharacteristicType, including record characteristics. This seems like it would work ok since CharacteristicType already has values spanning multiple entities (i.e. Persona and Relationship). Seems like adding Record wouldn't be too much of a stretch.

stoicflame commented 12 years ago

I suppose I don't have an objection to adding Characteristic to Record. @carpentermp, what do you think?

carpentermp commented 12 years ago

I would not be in favor of this change. My thinking on this issue is a little involved, so allow me to give some background. In my mind, Records in our model have a dual nature. In some cases, a Record may be simply a set of Fields, with no structure. In other cases, a Record may be comprised only of structure and have no Fields (or more correctly, since Names, Dates, Places, etc. are all derived from Field, we might say that there are no Field "labels"). Theoretically, a Record could initially be extracted simply as a set of Fields, and the structure may be imposed afterwards by applying a template. (While we may be getting away from this, it is what we have been doing for years.) When the template is applied, some fields may be brought "into the structure" of the Record (i.e. into the Personas, Events, Relationships), while other Fields remain in the unstructured part of the Record as a set of residual Fields. Nearly all the Records from our current Collections would fall into this hybrid state, where the Record is neither all "structure" nor all "Fields". Also, in Records that have Field labels, even after structure has been imposed there remains the need to sometimes view the Record as though it were a completely unstructured set of Fields.
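The hybrid state described above can be illustrated with a minimal sketch. It assumes a template is just a mapping from field label to a persona attribute; the class names, field labels, and `apply_template` helper are hypothetical and are not part of GEDCOM X or SoRD.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    label: str   # hypothetical field label, e.g. "PR_NAME"
    value: str


@dataclass
class Record:
    personas: list = field(default_factory=list)  # the structured part
    fields: list = field(default_factory=list)    # the residual, unstructured part


def apply_template(record, template):
    """Move every field the template maps into the structure.

    `template` maps a field label to a persona attribute name. Fields the
    template does not mention stay on the record as residual fields.
    """
    persona = {}
    residual = []
    for f in record.fields:
        target = template.get(f.label)
        if target is None:
            residual.append(f)
        else:
            persona[target] = f.value
    personas = record.personas + ([persona] if persona else [])
    return Record(personas=personas, fields=residual)


# A record extracted as a flat set of fields (hypothetical labels and values).
extracted = Record(fields=[
    Field("PR_NAME", "John Doe"),
    Field("PR_BIRTH_DATE", "1 Jan 1900"),
    Field("PAGE", "12"),
])
template = {"PR_NAME": "name", "PR_BIRTH_DATE": "birth_date"}
hybrid = apply_template(extracted, template)
```

Here `hybrid` carries one persona built from the mapped fields, while `PAGE` remains in the unstructured list, which is the "neither all structure nor all Fields" case.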

The SoRD model did not contemplate the "hybrid state" of "structure" and "residual fields". Instead, a Record was logically "all structure", or "no structure" or both (with the data expressed redundantly). When imposing structure via a template, Fields that had no other place to go were forced into Record Characteristics. Characteristics have a CharacteristicType and so we were forced to choose one, and we chose "OTHER". Because of the need to sometimes view Records as a set of Fields, we gave Records in the model a list of Fields, and we provided a way in the web service to request a Record normally (with no Fields returned--all structure), or as a set of Fields (with no structure returned), or both--with the data expressed redundantly.

(As an aside, there is one feature provided by the "Fields" view of a SoRD Record that, as far as I know, has still not been addressed in GedcomX --perhaps I should open an issue on it--SoRD Fields have a "display name", which is the localized name that should be shown to the user for the field when displaying the Record in "field: value" form. An empty display name meant that the field would not be shown to users. For data exchange, however, you would really specify this information once for the whole collection—not on every Record. This is currently done via the “display map” of a collection. Currently, English is the only language our pipeline supports for these values, but there has been a lot of discussion about translating them. When fetching a Record in a web service for display, it would be necessary to have a way of accessing the “display names” of the Record in the desired language. This could be accomplished with a URL to the collection where this information could be found, but having to consult the collection in order to display a Record might be a little tedious. It might be better to have this included automatically in the web profile of a Record.)

Now, with that background, let me go over what I see as the advantages of “RecordField” over Record Characteristic. Characteristics on Record always seemed a little bit of a wart in the SoRD model to me and I was looking forward to not having them anymore in GedcomX. As far as I know, there never were any Record Characteristics that ever had a CharacteristicType besides OTHER (with a description, which was generally the EASy field id). What was missing from the SoRD model, and would have been really useful, was a “datatype” for the fields of a Record. For those Fields in the “structured” part of the Record, the “datatype” is always known implicitly: a Name has a NameType, a DatePart has a DatePartType, etc. What about the Fields that are not in the structured part of a Record: what is their datatype, and why would this be useful?

Over the years we have had several discussions about “waypoints” and “light indexing” and how a lightly indexed Record “grows up” to be a full-fledged Record. I have always said that it ought to be possible to use the Record model for waypoint-style data. After all, a Record is an extraction of genealogically interesting data from an arbitrary division of a source. Waypoint data fits this definition in every particular. The division boundaries for waypoints typically coincide with image boundaries, whereas the boundaries for many of our “full fledged” Records do not, but this doesn’t change the nature of the data and needn’t change the model used to describe it. A Record of RecordFields, each with a datatype, would accommodate Waypoint-style data very well and pave the way for Waypoint Records that “grow up” to be (or provide seed information for) full-fledged Records at some point. The datatype allows some treatments to be applied to the data that could not be applied without it. For example, there is a desire to be able to search waypoint data, but in order to be able to search effectively we need to know when a value is a “place”, a “date”, a “name”, or a “race” or whatever. (This is an oversimplification: some names are “surnames”, some dates are “months”, some places are “counties”, etc., so there are actually many datatypes.) Even if someone disagrees that the Record model should be used for Waypoint data, collection-specific search would really benefit from a datatype on Fields that don’t end up in the structured part of a Record.
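To make the search argument concrete, here is a small sketch of an index keyed on datatype as well as value. The record ids, labels, datatype names, and values are all made up for illustration; no real search system is implied.

```python
from collections import defaultdict

# Hypothetical waypoint-style records: each field is (label, datatype, value).
records = {
    "r1": [("County", "place", "Utah"), ("Year", "date", "1880")],
    "r2": [("Surname", "name", "Utah"), ("Year", "date", "1900")],
}


def build_index(records):
    """Index record ids by (datatype, lowercased value)."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for _label, datatype, value in fields:
            index[(datatype, value.lower())].add(record_id)
    return index


def search(index, datatype, value):
    """Return the sorted ids of records holding `value` with that datatype."""
    return sorted(index.get((datatype, value.lower()), set()))


index = build_index(records)
```

Without the datatype, a search for the place "Utah" could not be told apart from a search for the surname "Utah"; with it, `search(index, "place", "Utah")` only matches `r1`.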

I suppose it would still be possible to rename RecordField to Characteristic, and create a bunch of characteristic types that are really datatypes, but this would seem to muddy the model.

stoicflame commented 12 years ago

Wow. This was an awesome response. Thanks for taking the time.

To be honest, I was having my own reservations about collapsing RecordField and Characteristic back again. (RecordField just doesn't "feel" like a Characteristic to me.) But I wasn't able to articulate why so I didn't object to it. Having @carpentermp's response is really helpful and I think it's very convincing.

I wanted to respond to this:

there is a desire to be able to search waypoint data, but in order to be able to search effectively we need to know when a value is a “place”, a “date”, a “name”, or a “race” or whatever.

Very good point. Since we've integrated with RDF, we already have well-defined "type URIs" for name, date, place, etc. So it's a very natural thing to do something like:

<record>
  <field>
    <type>http://gedcomx.org/record/v1/Name</type>
    <value>...</value>
  </field>
  <field>
    <type>http://gedcomx.org/record/v1/Place</type>
    <value>...</value>
  </field>
  ...
</record>

The model is already designed to support that.
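As a rough illustration of how a consumer might use those type URIs, the sketch below groups field values by type. It assumes the element layout shown in the snippet above, which is an informal example rather than the actual GEDCOM X serialization, and the values are invented.

```python
import xml.etree.ElementTree as ET

# Same shape as the snippet above, with invented values filled in.
SAMPLE = """
<record>
  <field>
    <type>http://gedcomx.org/record/v1/Name</type>
    <value>John Doe</value>
  </field>
  <field>
    <type>http://gedcomx.org/record/v1/Place</type>
    <value>Provo, Utah</value>
  </field>
</record>
"""


def fields_by_type(xml_text):
    """Group the <value> text of each <field> under its <type> URI."""
    root = ET.fromstring(xml_text.strip())
    grouped = {}
    for field_element in root.findall("field"):
        type_uri = field_element.findtext("type")
        grouped.setdefault(type_uri, []).append(field_element.findtext("value"))
    return grouped


grouped = fields_by_type(SAMPLE)
```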

jeffph commented 12 years ago

We need to be careful with how we intend to use Record Fields/Characteristics. In my opinion, we shouldn't have alternate renderings of the same record within the same model. If a different view of a record is needed, such as a fielded view for presentation purposes, either the extensions should be used or the record should be mapped to a separate implementation-specific model (my preference) that more directly supports a fielded view. Otherwise, the inconsistent state of a record will cause us to receive inconsistent updates. That is, when we receive a record to be saved, we wouldn't be able to reliably determine if a user is simply editing a value within the field-view of a record vs. moving data out of our structure and intentionally flattening the record to fields-only in order to "start over".

Additionally, if a type is known for a RecordField, it seems like we should go ahead and put it into the actual object for that type. For example, if we know a Record Field/Characteristic is a date or place, let's create a Date or Place object within an Event, even if nothing else is known about the event. This makes for easier linking later. Otherwise, there's not much benefit in keeping it out of the structured part of the model.

For reasons above, and for several other more significant reasons that are too involved to enumerate here, the record model won't work for browse data.

The Record Fields/Characteristics, then, should only be for extra items that don't fit into the structure, in my opinion. There is a case to be made, however, for not having Record Fields/Characteristics at all. As I've analyzed the Record Characteristics in our existing data, it seems they can be grouped into the following categories:

Bibliographic information - Examples are volume number, page number, NARA publication number, etc. This should probably go in metadata in gedcomx.
Internal/Operational information - DGS Number, Batch, ImageNumber, RecordGroup, etc. These probably ought to be in the extensionElements or a separate subclass. Metadata is also an option for some of these items.
Summary information - BirthYear, BatchLocality, HouseholdId, DeathAge. This information is only for convenience and is simply derived from more detailed data within the record. In my opinion, it should not be part of the record proper.
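The three categories above could be sketched as a simple lookup. The label spellings and the "suggested home" strings are assumptions that paraphrase the list; they are not identifiers from any real data set.

```python
# Hypothetical label spellings, grouped as in the three categories above.
CATEGORIES = {
    "bibliographic": {"Volume", "PageNumber", "NaraPublicationNumber"},
    "operational": {"DgsNumber", "Batch", "ImageNumber", "RecordGroup"},
    "summary": {"BirthYear", "BatchLocality", "HouseholdId", "DeathAge"},
}

# Where each category is suggested to live, paraphrasing the list above.
SUGGESTED_HOME = {
    "bibliographic": "metadata",
    "operational": "extension elements",
    "summary": "derived; not stored on the record",
}


def categorize(label):
    """Return the category for a record-level label, or 'uncategorized'."""
    for category, labels in CATEGORIES.items():
        if label in labels:
            return category
    return "uncategorized"
```

Anything that falls through to `uncategorized` would be the kind of item that still needs a Record Field/Characteristic.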

So, maybe we should remove Record Fields/Characteristics altogether to minimize mis-use of the model. I lean towards keeping them in the model for insurance, but I could be convinced to remove them as well, at least until we have valid use cases or increased demand.

stoicflame commented 12 years ago

In my opinion, we shouldn't have alternate renderings of the same record within the same model.

Still trying to understand this statement; having multiple representations of the same data is a well-proven technique. But I don't think that's what we're talking about here.

If a different view of a record is needed, such as a fielded view for presentation purposes, either the extensions should be used or the record should be mapped to a separate implementation-specific model (my preference) that more directly supports a fielded view.

I don't think @carpentermp was talking about "presentation purposes", I think he was talking more about data lifecycle; i.e. the record actually changes as it's "born" as a fielded record and then modified to support structure.

Otherwise, the inconsistent state of a record will cause us to receive inconsistent updates. That is, when we receive a record to be saved, we wouldn't be able to reliably determine if a user is simply editing a value within the field-view of a record vs. moving data out of our structure and intentionally flattening the record to fields-only in order to "start over".

Why does it matter? If the record is "updated" into a field view, store it as a bunch of fields on the record. If it's structured, store it structured. Like I mentioned earlier, I don't think we're talking about different representations of the same data; I think we're talking about different data.

Additionally, if a type is known for a RecordField, it seems like we should go ahead and put it into the actual object for that type.

You may be right. Personally, I haven't formed a strong opinion about it.

For reasons above, and for several other more significant reasons that are too involved to enumerate here, the record model won't work for browse data.

We might have to get into the ones that you haven't enumerated because I'm not seeing the "why not" yet.

So, maybe we should remove Record Fields/Characteristics altogether to minimize mis-use of the model.

Maybe. I assume by "misuse" you mean using fields for "browse data"? Because I'm not yet convinced that's misuse. But I'm open to being convinced.

dkohlert commented 12 years ago

The browse model is used to browse artifacts, not records; you cannot use a record to model the browse artifacts because it is not a one-to-one mapping. Now, the model that is used for browse data may be similar to a record so that the data can be copied from the artifact to a record when the record is defined, but I think we need to keep two different objects: one for the artifacts and one for the records.

As for mixing structured records and bare fields on a record: an unstructured record goes against everything that we have been trying to achieve with this data model. That does not preclude serving up a "fielded view" of a record via an API, but as far as the source record and what we store and pass around, we should be dealing only with the structured form.

jeffph commented 12 years ago

Apologies for not being more clear. Here's a bit more explanation to help.

My opposition to alternate renderings is based on this comment:

Also, in Records that have Field labels, even after structure has been imposed there remains the need to sometimes view the Record as though it were a completely unstructured set of Fields.

Yes, of course having multiple representations of the same data is very conventional. However, the standard practice is to have separate models for each representation, assuming they have different requirements. For example, a data model is typically a different model than a presentation model. You'll notice my statement had the significant qualifier "...within the same model". With this clarification, we have the edit inconsistencies I mentioned.

The proposed data life-cycle described earlier is not entirely accurate for gedcomx. Once data is in gedcomx format, it isn't in name-value pairs anymore. That is, we have templates and processes to map field data into a gedcomx record. Field-values are never imported into a gedcomx record as a flat list of Characteristics/Fields, and then subsequently repositioned into the structure.

Glad to discuss browse requirements in more detail, but this should be done in person. There's a lot involved.

carpentermp commented 12 years ago

Yes, I think we need a meeting (possibly a series of meetings) to discuss "browse data" since it appears that our thinking on this issue is still quite divergent. If it is true that "Record" isn't suitable for browse data, then we will need a GedcomX model for browse data, and we don't yet have one.

Getting back to the question of RecordField vs. Record Characteristic, there may be some misunderstanding of my previous post. Allow me to try to clarify.

The proposed data life-cycle described earlier is not entirely accurate for gedcomx. Once data is in gedcomx format, it isn't in name-value pairs anymore. That is, we have templates and processes to map field data into a gedcomx record. Field-values are never imported into a gedcomx record as a flat list of Characteristics/Fields, and then subsequently repositioned into the structure.

With every template mapping there exists the possibility (up to now, it has always been the case) that some of the fields are not explicitly mapped into what I would call "structure" (i.e. there is no known semantic relationship between the field and any Persona, Event, or Relationship). They are, in effect, "leftover fields." Regardless of what template is used to create a Record, there exists the possibility that we may later want to apply a new "smarter" template to the Record that maps previously unmapped fields, or changes the mapping of previously mapped fields. The dumbest possible template doesn't map anything explicitly. The smartest possible template has a perfect place for every field in the Record.

@jeffph wrote that "Once data is in gedcomx format, it isn't in name-value pairs anymore." I believe that is the crux of the question we have been considering. The statement was true for SoRD, but necessitated forcing all name-value [aka fieldLabel-value pairs, aka "fields"] into "Record Characteristics". Characteristics have a "type". The "leftover fields" had no need of a type, so OTHER was chosen. As I mentioned in my previous post, this always seemed like a wart. If we had ever come across a real "Record Characteristic", we would have created a CharacteristicType for it. No such need was ever found.

In the gedcomx Record model, Characteristic is a subclass of Field, but also has a "type". If the type is not needed for "leftover Fields", why make them Characteristics? Why not just leave them as Fields?

Currently in gedcomx, we have RecordField, which is also a subclass of Field, and also has a type. This thread was started with the observation that RecordField has the same "structure" as Characteristic i.e. it is a subclass of Field with an added "type". In my previous post I tried to argue that the "type" on RecordField ought to be a "datatype", not a CharacteristicType. As far as I can tell, no one has argued that we need "CharacteristicType" for the fields of a Record. But, it seems that Jeff and Doug have taken exception to the reasons I gave for why a datatype would be useful on the Fields of a Record. They seem to be saying that name parts will always go on a Persona, and date and place parts will always go on an Event. Even if this is true, I believe there are lots of reasons why a datatype will be useful for the Fields of a Record. Here are a couple more:

1) aids to user input (numbers can be validated, controlled vocabularies can be a drop-down list, etc.)
2) for search optimization (numbers can be stored as numbers)
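Both uses can be sketched with one helper that interprets a captured value according to its datatype. The datatype names (`number`, `vocabulary`) and the sample vocabulary are assumptions made for the example only.

```python
def normalize(value, datatype, vocabulary=None):
    """Return (normalized_value, ok) for a captured value.

    On failure the original text is kept and ok is False, so nothing
    captured from the image is lost.
    """
    text = value.strip()
    if datatype == "number":
        try:
            return int(text), True      # stored as a number for search
        except ValueError:
            return text, False          # e.g. alpha data in a numeric column
    if datatype == "vocabulary":
        allowed = {term.lower(): term for term in (vocabulary or [])}
        match = allowed.get(text.lower())
        return (match, True) if match is not None else (text, False)
    return text, True                   # plain strings are always accepted


# Hypothetical controlled vocabulary that could back a drop-down list.
LINEAGE_TYPES = ["biological", "adopted"]
```

A value such as `"12a"` fails the numeric check but is still returned as text.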

If others are not persuaded that the datatype on RecordField is useful, then we could remove it. At that point, we may need to consider if we should collapse RecordField down to Field, make Field no longer abstract, and have Record contain a list of simple Fields.

jeffph commented 12 years ago

There's probably a bit more background information that you guys may not be aware of that may provide more context to this subject. In the not too distant future, templates will be created for every indexing project and will be part of the project definition. Indeed, as we speak, a template editor tool is being created for internal users to map every data element captured in an indexing datagrid into what we've been calling the "structure". Project owners won't be able to add a column to the project datagrid without it being in the template. We, then, have the capability to completely avoid Record Fields/Characteristics altogether if we so choose. So, in the future, all new records that we author will only be gedcomx, and we may never see "leftover fields" from our own records.

The concept of "leftover fields" applies when you have one or a small number of templates trying to be all things to all tabular data. We are departing from that notion. With a template editor, we have the ability to easily create (or, more typically, copy and modify) custom templates for every different structure we see, whether it be an FSI project or importing a text file. This gives us the ability to minimize, if not prohibit, leftover fields.

If we ever need Record Fields/Characteristics, I am convinced, though, that the type member is not necessary. As I've looked through the data, it simply isn't that useful and is typically "other" as @carpentermp mentioned. This, and the fact that we can effectively avoid leftover fields are the reasons why I half-heartedly floated the idea in my earlier post of even removing Record Fields/Characteristics from the model.

As for adding data-type to Record Fields/Characteristics, we're significantly minimizing the usage of these to begin with. If, for some reason though, we need the user to enter a Record Field/Characteristic, more than likely it is a very custom field (i.e. one that isn't a date, place, name, etc.; otherwise it would already be in the structure). It would be difficult to 1) conceive of a data type at the record level that isn't already defined in the structure (even supposedly numeric data can't be enforced to a numeric type because we inevitably have alpha data on the image for these fields), and 2) conceive of a type at the record level that isn't already in the structure that we could provide generic help for. The same principle holds true for search optimization. That is, a data-type that isn't already in the structure probably can't be that strong of a type, and therefore, we typically can't optimize anything for it.

FWIW, our template object has a place for custom help information for each cell, regardless of whether it's in the structure or not. This is often needed for each individual collection because, for example, the input instructions for birthdate in one collection may differ from another.

carpentermp commented 12 years ago

You seem to be saying, "there won't be any 'leftover fields.'" We must be misunderstanding each other. What I mean by "leftover fields" are fields that have no relationship to any Persona/Event/Relationship and so end up as Fields on the Record. For example, "page number" is a very common field. "page number" doesn't have anything to do with a Persona, Event, or Relationship. It is a Record thing, and so ends up in the list of Fields on the Record. To my knowledge, every Record we have ever produced through the pipeline has fields like these, and I would expect we would still have Records like this even with custom templates created at the time of indexing.

It seems that you are no longer suggesting that we ought to combine RecordField and Characteristic? I don't think you are suggesting that we ought to remove the list of Fields on a Record? Are you just saying we can remove the "type" (datatype) from RecordField? That is a possibility, but I would like to wait until we have our "browse data" discussions before doing this.

jeffph commented 12 years ago

Well, I suppose I'm exploring different solution options. :)

Yes, I'm beginning to consider more seriously if there are ever any legitimate cases for a Record to have its own list of Fields/Characteristics, assuming the record doesn't need to be "initialized" with a list of fields prior to putting data into the structure. The page number example seems to be more bibliographic and perhaps should be in metadata, no? Otherwise, how do we know what bibliographic information to put in metadata vs. fields/characteristics on the record? The three categories I mentioned in my earlier post seem to cover all record-level data that wouldn't fit into the structure, and none of these seem like they should be on the record itself, in my opinion.

If, however, I'm missing something and "page number" is common and should be on the record, then it seems more consistent with the rest of the record model to have a CharacteristicType of "page number", and specify a RecordCharacteristic, similar to how we would handle this for Persona and Relationship.

carpentermp commented 12 years ago

With respect to "page number" being metadata, my rule of thumb would be that anything we have users capture directly from the image would need to be "data", not "metadata" since that gives us the ability to capture an "image rectangle" for where it came from, produce a "normalized value", ascribe an attribution, allow for user corrections, etc. Some of that stuff might end up in metadata as well (e.g. a primary event date and place might end up on the record coverage), but would need to be captured as data initially so that we get all features that we are building into regular data.
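The facilities listed above (image rectangle, normalized value, attribution, user corrections) might look roughly like this on a captured value. The class and attribute names are hypothetical and do not reflect the actual GEDCOM X Field definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CapturedField:
    label: str
    original: str                                  # text as read from the image
    normalized: Optional[str] = None               # normalized value, if any
    image_rect: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height
    attribution: Optional[str] = None              # who captured the value
    corrections: List[Tuple[str, str]] = field(default_factory=list)

    def correct(self, new_value, contributor):
        """Record a user correction without discarding the original."""
        self.corrections.append((new_value, contributor))

    def current_value(self):
        """Latest corrected value, falling back to the original text."""
        return self.corrections[-1][0] if self.corrections else self.original


page = CapturedField("page number", "l2", image_rect=(40, 12, 60, 18),
                     attribution="indexer-1")
page.correct("12", "reviewer-7")
```

A plain metadata entry would have no obvious place for the rectangle or the correction history, which is the point being made about capturing "page number" as data first.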

jeffph commented 12 years ago

I see your point, but that can be a slippery slope since an image could potentially (although not often) have quite a bit of bibliographic information on it (volume, author, publisher, etc.). Seems a bit redundant to have these in metadata and as record fields/characteristics.

If this is really how we'd like to proceed, then I believe all the bibliographic stuff should also be CharacteristicTypes that are used for RecordCharacteristics.

Without getting too far off subject, maybe we're missing a metadata-indexing use case or requirement that needs to be fleshed out? We do have title page images that could play into this.

stoicflame commented 12 years ago

that can be a slippery slope since an image could potentially (although not often) have quite a bit of bibliographic information on it (volume, author, publisher, etc.).

So isn't that a good reason to keep record fields separate from characteristics since they'd be allowed to adapt to a broader, looser set of requirements?

Seems a bit redundant to have these in metadata and as record fields/characteristics.

There's definitely redundancy there, but there's a good reason for it because it follows a well-established pattern whereby fields are used to capture data as it appears on a record, and those fields are used as the units for building out or compiling additional context and utility such as metadata and conclusions.

jeffph commented 12 years ago

So isn't that a good reason to keep record fields separate from characteristics since they'd be allowed to adapt to a broader, looser set of requirements?

No, the Record Characteristics don't have broader requirements themselves. They're still Characteristics of the Record regardless of how/if they're duplicated in metadata.

fields are used to capture data as it appears on a record, and those fields are used as the units for building out or compiling additional context and utility such as metadata and conclusions.

This statement is also true for all subclasses of Field, including Characteristic, Date, DatePart, Place, PlacePart, etc. I don't see any conflict here.

If RecordCharacteristics only have data that doesn't fit into the structure (and doesn't include data that is to be relocated into the structure later), then why do they need to behave any differently than the other Characteristics? Why wouldn't we want this consistency and simplicity (exemplified below) in the model?

CharacteristicType.Person.Household
CharacteristicType.Person.GedcomUUID
CharacteristicType.Person.UniversalId
etc...

and

CharacteristicType.Record.PageNumber
CharacteristicType.Record.Publisher
CharacteristicType.Record.Volume
etc...
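One way to picture this proposal is a single enumeration whose values are namespaced by entity. This is only a sketch of the idea: the member names follow the examples above, while the `entity` property and the `types_for` helper are hypothetical.

```python
from enum import Enum


class CharacteristicType(Enum):
    PERSON_HOUSEHOLD = "Person.Household"
    PERSON_GEDCOM_UUID = "Person.GedcomUUID"
    PERSON_UNIVERSAL_ID = "Person.UniversalId"
    RECORD_PAGE_NUMBER = "Record.PageNumber"
    RECORD_PUBLISHER = "Record.Publisher"
    RECORD_VOLUME = "Record.Volume"

    @property
    def entity(self):
        """The entity the type applies to: the part before the first dot."""
        return self.value.split(".", 1)[0]


def types_for(entity):
    """All characteristic types defined for one entity, in declaration order."""
    return [t for t in CharacteristicType if t.entity == entity]
```

Record-level types then sit beside Person-level types and are handled by the same code path.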

carpentermp commented 12 years ago

In order to keep clarity about where the discussion stands, Jeff, have you acceded to the idea that whatever we capture directly from the image is captured as a subclass of Field? If not, perhaps it would help if you got more explicit about what you propose. It might help to take an example collection that has data from images that you wouldn't want captured this way, and to address the concern that we will probably have about captured data that doesn't have the facilities that Field provides.

As far as I can tell, the discussion so far has brought to light two additional sticking points:

Explicit datatypes or not.
Fields vs. Characteristics on Record.

Let me add something that occurred to me with respect to the first question. I have been proposing that "RecordField" have a datatype to aid producers and consumers of the data to know how to treat it. It occurs to me that an explicit datatype could potentially be useful for Characteristic values as well. For example, the "count of children" Characteristic is a number. We have other Characteristic types where we would like to restrict the values to a controlled vocabulary e.g. "lineage type" or "race". It could be argued that the datatype is implied by the Characteristic type. It is true that standard CharacteristicTypes could be understood in this fashion, but what about Characteristic types that a client doesn't (yet) understand? If we ended up following Jeff's suggestion that we try to put meaningful CharacteristicTypes everywhere, then we will be adding types with regularity so that any given client is likely to encounter types it knows nothing about. A datatype on these could aid in the consumption and presentation of this data.

If we agreed to add an optional datatype to Characteristic (the default being String), then Characteristic and RecordField are different only in the presence or absence of the CharacteristicType and we can consider point 2 from the standpoint of this difference alone. By giving Record a list of Characteristics, rather than a list of Fields, we require a CharacteristicType for every captured value. This is what we did in SoRD, so we have experience with it. In practice, the type was always "OTHER", and we never found a need for anything else. I am not against the idea of Record Characteristics with types, if it can be shown that we have a legitimate use for specific types. Forcing the choosing of a type still seems like a shoehorn for captured values that don't have a type we care about. You can always default to OTHER, but why? I have some uneasiness with Jeff's suggestion that we actually try to give meaningful types to everything, e.g. PageNumber, Publisher, Volume, etc. This would seem to lead us down a path where we end up creating potentially hundreds of Record CharacteristicTypes since there tends to be great variability from collection to collection. We might also find ourselves trying to decide if "document number" is the same thing as "certificate number", or "file number" or whatever. As long as we have no real use for these types, creating, categorizing, and deduping them would all seem to be wasted effort.
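The difference under discussion can be reduced to a few lines. In this sketch the optional datatype sits on a shared base class purely for brevity, and all names are illustrative rather than taken from the gedcomx code.

```python
from dataclasses import dataclass


@dataclass
class Field:
    label: str
    value: str
    datatype: str = "string"   # optional datatype, defaulting to String


@dataclass
class Characteristic(Field):
    characteristic_type: str = "OTHER"   # a type is always required here


def as_characteristic(captured):
    """Force a plain Field into a Characteristic, as SoRD effectively did.

    With no meaningful type available, the only choice is OTHER.
    """
    return Characteristic(captured.label, captured.value, captured.datatype)


doc_number = Field("document number", "4471", datatype="number")
forced = as_characteristic(doc_number)
```

If Record holds Fields, `doc_number` is stored as-is; if Record holds Characteristics, every captured value needs a `characteristic_type`, and `forced` shows it ending up as `OTHER`.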

stoicflame commented 12 years ago

For the record, I'm open to adding a "datatype" property to either characteristic/field/whatever, but I'd really like more proof of an immediate practical use case that would leverage such an attribute. I can understand the theoretical need for such a property, but I'm getting stuck on the practical implications of adding it to the model. Take, for example, the following characteristic sample:

<characteristic>
  <type>count of children</type>
  <original>four</original>
  <interpreted>4</interpreted>
</characteristic>

Assuming the application of a datatype property would happen as an attribute, where should that attribute go in this case? On the characteristic element? On the original element? On the interpreted element? On all of the above? And for each case, what is the datatype?

Thanks in advance for your comments.

stoicflame commented 12 years ago

I also would like @carpentermp to clarify a bit when you say that in practice, the type of record-level characteristic was always OTHER. In SoRD, there were characteristic types defined for things like "batch number", "page", etc. Are you saying that those were never used? Is that because the field name ("label" in GEDCOM X) was just used instead?

Thanks in advance.

carpentermp commented 12 years ago

I also would like @carpentermp to clarify a bit when you say that in practice, the type of record-level characteristic was always OTHER. In SoRD, there were characteristic types defined for things like "batch number", "page", etc. Are you saying that those were never used? Is that because the field name ("label" in GEDCOM X) was just used instead?

You hit upon the one field for which we do something special in our search system--batch number. (We don't do anything with page number or any of the others). This is a legacy field that users have been writing down for many years. In general, they use it to limit their search results because, in practice, batches were generally small groups of records from the same time and place. They did this because historically we have done a really poor job in our collections of preserving the original groupings and ordering of records. Batch number was how users managed to get around this shortcoming in our archival practices. If we had a collection model that allowed records to be ordered and grouped into hierarchies of arbitrary depth, then batch number would lose its usefulness.

To fully answer your question, in Records we identify the batch number Characteristic by fieldId (field label in GedcomX). I suppose it could be argued that this shows a valid use for Record Characteristic. It does, but it is a use that is specific to our operations, rather than to the community in general. Still, it could be argued that to provide for implementation-specific usages of this kind, we ought to allow for Record Characteristics. I could probably be persuaded to this view, but I might want to explore having both Fields and Characteristics on Record so you don't have to force everything into Characteristic.

Now you might ask, why didn't we use CharacteristicType.BATCH_NUM? If I had remembered BATCH_NUM when I was writing the Infobahn template and the Oreo template, I would have mapped it; it was just an oversight. The reason CharacteristicType.BATCH_NUM is in SoRD is that it was used in CP. CP migrated a bunch of data from ODM where that information was added to extracted persons. Also, BATCH_NUM was used to some benefit in the CP matching algorithm. If you are curious to know how, ask Randy Wilson.

carpentermp commented 12 years ago

For the record, I'm open to adding a "datatype" property to either characteristic/field/whatever, but I'd really like more proof of an immediate practical use case that would leverage such an attribute.

More reflection upon this subject has me wondering whether even adding a "datatype" will be sufficient. If you look at Gender, I think it illustrates something we are missing from the model. Gender is a Field, so it has a "text" value, but it also has an enum which indicates what the text "means". E.g. the text "male" means GenderType.Male, and in a Spanish collection the text value of "m" may be short for "mujer", which means GenderType.Female. It is important to preserve what the original document said so that users can verify that we have made the proper interpretation from "what it says" to "what it means".

It occurs to me that we have some characteristics that would work better if they were more like Gender, e.g. LineageType and Race are two that come to mind. When the document has a field, we need to capture what the field says, but we also need a way of expressing what it means. When the "Race" field of a Spanish document says "blanco", what does it mean? It means RaceType.White. We currently have no way of communicating this. A normalized string may be fine for humans to look at and know what is meant, but in data exchange from computer system to computer system we have not fully closed the loop of semantic meaning. This makes me wonder if Characteristic ought to have a URI in addition to a text value. The URI would be used in cases where the CharacteristicType is of a type where the normalized values are mapped to an ENUM. Or perhaps we need ordinary Characteristics, and EnumCharacteristics that behave the way I just described.
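To make the "what it says" versus "what it means" distinction concrete, here is a minimal Python sketch. The class name `InterpretedField`, the value URIs, and the per-language lookup table are all hypothetical, invented for illustration only; none of them is part of the GEDCOM X model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value URIs; the real vocabulary is not defined in this thread.
MALE = "http://example.org/GenderType/Male"
FEMALE = "http://example.org/GenderType/Female"

# Hypothetical per-language lookup: the same text can mean different things
# depending on the language of the collection ("m" is the key example).
GENDER_BY_LANGUAGE = {
    "en": {"male": MALE, "m": MALE, "female": FEMALE, "f": FEMALE},
    "es": {"hombre": MALE, "h": MALE, "mujer": FEMALE, "m": FEMALE},
}


@dataclass(frozen=True)
class InterpretedField:
    original: str            # what the document says, preserved verbatim
    meaning: Optional[str]   # URI for what it means, or None if unknown


def interpret_gender(text: str, language: str) -> InterpretedField:
    """Keep the original text and attach a value URI when one is known."""
    table = GENDER_BY_LANGUAGE.get(language, {})
    return InterpretedField(original=text, meaning=table.get(text.strip().lower()))
```

Because the original text is never overwritten, a user can still check the interpretation against the document, and an unrecognized value simply carries no URI.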

Here is another related problem. Some characteristic values would benefit from being standardized across all collections (e.g. race, lineage type), but others have a constrained set of values unique to a given collection. For example, The "Texas Deaths" collection has a "county" field that indicates which county the death was recorded in. The constrained values are the set of counties that existed during the period of time covering the collection. We probably need a way of expressing this when communicating collection information. This would be useful during record extraction, and when users are doing a "collection-specific" search on the collection. If this information were available, users could be presented with a drop-down list of choices, rather than a simple text field.

Finally, the last issue deals with my original suggestion that we have a "datatype" for characteristics. Suppose we came out with version 1.0 of GedcomX and we had no notion of CharacteristicType.Race. Suppose someone implements a client that supports v1.0. Suppose we then add CharacteristicType.Race to the model along with an enum of RaceTypes. The 1.0 client has no idea how to interpret Characteristics of type Race. It has no idea which URI values make sense as possible values of the Characteristic, so it can't present the possible values to a user.

Perhaps our EnumCharacteristic (if we had one) should have a URI for the "URI-type" (e.g. RaceType) and a URI for the value. Perhaps we could establish a convention that the "URI-type" URI would be a working URI that returns a list of the known URI values. This isn't very well baked, I know; I am just brainstorming a little bit to see if it can spark a discussion about these issues.
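A rough Python sketch of that brainstorm, under stated assumptions: `EnumCharacteristic`, `known_values`, and the `example.org` URIs are hypothetical, and an in-memory dictionary stands in for dereferencing the "URI-type" over HTTP.

```python
from dataclasses import dataclass
from typing import Dict, List

# Stand-in for fetching the "URI-type": a hypothetical in-memory table that
# maps a type URI to the value URIs it is known to contain.
VOCABULARY: Dict[str, List[str]] = {
    "http://example.org/RaceType": [
        "http://example.org/RaceType/White",
        "http://example.org/RaceType/Other",
    ],
}


@dataclass(frozen=True)
class EnumCharacteristic:
    text: str        # original text from the document
    type_uri: str    # identifies the vocabulary (the "URI-type")
    value_uri: str   # identifies the interpreted value


def known_values(type_uri: str) -> List[str]:
    """Return the values listed for a type, or an empty list if unknown."""
    return list(VOCABULARY.get(type_uri, []))


def is_valid(c: EnumCharacteristic) -> bool:
    """True when the value URI appears in the list published for its type."""
    return c.value_uri in known_values(c.type_uri)
```

A client that predates a given type could call something like `known_values` to populate a list of choices without having the enum compiled in, and fall back to showing the original text when the lookup returns nothing.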

jeffph commented 12 years ago

While I don't see any value in defining a data-type as numeric or something similar (there are always legitimate rule-breakers for these stronger types), I do agree we need more support for extensible controlled vocabularies.

So, to add to the brainstorming, we may already have support for this in our existing model. Currently, our known type URIs look like this: "http://gedcomx.org/Country". Couldn't our URIs for types be more qualified, like this: "http://gedcomx.org/PlacePartType/Country"?

This approach seems to have the following advantages:

1) We could now have some light validation and/or warnings when someone is putting the wrong KnownType in the wrong part. For example, we could notify a user/system when they're using setType to put a DatePartType into a PlacePart.
2) We can now extend CharacteristicTypes by specifying the URI-type as something like this: http://gedcomx.org/Race/Caucasian.
3) We can scope the URI-type to a specific collection, if necessary, by adding a collection ID qualifier to the URI.
4) The scope of the values is clearer to the consumer.
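The light validation in point 1 might look something like the following Python sketch. It assumes the qualified URI layout proposed above (type family as the first path segment); the `PlacePart` class, `set_type` method, and `type_family` helper are hypothetical stand-ins, not the actual GEDCOM X API.

```python
from urllib.parse import urlparse


def type_family(uri: str) -> str:
    """Return the first path segment of a qualified type URI."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    if len(segments) < 2:
        # An unqualified URI such as .../Country has no family segment.
        raise ValueError(f"not a qualified type URI: {uri}")
    return segments[0]


class PlacePart:
    """Hypothetical model class that warns on a mismatched type family."""

    EXPECTED_FAMILY = "PlacePartType"

    def __init__(self) -> None:
        self.type = None
        self.warnings = []

    def set_type(self, uri: str) -> None:
        family = type_family(uri)
        if family != self.EXPECTED_FAMILY:
            # Warn rather than reject, in keeping with "light" validation.
            self.warnings.append(f"expected {self.EXPECTED_FAMILY}, got {family}")
        self.type = uri
```

The sketch records a warning instead of raising on a mismatch, so unusual but legitimate data still gets through.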

Seems like an endpoint as @carpentermp suggests would work for fixed lists. For the more prevalent cases of an initial list that is shared and extensible by the user (think combo-boxes), one could either derive all known values from a collection or start with a global list from the endpoint. Either approach could work.

stoicflame commented 12 years ago

This is some great discussion, guys.

I'm getting a bit buried in the theory here, and this particular issue seems to be snowballing into three distinct issues. I think what I'm going to do is open up two other issues for (1) datatype attribute on characteristic and (2) extensible type vocabulary (on which I have some thoughts to share) and keep this one open for discussing collapsing characteristic and field.

So can we limit discussion on this thread to just whether we should delete RecordField and use Characteristic? Can I get a synopsis of @carpentermp's current position on this?

carpentermp commented 12 years ago

I am still in favor of having a list of Fields on Record. I am not in favor of changing the list of Fields to a list of Characteristics because of the "CharacteristicType" issue. If we did this, it seems to me that we have a couple of ways to go:

1) Try to give meaningful types to Record Characteristics.
2) Generally set the types to CharacteristicType.OTHER.

Option 2 still seems like a shoehorn. Option 1 has the problems I cited earlier:

This would seem to lead us down a path where we end up creating potentially hundreds of Record CharacteristicTypes since there tends to be great variability from collection to collection. We might also find ourselves trying to decide if "document number" is the same thing as "page number", "certificate number", "file number" or whatever. As long as we have no real use for these types, creating, categorizing, and deduping them would all seem to be wasted effort.

So my choice would be to continue with Record Fields. If, at some point, we discover legitimate cases of Record Characteristics with types we care about, we can add a list of Characteristics to the Record model at that point.

jeffph commented 12 years ago

I'm still confused. RecordField has a FieldType. How is this any different than using a RecordCharacteristic with a CharacteristicType?

carpentermp commented 12 years ago

As I wrote in my first post, I always thought that the fieldType was a datatype, not a characteristic type. If it is not a datatype, then I get where you are coming from. I believe the question about the usefulness of "datatype" has been moved to a different issue, so this issue would seem to be just about the question of Characteristics vs. Fields on Record (type vs. no type).

jeffph commented 12 years ago

Currently, it doesn't appear to be a datatype since its values are Household and BatchNumber.

In addition to these being RecordCharacteristics, we'll also need to add all types related to Record-level bibliographic information, such as PageNumber. Otherwise, how else can these values be copied or pulled into the metadata for full bibliographic support? Seems like our metadata implementation will need to be aware of some of these specific types.

carpentermp commented 12 years ago

Yes, I hadn't noticed the "FieldType" class before. I can see why you suggested that RecordField be renamed to Characteristic--that's exactly what it is right now.

As you pointed out, currently there are 2 types in FieldType: "household" and "batch number". In SoRD, it was part of the template to say which field was to be used as the "household" field. I suppose it would work to have a household characteristic instead. The only time it wouldn't work is if the field was needed in the "structure" of the Record, but for "household" that is hard to imagine.

We have other cases where we tell the template which fields to use for different things, specifically RELATIONSHIP_TO_HEAD and EVENT_TYPE [2,3,4]. RELATIONSHIP_TO_HEAD is used to determine relationships when stitching census households and EVENT_TYPE[2,3,4] are used when the template has to deal with generic events where the type is passed in as part of the data. If we applied this same pattern for these, we would need CharacteristicType.RelationshipToHead and CharacteristicType.EventType (weird name).

As I mentioned in an earlier post, I think BATCH_NUM is specific to our processes and so the URI that defines the type should probably be in a different namespace.

It looks like we have some valid use cases for Record Characteristics. However, I am not sure it would be useful to model "page number" as a known characteristic type. For bibliographic citations, I believe we will probably want something like a "citation template" which identifies fields in the collection and how they are put together to form a proper citation. These would probably be identified via field label rather than by characteristic type. There is probably going to be a lot of variability from collection to collection in what goes in and how it looks, and it seems to me that a template would embody all that variability pretty well.
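As a loose illustration of the "citation template" idea, here is a Python sketch in which a per-collection template refers to fields by label. The function name, the template syntax (Python's `string.Template` placeholders), and the sample labels are assumptions made for this example; it also assumes field labels are identifier-like, with no spaces.

```python
import string
from typing import Dict


def render_citation(template: str, fields: Dict[str, str]) -> str:
    """Fill a per-collection citation template from fields keyed by label.

    Placeholders whose label is missing from `fields` are left in place,
    so gaps stay visible instead of raising an error.
    """
    return string.Template(template).safe_substitute(fields)


# Hypothetical template for one collection; another collection could use
# entirely different labels and layout.
EXAMPLE_TEMPLATE = "$collection, certificate $certificate_number, p. $page"
```

For example, `render_citation(EXAMPLE_TEMPLATE, {"collection": "Texas Deaths", "certificate_number": "1234", "page": "12"})` would give `Texas Deaths, certificate 1234, p. 12`, with no page-number characteristic type involved.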

I am still uncomfortable with the idea of creating a lot of Record Characteristic types that we don't need because of the reasons I have already cited. My preference would be that Record has a list of Fields AND a list of Characteristics.

jeffph commented 12 years ago

I can understand the potential for seemingly endless lists of CharacteristicTypes for "pageNumber", "certificateNumber", etc. We also have this risk for other types in our model; PlacePartTypes (there are hundreds of these) and Person[a]Characteristics come to mind. It will take careful management to make sure these don't become unwieldy.

At this point, I suggest we go ahead and merge RecordField and Characteristic and open up a separate issue for adding type-less Fields to Record.

stoicflame commented 12 years ago

Apologies for the delay in this; I'm at JavaOne, making my bandwidth limited.

Please note the proposal at #88, which picks up on the conversation about data types and "extensible controlled vocabularies".

Also note issue #87, which is designed to address how clients "learn about" additions to "extensible controlled vocabularies" that come in later versions.

stoicflame commented 12 years ago

Please see #99 as the proposal for putting this issue to rest.

stoicflame commented 12 years ago

Closed with the application at #99.