justify record/field terminology

stoicflame commented 13 years ago

The following commentary was submitted by @ttwetmore, and I'm opening it up for discussion, my comments to follow:

The GEDCOM X record model has an object called Record. This is an unfortunate name because record is such an overloaded term. Yet the GEDCOM X Record record is obviously well-named, in the sense that it is the digital representation of items of genealogical evidence extracted from physical sources, and in many cases these physical items of genealogical evidence are naturally called records. So the term makes a lot of sense while at the same time engendering a lot of confusion. The Record element/object/record is by now probably very entrenched in the GEDCOM X model and thinking, so changing it might be difficult. However, if any such change were contemplated, I would suggest the name Evidence as a good replacement.

The DeadEnds model doesn’t currently have an object corresponding to the GEDCOM X Record. I’ve gone back and forth on it. Like the GEDCOM X model the DeadEnds model does have Persons, Events, Roles, and Relationships, and when these objects are used at the “record model level” they refer directly to the sources they were extracted from, rather than having a Record object as a “middle man” between them and their sources. I have to admit, however, that for a service that provides evidence records, the concept of a Record object is very useful and nigh on mandatory.
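As an illustration of the two shapes being contrasted here, the following is a minimal sketch only. The class and field names are hypothetical and are not taken from the GEDCOM X or DeadEnds specifications, and the citation string is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """A physical source, reduced to a citation string for this sketch."""
    citation: str


@dataclass
class Person:
    """An extracted person. `source` is set only in the direct-reference shape."""
    name: str
    source: Optional[Source] = None


@dataclass
class Record:
    """Hypothetical 'middle man': groups everything extracted from one source."""
    source: Source
    persons: List[Person] = field(default_factory=list)


census = Source("made-up census page citation")

# Shape 1: each extracted object refers directly to its source.
direct = Person("John Smith", source=census)

# Shape 2: a Record holds the source once and contains the extracted objects.
record = Record(source=census, persons=[Person("John Smith"), Person("Mary Smith")])
```

In the second shape a service can hand out one Record per physical item, which is roughly why the middle man is attractive for a service that provides evidence records.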

This same distinction, that is, whether a Record object is required in the model to collect and contain the references to the objects extracted from a physical record, has been discussed on the Better GEDCOM project, with some members preferring having the Record record (generally called an evidence record in the Better GEDCOM context) and some not. I have been on the against side of the issue, but thinking about the needs of services providing records over the web I am softening.

Well, if calling a particular kind of record a Record record is confusing, GEDCOM X takes the confusion further by calling specific kinds of attributes Fields. The term field is as overloaded as record. The term Field in GEDCOM X seems to be limited to the idea that many records take the form of a form (overloading another term), and the GEDCOM X model views the contents of a form as being fields made up of key and value pairs. Well, it makes sense, but is confusing.

You have to admit that using a model that contains a Record class and a Field class, neither being used in its conventional computer-jargon manner, is a rather risky thing to do. I try to use a single Attribute concept for all these ideas, where an attribute is a key/value pair. (There is a topic devoted to Attributes below.) In some contexts the keys are drawn from specific sets of strings with special meaning (because they appear in those contexts so often, or bear such important information about the object they are attributes of), while in other contexts it is best to allow the keys to be assigned based on the properties of the records themselves. This second area covers the GEDCOM X Field concept quite well, I believe.
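A minimal sketch of the single-Attribute idea, assuming a small controlled vocabulary for the first kind of key and free-form keys for the second. The vocabulary, the class name, and the sample values are all hypothetical; neither GEDCOM X nor DeadEnds defines them this way.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: keys with a fixed, special meaning.
CONTROLLED_KEYS = {"name", "birth_date", "birth_place"}


@dataclass(frozen=True)
class Attribute:
    """A key/value pair attached to some object in the model."""
    key: str
    value: str

    @property
    def is_controlled(self) -> bool:
        # True when the key comes from the fixed vocabulary above.
        return self.key in CONTROLLED_KEYS


attrs = [
    Attribute("name", "John Smith"),              # key from the controlled set
    Attribute("Column 7: Occupation", "Cooper"),  # key taken from the form itself
]

controlled = [a.key for a in attrs if a.is_controlled]
free_form = [a.key for a in attrs if not a.is_controlled]
```

The free-form case is the one that lines up with what GEDCOM X calls a Field: the key is whatever label the source form happens to use.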

stoicflame commented 13 years ago

Enhancing my comments on issue #72, I think there might be a misunderstanding of what the record model is designed to do. The GEDCOM X record model is designed to be another type of evidence that is a peer to other types of "non-conclusionary" evidence such as images, web pages, and physical artifacts. It's designed to model indexed record data. It's designed to be the output of indexing generic record data. It's not designed to be the "leaves" of the n-tiered model; it's designed to be cited as (another type of) evidence for those leaves.

The record model is not designed to model generic genealogical evidence, it's designed to be a unique type of evidence. So to use the word "evidence" to describe the record model would be inaccurate and misleading.

I think "Record" and "Field" are not "unfortunate" or "confusing" names at all. "Record" is the generic term used to define a set of data that was extracted directly from a generic record. There are many different types of records (census, birth, probate, military), and the "record" was designed to be flexible enough to model all of them. "Field" is a generic term that is used to define a "piece" of that record, such as a bounding box for an image. Yes, "field" is very much akin to a field of a form. This seems pretty accurate to me, since indexers are usually presented a UI that has a set of "fields" that they are to index. I'm not sure what's confusing about this; developers and users alike should understand "field" as such.

ttwetmore commented 13 years ago

I don't think there is confusion about what GEDCOM X calls records. I would define the GEDCOM X record as digitally extracted information from a physical item of genealogically significant evidence, and many of those physical items of evidence in the genealogical context are called records. Where you call the process of converting from a physical artifact to a digital record the "indexing" process, I call it the "extraction" process. For me "indexing" implies you are only digitizing enough of the physical artifact to be able to search for it or summarize it in some sense; for me the "extraction" process implies you are digitizing everything you possibly can from the artifact that fits into a full model of genealogical data. Since the GEDCOM X indexing process does squeeze every Persona, every Event, every Relationship, and so on, that it can from the physical artifact, no matter whether you call it indexing or extracting you are getting all the juice you can from it. Which I think is fabulous and marvelous.

My only point was that the words "record" and "field" are highly overloaded in the computer field. When one hears the term "record" one usually assumes a record in a computer database, and when one hears the word "field" one usually assumes a subpart of a computer record or a component of a structured data type. So my point was only that there are some possible confusions one must anticipate with those two terms, not any disagreement about the concepts at all.

(And of course, by digitizing I only mean converting the information on a physical artifact into a form of structured data that adheres to a data model and can be processed by algorithms.)

stoicflame commented 13 years ago

I would define the GEDCOM X record as digitally extracted information from a physical item of genealogically significant evidence, and many of those physical items of evidence in the genealogical context are called records.

That sounds like a pretty accurate definition to me, too.

For me "indexing" implies you are only digitizing enough of the physical artifact to be able to search for it or summarize it in some sense; for me the "extraction" process implies you are digitizing everything you possibly can from the artifact that fits into a full model of genealogical data.

Fair enough. I can use the term "extraction".

My only point was that the words "record" and "field" are highly overloaded in the computer field.

Maybe so. But as a software engineer, I don't know when the last time I used "record" was. What is a "record" in the computer field, anyway? I have used "field" to describe fields on a form or fields on a data type, but it's always used in context.

ttwetmore commented 13 years ago

We are starting to beat the dead horse into the ground here, especially as I am content enough with the names Record and Field to be used for actual object classes in the GEDCOM X Record model.

But I am surprised that you don't understand my reasons for bringing up this issue, and especially that you don't see the confusing overloading caused by GEDCOM X's use of the Record and Field objects.

The word Record has come to mean any self-contained data structure in almost any context. Certainly all rows in relational database tables are conventionally called records. All objects stored in hierarchical and network databases are called records. The top level elements in many XML files are called records, generally because once that XML file is read by an application, those top level elements will be stored in a database as records.

You are making me feel very old!!!

stoicflame commented 13 years ago

Certainly all rows in relational database tables are conventionally called records.

Fair enough.

All objects stored in hierarchical and network databases are called records.

Okay.

The top level elements in many XML files are called records, generally because once that XML file is read by an application, those top level elements will be stored in a database as records.

Hmm... maybe not so much in this case. I've never heard the word "record" used to refer to a top-level XML element. It's always just "element" or "root element" in my experience.

You are making me feel very old!!!

Oh no! I certainly hope not. I have no idea how old you are. All I know is that you've got a fantastic talent for articulating genealogical data models and your input is extremely valuable.

FamilySearch / gedcomx

justify record/field terminology #78