Distinguishing Between Original and Derived Data

thomast73 commented 12 years ago

The original gedcomx Record is intended to be able to accurately represent original historical documents. In representing historical documents, it may be desirable to represent empty fields as well as populated fields. The Record object can generally handle all of this in a reasonable fashion.

As the data is prepared to be published, it is common to derive values from the original values and store these derived values in the Record. An example is a treatment that derives a birth year based on an age. The derived value becomes a part of the record and does not need to be recalculated going forward. But when we add the derived field -- a field that did not exist in the "original" historical document -- subsequent processors are forced to conclude that the original document always included the derived field when in fact it never did. Downstream processors have no conclusive way to distinguish between fields that were original and fields that were derived.

Processors that operate on Records will need to distinguish derived data from original data. This becomes particularly important when data is re-treated. Treatments that are designed to operate on "original" values will want to either ignore derived data, or write/update their derived values. They will not want to treat derived values at all. So processors will need a conclusive way to distinguish an original value from a derived value.

stoicflame commented 12 years ago

Yes. This has been supported from the beginning. The Field class supports original for the way it looks on the record and normalized which is designed to be used for "derived" values. It also supports interpreted which is designed to be where a user-interpreted value is put (e.g. if "Con." is the original as defined by the record, the interpreted value could be "Connecticut").
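For illustration, the three value slots described above could be sketched as follows. This is only a sketch in Python; the `Field` dataclass and its attribute names are assumptions modeled on this comment, not the actual gedcomx classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """Hypothetical stand-in for the Field class discussed in this thread."""
    original: Optional[str] = None     # the text as it looks on the record
    interpreted: Optional[str] = None  # what a user reads the original to mean
    normalized: Optional[str] = None   # standardized or "derived" value


# The "Con." example from the comment above.
place = Field(original="Con.", interpreted="Connecticut")
print(place)
```

Nothing in this structure says whether the field as a whole existed on the source document; it only separates the values inside one field.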

dkohlert commented 12 years ago

stoicflame, I think what you are alluding to is that if the original field is blank, but the normalized field is present, then it must be a derived field. What I think thomast73 is saying is that there is no way to distinguish a derived field (a field that did not exist on the original record) from a field that did exist on the original record but whose original value someone removed. In both cases you end up with a field that only has a normalized value and no original value. It would be nice to know that a field was not derived so that when the original value is removed, the entire field can be removed because there is nothing to normalize.
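The ambiguity can be shown with a small sketch. The `Field` dataclass, the two helper functions, and the sample year and age are all hypothetical and exist only to illustrate the point; they are not part of the gedcomx model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    original: Optional[str] = None
    interpreted: Optional[str] = None
    normalized: Optional[str] = None


def derive_birth_year(event_year: int, age: int) -> Field:
    """A treatment that estimates a birth year; no original text exists for it."""
    return Field(normalized=str(event_year - age))


def remove_original(field: Field) -> Field:
    """Simulates someone clearing the original value of a field that was on the record."""
    return Field(original=None,
                 interpreted=field.interpreted,
                 normalized=field.normalized)


derived = derive_birth_year(event_year=1900, age=30)
was_on_record = remove_original(Field(original="1870", normalized="1870"))

# Both fields now carry only a normalized value, so they compare equal and a
# later processor cannot tell which one came from the source document.
print(derived == was_on_record)
```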

thomast73 commented 12 years ago

Within the scope of a "field", we can distinguish an original "field value" from a derived "field value". The "original" value is the original; all other values within the field structure are derived.

Within the scope of a persona / event / relationship / record, we cannot distinguish an original "field" from a derived "field".

Within the scope of a record, we cannot distinguish an original persona / event role / event / relationship / etc. from a derived persona / event role / event / relationship / etc.

I know we have identified many potential use cases where we would derive fields. I can for sure think of a few cases for deriving events. My guess is that there are use cases for deriving many of the other types as well.

It is pretty easy to mark something derived at the time it was derived. It appears to be difficult or impossible to ascertain that these objects were derived after the fact -- depending on the state of the objects and sub objects. So, it seems useful for all objects to be marked "original" or "derived".

stoicflame commented 12 years ago

(Reopening the issue since I didn't understand what was being requested.)

Couldn't an automated process determine whether a field was "original" or "derived" based on the attribution? If a user supplied the empty field, then it was original, no?

thomast73 commented 12 years ago

I am not convinced that it would be safe to assume that objects that have a missing/empty attribution object are "original". That seems implementation dependent. It seems legitimate to have "original" items that have been "attributed".

carpentermp commented 12 years ago

The use case that is driving this issue is still not 100% clear to me. Is this the use case?

During the application of treatments you need to know which fields are "original" and which are "derived" so you can decide if you are able to delete them, or change them?

It seems to me that when treatments were first applied to produce a given "derived field" an attribution must have been created for it. Shouldn't it be possible during a reapplication of treatments to distinguish attributions to previous treatments?

Potentially, another way to tell could be the presence or absence of a "field label". This assumes that when treatments synthesize a field they don't also synthesize a field label for it. Or if the treatments framework wants to synthesize field labels, it could follow a label naming convention to distinguish fields created by the framework (e.g. begin with "_"). You may want to do this anyway to ensure that your field labels don't collide with any labels already on the record. Admittedly, this approach is implementation-specific.
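As a rough sketch of the label convention suggested above (implementation-specific, as noted), a treatments framework could mark and recognize its own fields like this. The function names and the sample labels `AGE` and `BIRTH_YEAR` are made up for the example.

```python
from typing import Optional

FRAMEWORK_PREFIX = "_"  # assumed naming convention, not part of the model


def framework_label(name: str) -> str:
    """Build a label for a field synthesized by the treatments framework."""
    return name if name.startswith(FRAMEWORK_PREFIX) else FRAMEWORK_PREFIX + name


def created_by_framework(label: Optional[str]) -> bool:
    """True if a field has no label at all, or a label following the convention."""
    return label is None or label.startswith(FRAMEWORK_PREFIX)


print(framework_label("BIRTH_YEAR"))
print(created_by_framework("AGE"))
```

The prefix also keeps synthesized labels from colliding with labels that already exist on the record, provided no form label starts with the same character.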

This brings me back to my original question--what are the use cases? If the purpose is just so that the treatment framework can know what the treatment framework has done, then an implementation-specific solution does not seem inappropriate. Are there other use cases that would argue for a model change?

dkohlert commented 12 years ago

I suppose the use case is this: as an external consumer of a record, it would be useful to know if a given field was even on the original record, because one might not want any derived fields.

As you mentioned, we can't play games with the "field label" because that is implementation specific, as are the attributions, unless we want to define a standard attribution for this purpose that is part of the model.

So yes, internally we could do something implementation specific, but I believe as a consumer of a record, especially one coming from a 3rd party source where I may not have access to the image, one would want to know which fields were actually on the record.

thomast73 commented 12 years ago

Along the lines of what dkohlert has said, I have observed conversations among engineering types with an algorithm-research bent that would suggest that these algorithm developers would always like to approach a data set looking at original values only. Their experiments tend to start with questions like "If we start with the original data ...." In fact, they might argue that the record that represents the original ought to remain untouched altogether, and that any derived information ought to be part of a separate record that is designated as a conclusion about the original. If the original pieces that make up a document are preserved in such a way that they can always be identified as such, future algorithms that generate derivative data (normalizations, conclusions, etc.) can always go back to the beginning and work with the data without the encumbrances of previous derivations.

One of the strengths of the new gedcomx record model is supposed to be its ability to model -- in a robust and exact way -- the content of original historic documents. If someone goes through the trouble to model a document well, it would be good if future data consumers could accurately identify those original pieces so that they derive their own conclusions independent of any processing/conclusions that resulted from our own handling of the data.

carpentermp commented 12 years ago

If this is really important, then I think we have a bigger problem than what can be fixed by simply adding a "derived" flag to Field. Much of the Record is inferred and does not come from any "original" value at all. To illustrate, let me give an example, deliberately contrived to exacerbate a situation that is actually very common.

Suppose there is a book called "Church of the Holy Trinity, Santiago, Chile, Marriages of January 1, 1810. " By the book title we come to know that every marriage in the book happened in the same church on the same date. Suppose that each page of the book lists a marriage and that the fields are "groom, groom father, groom mother, bride, bride father, and bride mother." The template for the collection would include an "original" marriage event but neither the date nor the place would have an "original value" that came from the image. All the people would have genders, but none of the genders would have an original, or normalized, value. The relationships are all "original" but they are all inferred by context.

One question that arises is: when constructing the template for this collection, where do we put the marriage date and place? In "original", "interpreted", or "normalized"? None of these seem entirely satisfactory. "original" is usually for stuff coming from the image. "interpreted" usually implies a user interpreting the meaning of an "original value" and "normalized" is generally meant to be the normalization of an "original" or "interpreted" value. If we defined "original" to mean "a value coming from the source", whether explicit or implied, then "original" would seem like the place to put values of this kind.

Assuming we did this, it brings to light a further problem. To illustrate it, let's take another contrived collection, "Married people of Dullsville, Tennessee, 1900." The collection is nothing more than a list of names of residents of Dullsville TN, known to be married when the survey was taken in 1900. I chose this collection to illustrate a problem with "NormalizedValue" being the only place where "controlled vocabulary" meanings may be ascribed. The template for this collection would have a "principal" with a "name", a "marital status" and role in a "residence event". There is no ambiguity in the "marital status" and the marital status is not being derived from a string on the image, so putting the String "Married" in "original" doesn't seem right since that would still leave it to be "normalized" by some process. Ideally we would like to explicitly specify the enum URI "http://gedcomx.org/Married" right in the template. We could fill in the URI in the "NormalizedValue", but then it wouldn't look "original". We could fill in both the "original" and the "normalizedValue", but then the treatment processor might try to remove the normalized value and recalculate.

To solve this problem we could rename the "NormalizedValue" class to "Value" and change "original" and "interpreted" from "String" to "Value".

dkohlert commented 12 years ago

These are all good points. Let me address some of your points individually.

'Suppose there is a book called "Church of the Holy Trinity, Santiago, Chile, Marriages of January 1, 1810. " By the book title we come to know that every marriage in the book happened in the same church on the same date. Suppose that each page of the book lists a marriage and that the fields are "groom, groom father, groom mother, bride, bride father, and bride mother." The template for the collection would include an "original" marriage event but neither the date nor the place would have an "original value" that came from the image. '

Those places and dates are not derived, they are simply extracted from another location so they would still be original.

'All the people would have genders, but none of the genders would have an original, or normalized, value. The relationships are all "original" but they are all inferred by context.'

I would agree here, the genders and relationships are inferred so they should be in derived fields because we do not know that a bride's or groom's parents are actually married to each other. Since they are derived, they would not have original values.

Your second example collection would follow the same logic. The 'Marriage' status is not extracted from each image but rather from the book itself, as are the place and date, so it would go into the 'original' field value.

Derived values, in my mind, are values that are derived from other values, such as the gender of a bride or groom, an estimated birth year from a date and an age, etc. Values that you can extract from the source image itself or the container of the source image (book, file, etc. that may have additional values that can be extracted) would not be derived fields and would be placed in the 'original' value of a field.

carpentermp commented 12 years ago

I'm not sure I clearly addressed the posts by @dkohlert and @thomast73. The proposal would seem to be to add a "derived" boolean to "Field" (or possibly to GenealogicalResource). I suppose the default would be "original" and anyone or any process that derives something, like a birth year from an age, would set the "derived" bit on the Birth event, and all its sub-parts. Did I get that right?

I'm still having trouble understanding the use case that makes this necessary. It seems to me that all values that are not "original" are derived. My understanding of "normalized" values is that they are "system assigned" and that the system is free to change them (or delete them) at will. Are you imagining a scenario where the system is deleting all "normalized" values (possibly during treatment re-application) and trying to decide what parts of the "structure" may also be deleted? In SoRD, we had a "trim" function that would walk the Record and trim out anything that was found to be "empty". Each kind of thing had its own definition for "empty" e.g. a date with no value; an event with no date and no place; a person with no gender, name, characteristic, or roles in events; a relationship without both persons, etc. The idea was that "empty" objects convey no genealogical meaning e.g. a "birth event" with no date or place is genealogically no different than the absence of the event, so go ahead and delete it.

Why not continue to behave this way? It seems like you are trying to preserve empty fields in the record--to what purpose? Are you imagining that you have identified a set of rectangles on an image where data should be but that some of these rectangles don't actually contain data because they were left empty on the form and you don't want to lose these rectangles, possibly to allow another user to come along later and supply a value? Then fields with associated source rectangles are not considered "empty".
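To make the "trim" idea concrete, here is a minimal sketch of such a pass. It is not the SoRD implementation; the classes, attribute names, and sample data are assumptions chosen to mirror the definitions of "empty" listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    type: str
    date: Optional[str] = None
    place: Optional[str] = None
    participant_ids: List[str] = field(default_factory=list)


@dataclass
class Person:
    id: str
    gender: Optional[str] = None
    names: List[str] = field(default_factory=list)
    characteristics: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    person1: Optional[str] = None
    person2: Optional[str] = None


@dataclass
class Record:
    persons: List[Person] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


def event_is_empty(event: Event) -> bool:
    # An event with no date and no place conveys nothing.
    return not event.date and not event.place


def person_is_empty(person: Person, events: List[Event]) -> bool:
    # A person with no gender, name, characteristic, or role in a kept event.
    has_role = any(person.id in e.participant_ids for e in events)
    return not (person.gender or person.names or person.characteristics or has_role)


def trim(record: Record) -> Record:
    """Return a copy of the record without its empty parts."""
    events = [e for e in record.events if not event_is_empty(e)]
    persons = [p for p in record.persons if not person_is_empty(p, events)]
    kept_ids = {p.id for p in persons}
    # A relationship is kept only if both of its persons survive.
    relationships = [r for r in record.relationships
                     if r.person1 in kept_ids and r.person2 in kept_ids]
    return Record(persons, events, relationships)


record = Record(
    persons=[Person(id="p1", names=["John Smith"]), Person(id="p2")],
    events=[Event(type="Birth", participant_ids=["p2"])],
    relationships=[Relationship("p1", "p2")],
)
trimmed = trim(record)
print(trimmed)
```

In this example the birth event has no date or place, so it goes; that leaves `p2` with nothing at all, so the person and the relationship that depended on it go as well.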

carpentermp commented 12 years ago

genders and relationships are inferred so they should be in derived fields

Are you suggesting that these ought to be "interpreted" values? They seem more like "original" values to me. In the field "Groom Father Name" the gender and relationship to groom are implied by the word "father". There is no ambiguity. The information comes directly from the image. Though not from a place where a person actually wrote something, I don't see why that makes any difference to it being "original".

stoicflame commented 12 years ago

I'm still having trouble understanding the use case that makes this necessary.

I think I'm in the same boat as @carpentermp. I just don't get it.

Maybe I can approach this in a different way. Let's pretend we added a derived flag to the following:

field
relationship
persona
event

What would be the documentation for that flag?

dkohlert commented 12 years ago

You are right, gender in this case would be "original." However, the relationship between the groom's father and the groom's mother is not original because the record does not say they are married. For that case, we actually need a way of marking a relationship or an event as being 'implied.' I think the reason we need this is that I could easily see a record that would contain both the principal's marriage info and the parents' marriage info. In this case both the principal's marriage and the parents' marriage are explicitly stated on the record. So I think we should either not include 'implied/derived' values/objects in a record or we need to somehow mark them as implied/derived. We also might want a way to mark values that we place in a record for internal purposes, such as batch number etc.

I know that when we get data from a third party it takes some time to figure out what all of the fields are that they are giving us. It would be nice to know exactly which fields are explicit values from the original record or the context of the original record and which fields are their own internal fields for tracking as well as any derived fields that they have added.

carpentermp commented 12 years ago

@dkohlert wrote:

the relationship between the groom's father and the groom's mother is not original because the record does not say they are married

Again, I would personally not consider this case as "implicit" because I don't think a "couple relationship" implies marriage, and I would say that if two people have a child in common that is sufficient to consider them a "couple".

However, your point is well taken. Let's consider a case that I really would consider implicit--the census case. Suppose a household of three people, the HEAD, WIFE, and SON. The implicit relationship here is the WIFE-SON parent-child relationship. It is true that there is a little "hole" in the model with respect to relationships. For fields, there is no ambiguity about the "originality" of the values--we have that modeled explicitly (only "original" values are original). But the existence of a "relationship" is genealogically interesting in and of itself. It doesn't need any "characteristics" or anything like that to justify its existence.

In SoRD, we handled this with "attribution". In the census case, when generating a relationship between the mother and the children we put a lower confidence on the relationship and a "reason" of "children of the head are usually children of the wife of head." The attribution also made it possible to tell the difference between a live "user" and a "process". To me, this still seems like a reasonable way of handling this.
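A sketch of what such an attribution might look like follows. This is not the SoRD or gedcomx attribution model; the class names, the numeric confidence scale, the value 0.7, and the contributor name are all assumptions for illustration. Only the "reason" text comes from the comment above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContributorKind(Enum):
    USER = "user"
    PROCESS = "process"


@dataclass
class Attribution:
    contributor: str
    kind: ContributorKind
    confidence: float            # assumed scale: 0.0 (none) to 1.0 (certain)
    reason: Optional[str] = None


@dataclass
class Relationship:
    type: str
    person1: str
    person2: str
    attribution: Attribution


def inferred_by_process(relationship: Relationship) -> bool:
    """True when the relationship was generated by a process, not a live user."""
    return relationship.attribution.kind is ContributorKind.PROCESS


# The implicit WIFE-SON relationship from the census example.
wife_son = Relationship(
    type="ParentChild",
    person1="WIFE",
    person2="SON",
    attribution=Attribution(
        contributor="census-treatment",   # hypothetical process name
        kind=ContributorKind.PROCESS,
        confidence=0.7,                   # illustrative value only
        reason="children of the head are usually children of the wife of head",
    ),
)
print(inferred_by_process(wife_son))
```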

carpentermp commented 12 years ago

Also, as far as I can tell, no one has commented on my suggestion that "NormalizedValue" be renamed to "Value" and that "original" be of type "Value" instead of "String". I would like to get everyone's thoughts on this. Remember, my reasons for this were for when the document specifies some unambiguous enum value (like "married" or "male") but has no area on the form where this information was written in--it is just known from the context. In this case, it is not good to have to put in a String value like "Male" that later has to be treated in order for the gender to be unambiguously known. It would be better if the template could specify "http://gedcomx.org/Male" as the original value. This is not possible as long as "original" is a "String."
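To show what this suggestion would allow, here is a hedged sketch. The `Value` and `Field` shapes below are assumptions based on this comment, not the actual classes, and `needs_normalization` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    text: Optional[str] = None      # free text, as written on the document
    resource: Optional[str] = None  # controlled vocabulary URI, if known


@dataclass
class Field:
    original: Optional[Value] = None
    interpreted: Optional[Value] = None
    normalized: Optional[Value] = None


def needs_normalization(fld: Field) -> bool:
    """An original that is only text still has to be treated; a URI does not."""
    value = fld.original
    return value is not None and value.resource is None and bool(value.text)


# Gender known from the template context: the enum URI is the original value.
from_template = Field(original=Value(resource="http://gedcomx.org/Male"))

# Gender written on the form as text: a treatment still has to resolve it.
from_form = Field(original=Value(text="Male"))

print(needs_normalization(from_template), needs_normalization(from_form))
```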

ianstiles commented 12 years ago

Original - Put in the record's field value from the image, or some other known value that applies to the record at a higher grouping level (book title or cover, chapter, range of dates, etc.). If the field is present, but no value was written in, then the value should be a blank string, not a null.
Interpreted - What the indexer puts in when they think the record author meant something else.
Normalized - Put standardized, normalized, or derived values here that add further value to the field and record. Leave untouched values as null; for example, when inferring the gender, leave the original value as null but fill in the normalized value.

So here is the meaning of the three states of a value:
Null - there was no place on the image or collection for the value.
Blank - there was a field on the image, but no written value available.
Filled - there was a field and it had a value.

This set of rules meets the requirements of knowing the state of the data and when to normalize, allow deletes, etc.

It is proposed that we adopt these sets of rules and definitions and leave the model as is.
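Read literally, the three states could be checked as in the sketch below. The `ValueState` enum and `value_state` function are hypothetical names introduced for the example.

```python
from enum import Enum
from typing import Optional


class ValueState(Enum):
    NULL = "no place on the image or collection for the value"
    BLANK = "a field on the image, but no written value"
    FILLED = "a field that had a value"


def value_state(original: Optional[str]) -> ValueState:
    """Classify an original value according to the proposed rules."""
    if original is None:
        return ValueState.NULL
    if original == "":
        return ValueState.BLANK
    return ValueState.FILLED


for sample in (None, "", "Con."):
    print(repr(sample), value_state(sample).name)
```

Under these rules, everything hinges on producers consistently writing an empty string rather than a null for a blank form field.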

carpentermp commented 12 years ago

I'm personally very reluctant to define a distinction between "null" and "empty" in this fashion because:

I believe it will be error prone--define it as explicitly as you will, people/systems will get it wrong.
It precludes the paring of "empty" portions of a Record, as I described in an earlier post. Paring creates a third state--"null" vs. "empty" vs. "missing" (meaning the structure that would house that value doesn't even exist in the Record). You can equate "missing" to "null", but unless you also equate "null" to "empty", you can't delete portions of a Record that are found to be "empty of data". We have gotten big savings from this technique and I'm loath to create a model that precludes it since, even if the pipeline no longer chooses to do this, there may be others using the model who don't feel this way.

I believe we already have a way of making the distinction you want, but it all comes down to the definitions of "original", "interpreted", and "normalized", and the definition of "Field label".

Let me propose a slight modification to the wording of your definitions:

Original - A value that came directly from the source. This includes the fields of a form; un-fielded information captured directly from the image, e.g. a name written in the margin; but also stuff known by the context, like "bride's gender", "census year", etc.
Interpreted - A value that a person infers. Generally, this will be an inference from the "original" value of the same Field, but not necessarily. It may be an inference from other information on the Record, e.g. a person infers a full name from the known parts.
Normalized - A value that the system infers. Often, this will be an inference from the "original" or "interpreted" value of the same Field. However, it may be an inference from other values on the Record e.g. the system may estimate the birth year from the "age". (With this definition of "normalized" it could be argued that a better name might be "system" or as Jeff once proposed, "processed". Calling it "normalized" is confusing when an "original" or "interpreted" value is actually as normalized as it can get, e.g. an "original" gender whose value is "http://gedcomx.org/Male.")

In these definitions, "interpreted" and "normalized" are always derived and thus, theoretically, could always be deleted and reconstituted. We would probably never want the system to delete "interpreted" values because reconstituting them would ostensibly take user intervention. However, since the system is in charge of "normalized" ("system") values, it would always be free to delete or change them.

In addition, I would make no distinction in the model between "null" vs. "empty" vs. "missing" values. Thus, fields with empty "original" and "interpreted" values may be deleted by the system--whether derived or not. Events with empty Date and Place Fields may be deleted, etc. This allows users of the model to "pare", if they choose.

This still leaves us with your desire to express, "the Record had a field (little "f") on a form, but the field was left blank." This case only comes up with "fielded" Records, which is to say, you have a "form", with a set of "fields", that repeats over a set of Records (usually the whole collection). Since the fields repeat over every Record, the knowledge of "what fields are on the form" need only be stored/expressed once--at the collection (or uniform record set) level. This is basically what we do today--every collection has a template, with a set of fields and what we sometimes call the "display map" (information about which fields should be displayed, and how). When displaying a record, we show the fields defined in the "display map"; paring a record makes no difference to how it is displayed (or to our knowledge about what fields existed on the form). "missing" values are shown the same as "empty" or "null".

The last part of the solution comes from how we define "Field label". From reading this thread I believe we may have differing assumptions about how Field labels ought to be used. I propose the following definition:

Field label - An identifier for a Field that represents a field (little "f") on a form, unique to the form from which the Field was extracted.

Note that, with this definition, when a treatment generates something (like a "birth event" from an Age), it should not generate field labels for the generated values. Field labels are only for giving an identifier to fields on a form. Derived values don't come from the form, so they don't get a field label.

This is contrary to what we do today. Because Records came to us from EASy, already treated, and because a Record, for them, is nothing more than a set of pipe-delimited values, we ended up with many field labels on derived values--there was no other choice. This was "handy" for display maps where we would use field labels to indicate what "fields" to display and where we tend to display "normalized" (e.g. PR_NAME--a derived value) rather than "original" values. Unfortunately, using field labels in display maps is very problematic. It provides no way to display data from a semantic standpoint, e.g. "display the 'best' name of the principal's father". It breaks down completely for free-form Records that have no field labels at all.

For these reasons, it would seem that we need another, more general, way of identifying what to display. I would suggest XPath expressions (or perhaps something XPath-like). In addition, I would suggest that most Record information could be displayed without any collection-specific directives at all.

Despite this, there will always be form-specific fields that we may want to refer to and this is where the "Field label" comes in. I would say that, from a model standpoint, "Field label" should always be considered optional. However, when present, it would indicate a "Field" that came from a "form". This would give you the ability to make the distinction you want to make, "this Field came from a field (little "f") on a form, but the field was left blank." To do this you would choose not to pare Fields with Field labels, even when empty or null. You would do this, not as a model constraint, but as an implementation choice, because of your desire to express this information explicitly on the Record, rather than refer them to the collection where the set of fields on the form are listed.
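The implementation choice described above could be sketched like this. The `Field` dataclass, the `pare` function, its `keep_labeled` switch, and the sample labels and values are assumptions for illustration, not part of the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    label: Optional[str] = None        # present only for fields from a form
    original: Optional[str] = None
    interpreted: Optional[str] = None
    normalized: Optional[str] = None


def is_empty(fld: Field) -> bool:
    # No distinction between null and empty: both count as "no value".
    return not fld.original and not fld.interpreted


def pare(fields: List[Field], keep_labeled: bool = True) -> List[Field]:
    """Drop empty fields; optionally keep those that carry a field label."""
    return [f for f in fields
            if not is_empty(f) or (keep_labeled and f.label is not None)]


fields = [
    Field(label="AGE", original="30"),
    Field(label="OCCUPATION", original=""),  # on the form, left blank
    Field(normalized="1870"),                # system value, can be reconstituted
]

print([f.label for f in pare(fields)])
print([f.label for f in pare(fields, keep_labeled=False)])
```

With `keep_labeled=True` the blank `OCCUPATION` field survives, which is how the Record itself would say "this field was on the form but left blank." The unlabeled system value is pared either way, since the system can regenerate it.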

dkohlert commented 12 years ago

I agree with carpentermp that we should not be distinguishing between 'null' and 'empty' and about how one can determine if a field is on the record or not. I have also been pushing for some time the idea that many structured records can, for the most part, be displayed without a display map. The sooner we can get rid of the current display map the better!

stoicflame commented 12 years ago

I really appreciate @carpentermp's comments. He's got a lot of great insight. There are a few things I still don't understand; I'd appreciate some help.

The first question is pretty simple: does it work in all cases to say that "derived" fields (also known as "inferred" or "calculated" fields) are the same as fields without a field label?

The second question is regarding @carpentermp's definition of original, interpreted, and normalized:

Original - A value that came directly from the source.

Interpreted - A value that a person infers.

Normalized - A value that the system infers.

I, for one, hadn't ever included the notion of attribution (i.e. whether the user said it or the system said it) with the definitions of original, interpreted, and normalized. I had always thought, for example, that either the user or the system could set the normalized value.

So here are the rest of my questions:

Does everybody agree that @carpentermp has proposed the right definitions for "original", "interpreted", and "normalized"?
If we do agree, can we rename "normalized" to "processed" to be more clear as to its purpose?
If we do agree, can we open another issue to discuss how to provide for the case where a user specifies a normalized value (e.g. picks a value from a list)?

carpentermp commented 12 years ago

does it work in all cases to say that "derived" fields (also known as "inferred" or "calculated" fields) are the same as fields without a field label?

In answer, I will refer to this from my earlier post:

I would say that, from a model standpoint, "Field label" should always be considered optional. However, when present, it would indicate a "Field" that came from a "form".

If "Field label" is optional, then the absence of the label cannot be used to infer that a given field did not come directly from a form. Its presence, however, means that it definitely did. So the short answer is no, "missing field label" does not mean "inferred". However, "missing original value" does mean "may be pared without loss" and, as an implementation choice, our system may decide not to pare Fields that are: "missing original value, with field label."
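
The paring rule described here could be sketched roughly as follows. This is only an illustration in Python: the `Field` class, its attribute names, and the `keep_labeled_blanks` switch are hypothetical stand-ins, not the model's actual API. Null and empty originals are treated the same way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    # Hypothetical stand-in for the model's Field; names are assumptions.
    label: Optional[str] = None       # optional; present only for form fields
    original: Optional[str] = None    # value taken directly from the source
    normalized: Optional[str] = None  # system-supplied value


def may_pare(field, keep_labeled_blanks=True):
    """Decide whether a Field can be dropped from the Record.

    A Field with an original value is never pared. A Field without one may
    be pared without loss, unless the implementation chooses to keep blank
    Fields that carry a label (i.e. blank fields from a form).
    """
    if field.original:  # None and "" are deliberately treated alike
        return False
    if keep_labeled_blanks and field.label is not None:
        return False
    return True


print(may_pare(Field(label="PR_AGE")))    # blank form field
print(may_pare(Field(normalized="1850")))  # no original, no label
```

Note that the label only influences the decision when the original value is missing; its absence is never used to conclude that a Field was inferred.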

We could have a discussion about whether or not making Field label optional is the right choice; it just didn't seem right to me to insist upon it being there.

carpentermp commented 12 years ago

I missed the meeting on the definitions of "original", "interpreted", and "normalized". I apologize--I completely spaced it. Did the meeting take place? What was the outcome?

I have been thinking about how the definitions for "original", "interpreted", and "normalized/processed" play into the usage of Attribution. This has brought to light more issues and tradeoffs.

Let me begin by giving definitions for "original", "interpreted" and "normalized" that are closer to what I think most people were thinking before I made my proposal. This will allow a comparison between the two approaches:

Original (String) - came right off the image/source and was "entered/filled-in information"--not something known from context. Stuff known from context would be supplied as "normalized".
Interpreted (String) - a value that is the direct interpretation of the "original" value. Makes no sense if "original" is blank. Would generally be provided by a user, but it could conceivably be filled in by the System (or we may decide to stipulate that only users supply these values and let the system supply only normalized values).
Normalized (NormalizedValue) - most normal/prettiest value. The reason for its existence may be:
- The system (or a user) "normalized" an "original" or "interpreted" value, making it "prettier" and/or calculating "what it means" e.g. by supplying an enum value
- The system (or a user) created the entire field by inference from other data in the record. In this case, the "original" and "interpreted" values will be empty.
- The value may have been known by context e.g. gender of Bride. In this case, as in the previous, the "original" and "interpreted" values will be empty.

This leaves us with the problem that has been noted by several--when there is no "original" value, how can you tell if the "normalized" value is "from the source" or "inferred"? To me, this information seems logically to belong to the "Attribution." Attribution is information about who or what put this information here, when, why, and with what confidence level.

One approach would be to stipulate that "inferred" fields must have an Attribution--the absence of an Attribution could be a clear indicator that the value was known from context. But, would the presence of an Attribution be a clear indicator of "inferred"? No, that will not do. A user may have added a piece of information "known from context". Perhaps we should add a special "ConfidenceLevel" value that means "known from context"? Or perhaps a separate "inferred" flag? This second approach does not appeal to me because it only makes sense when "original" is empty. When original is not empty it would be a contradiction to have the "inferred" flag set. Perhaps it would be better to put the "inferred" flag on the "NormalizedValue" class. Thus, instead of indicating that the Attribution is inferred, it would indicate only that the NormalizedValue was inferred. With this approach, if the original value is not empty then the NormalizedValue is guaranteed to be inferred, whether set or not. If original value is empty, then the inferred flag comes into play.
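
A minimal sketch of how an "inferred" flag on NormalizedValue might be read, following the rule in the paragraph above. The class shapes and the `normalized_is_inferred` helper are assumptions for illustration only; the flag was a proposal under discussion, not part of the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizedValue:
    # Hypothetical shape; only `inferred` matters for this sketch.
    text: str
    inferred: bool = False


@dataclass
class Field:
    original: Optional[str] = None
    normalized: Optional[NormalizedValue] = None


def normalized_is_inferred(field):
    """Apply the proposed rule for reading the flag."""
    if field.normalized is None:
        return False
    if field.original:
        # A non-empty original means the normalized value was derived
        # from it, so the flag is ignored.
        return True
    # No original: the flag separates "inferred from other data"
    # from "known from context".
    return field.normalized.inferred


bride_gender = Field(normalized=NormalizedValue("Female"))              # known from context
birth_year = Field(normalized=NormalizedValue("1850", inferred=True))   # derived from an age
state = Field(original="Con.", normalized=NormalizedValue("Connecticut"))
```

The sketch makes the awkward part visible: the flag carries information only when `original` is empty, which is why it sits on the value rather than on the Attribution.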

We will also need a way of distinguishing between "user-inferred" and "system-inferred." Attribution has a Contributor. What is the mechanism for telling the difference between user contributors and "the system"? Would an Attribution with no contributor mean "the system did it"? Or should there be a special "system contributor"? In SoRD we had a "process identifier" so the system could know not only that the system did it, but which process/treatment.

Assuming we iron all this out, we are left with an additional problem related to the fact that we have only one Attribution on a Field. The upshot of that fact is that when any change to a field is made, the "changer" is forced to take responsibility for everything in the Field. For example, suppose a user in an "ad hoc" extraction provides a new "original" value. Suppose another user "interprets" the original value. The original Attribution on the field is replaced by an Attribution to the "interpreter"--thus he has effectively taken credit for the "original" value as well. That's o.k.; it's not a stretch to say that whoever takes the effort to "interpret" a value is also re-asserting the original.

The problem really comes in when the system normalizes the value. If it changes the Attribution, the system ends up taking responsibility for the original value and its interpretation. We lose the real "who", "why", and "confidence" for the interpreted value. Not good. If we stipulated that only users can add interpretations, then at least we would remember that it was a user that gave us the interpretation, but we would lose why they thought it was right, and how confident they were. So do we leave the Attribution unchanged when the system normalizes a value? We can't. Because users can supply normalized values, that would make it appear that the user normalized the value. The "confidence level" (which really belongs to the "interpretation") would also appear to be the confidence of the normalization. Bummer.

Thus, with this approach, it appears there is no escaping the need for separate Attributions for original/interpreted, and normalized values.
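
The loss of information can be shown with a small Python sketch. Everything here is hypothetical: the `Attribution` and `Field` shapes, the contributor names, and in particular the `"system"` contributor, which stands in for whatever mechanism would eventually identify the system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    contributor: str
    confidence: Optional[str] = None
    reason: Optional[str] = None


@dataclass
class Field:
    original: Optional[str] = None
    interpreted: Optional[str] = None
    normalized: Optional[str] = None
    attribution: Optional[Attribution] = None  # only one per Field


def interpret(field, value, attribution):
    """A user interprets the original and takes over the single Attribution."""
    field.interpreted = value
    field.attribution = attribution


def system_normalize(field, value):
    """The system normalizes and, having only one slot, replaces the Attribution."""
    field.normalized = value
    field.attribution = Attribution(contributor="system")  # assumed marker


field = Field(original="Con.", attribution=Attribution("user-a"))
interpret(field, "Connecticut",
          Attribution("user-b", confidence="high", reason="state abbreviation"))
system_normalize(field, "Connecticut, United States")

# The interpreter's contributor, confidence, and reason are gone.
print(field.attribution)
```

Leaving the Attribution untouched in `system_normalize` would avoid the loss, but then the user's confidence would appear to apply to the normalization, which is the other half of the dilemma.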

Now let's turn our attention to how Attribution plays out with my suggested definitions for original, interpreted, and processed (rather than normalized) values. In this scheme, users supply original and interpreted values. They supply values as "original" when they come directly from the source (whether "filled-in", or "known from context"). As in the previous scheme, when a user "interprets" an original value, they take responsibility for both the original and the interpretation so we're still good with a single Attribution on the Field.

Now what about when the system "normalizes" an original or interpreted value, by adding a "processed" value to the Field? One option is to make no change to the Attribution. We have this option in this scheme because we know that all processed values are added by the system and no one else. It is true that the system has no way of expressing why it put the value there, or how confident it was in doing so. That is a legitimate limitation to this approach and I suppose we need to decide if it is o.k. or not.

However, there is some precedent for this approach. In the conclusion model, Dates and Places have an "original" and "normalized" value. The "original" is "what the user told us" and the "normalized" is "how the system decided to normalize it". When the system adds or removes a normalized value there is no Attribution for this, nor is the conclusion logically modified. So it works essentially the same way.

Thus, in this scheme, the system is free to add or remove "processed" values at will, and there is no ambiguity about what is "original" and what is "inferred", so no "inferred" flag is needed. If we decide that we can get by without "system Attributions" then we are also o.k. with a single Attribution as the model currently specifies.

In summary, here are the changes that would need to be made in each of the two options:

Option 1

Rename "normalized" to "processed" (or "system" or whatever)
Rename the class "NormalizedValue" to "Value"
Change "original" and "interpreted" to be of type "Value"

Option 2

Add an "inferred" flag to "NormalizedValue" (Not ideal, but probably the least weird place for this.)
Add an Attribution to "NormalizedValue" (This has issues, but may be the least of all evils.)
Decide how to indicate the "System contributor." (special value?)

stoicflame commented 12 years ago

The meeting did happen and it turns out that most everybody was already on board with the notion that the "normalized" property of Field is the system-processed value. It was just me that was confused, mostly by the naming. So here are the concepts that were formalized:

original: Text directly extracted from the record field. What you see is what you get, including misspellings and other errors.
interpreted: User interpretation of what the original value means, used optionally as needed to enhance the original value by correcting misspellings and other ambiguities. The interpretation is different from a conclusion because it should be made only within the context of the record and not be based on knowledge obtained from other sources.
processed: Programmatic interpretation of the value based on an algorithm that considers the original and interpreted values.
A normalized value is a value whose text has been formatted for the purpose of easier processing (perhaps for display purposes). Normalization might be based on a known standard.
A standardized value is a value that has been resolved to a discrete, machine-identifiable value based on a specific standard. A value that has been standardized will either refer to a specific item of a constrained vocabulary (via resource references) OR constrain the value to a standard using the datatype, creating an RDF Typed Literal.

At 2b39e0f3746e1d6a57a5f12bb888752c7fab6353, 92f1ec9484311e284f5c7472c633078bd8717da8, d5dad15ce1588383888274bb9f89a6448a5c5869, and d17d9296c126f833e89cdb8076dc0b93a719c2b4 the following actions were taken:

Rename normalized to processed.
Rename NormalizedValue to FormalValue (because it handles both normalization and standardization).
Change processed to be of type FormalValue.
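
As a rough picture of the resulting shape, here is a Python sketch of a Field whose `processed` property is a FormalValue. The attribute names on `FormalValue` (`text`, `resource`, `datatype`), the `kind` helper, and the example URI are assumptions drawn from the definitions above, not the actual class definitions in the referenced commits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormalValue:
    # Assumed attributes, based on the definitions of normalized and
    # standardized values given in this comment.
    text: Optional[str] = None      # formatted text
    resource: Optional[str] = None  # reference into a constrained vocabulary
    datatype: Optional[str] = None  # standard that constrains the text

    def kind(self):
        """Classify the value as 'standardized' or merely 'normalized'."""
        if self.resource or self.datatype:
            return "standardized"
        return "normalized"


@dataclass
class Field:
    original: Optional[str] = None           # plain text from the record
    interpreted: Optional[str] = None        # plain text from a user
    processed: Optional[FormalValue] = None  # supplied by the system


normalized_place = Field(
    original="Con.",
    interpreted="Connecticut",
    processed=FormalValue(text="Connecticut, United States"),
)
standardized_place = Field(
    original="Con.",
    # Placeholder URI, not a real vocabulary entry.
    processed=FormalValue(resource="http://example.org/places/connecticut"),
)
print(normalized_place.processed.kind(), standardized_place.processed.kind())
```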

Note that original and interpreted are not of type FormalValue. I don't think that's necessary, but not everybody's aligned on that yet. I'd invite @carpentermp to open up a separate issue describing the need for making original and interpreted properties of type FormalValue so we can discuss.

FamilySearch / gedcomx

Distinguishing Between Original and Derived Data #86