FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.

http://www.gedcomx.org

Apache License 2.0

Change type of "original" and "interpreted" from String to FormalValue #96

Closed carpentermp closed 12 years ago

carpentermp commented 12 years ago

Note that original and interpreted are not of type FormalValue. I don't think that's necessary, but not everybody's aligned on that yet. I'd invite @carpentermp to open up a separate issue describing the need for making original and interpreted properties of type FormalValue so we can discuss.

I made these arguments in my posts on the other thread.

Remember my example--the "gender" of the "bride". There is no "String" for this value. The gender is known by context. It is not "processed", but "original", because it comes directly from the source. But what string can be used? "Female"? Suppose it's a Cyrillic collection? Would we put the Cyrillic string for "female"? This is really not satisfactory. It leaves room for interpretation or standardization, but the "original" value is fully standardized: "http://gedcomx.org/Female". We currently have no way of expressing a FormalValue as "original".

The same can be true for the "interpreted" value. Suppose the following scenario (I have actually seen this one):

A census with the following columns:

name
male
female
relationship to head

The "male" and "female" columns are check-boxes. Now suppose the following values:

name: John
male: (empty)
female: X
relationship to head: son

The record is contradictory. It seems to say that John is female, but is also a son. A user may see this and want to "interpret" the gender as "male". Once again there is no "original" string on the record that can be directly interpreted.
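The contradiction in this census row can be sketched in a few lines. This is only an illustration: the dictionary layout, the `IMPLIED_GENDER` mapping, and both helper functions are assumptions made for the example, not part of the GEDCOM X model.

```python
from typing import Optional

MALE = "http://gedcomx.org/Male"
FEMALE = "http://gedcomx.org/Female"

# Hypothetical mapping from relationship terms to the gender they imply.
IMPLIED_GENDER = {"son": MALE, "daughter": FEMALE}


def checkbox_gender(row: dict) -> Optional[str]:
    """Gender indicated by the check-box columns, or None if unclear."""
    male = bool(row.get("male"))
    female = bool(row.get("female"))
    if male == female:  # neither box or both boxes checked
        return None
    return MALE if male else FEMALE


def is_contradictory(row: dict) -> bool:
    """True when the check-boxes and the relationship imply different genders."""
    checked = checkbox_gender(row)
    relationship = row.get("relationship to head", "").strip().lower()
    implied = IMPLIED_GENDER.get(relationship)
    return checked is not None and implied is not None and checked != implied


row = {"name": "John", "male": "", "female": "X", "relationship to head": "son"}
print(is_contradictory(row))  # True
```

Note that neither helper has a transcribed gender string to work from; the only sensible result of a user's interpretation here is a formal value such as `http://gedcomx.org/Male`.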

I still think we need better clarity about our definitions of "original", "interpreted", and "processed". Here are the definitions given by @stoicflame as the last post of #86:

original: Text directly extracted from the record field. What you see is what you get, including misspellings and other errors.
interpreted: User interpretation of what the original value means, used optionally as needed to enhance the original value by correcting misspellings and other ambiguities. The interpretation is different from a conclusion because it should be made only within the context of the record and not be based on knowledge obtained from other sources.
processed: Programmatic interpretation of the value based on an algorithm that considers the original and interpreted values.

These definitions do not take into consideration "original" values that do not come from a record field, or user "interpretations" that are not direct interpretations of an original value that came from a record field. Contrast these definitions with the ones I gave:

Original - A value that came directly from the source (whether from a "record field" or not).
Interpreted - A value that a user infers (whether inferred directly from an "original" value or not).
Processed - A value that the system infers.

If these definitions are accepted, then we need both "original" and "interpreted" to be of type "FormalValue". If people don't like these definitions, then we are forced to consider the second set of definitions I gave as "option 2":

Original (String) - came right off the image/source and was "entered/filled-in information"--not something known from context. Stuff known from context would be supplied as "normalized".
Interpreted (String) - a value that is the direct interpretation of the "original" value. Makes no sense if "original" is blank. Would generally be provided by a user, but it could conceivably be filled in by the System (or we may decide to stipulate that only users supply these values. Let the system supply only normalized values).
Formal (FormalValue) - most normal/standard/prettiest value. The reason for its existence may be:
- The system (or a user) "normalized" and/or "standardized" an "original" or "interpreted" value, making it "prettier" and/or calculating "what it means" e.g. by supplying an enum value
- The system (or a user) created the entire field by inference from other data in the record. In this case, the "original" and "interpreted" values will be empty.
- The value may have been known by context e.g. gender of Bride. In this case, as in the previous, the "original" and "interpreted" values will be empty.

If these definitions are preferred, then we need to deal with the issues I outlined in #86, which will take a few more model tweaks. Here are the sort of tweaks I mean:

Rename "processed" to "formal" (calling it "processed" doesn't make sense if it's going to have direct-from-the-source values in it)
Add an "inferred" flag to "FormalValue" (Not ideal, but probably the least weird place for this.)
Add an Attribution to "FormalValue" (This has issues, but may be the least of all evils.)
Decide how to indicate the "System contributor." (special value?)

If we go for option 2, then I believe we need to consider more deeply the "record modification model" (what kinds of changes can be made to a record, and how those changes are modeled) because of the weirdness of "Field" having an Attribution, and also "FormalValue" (which is part of Field) likewise having an Attribution.
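As a rough illustration of the two options described in this comment, the difference comes down to which slots of a Field can hold a formal value. This is a sketch only: the class and attribute names below are assumptions for the example, not the actual GEDCOM X classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormalValue:
    """Hypothetical stand-in for the model's FormalValue."""
    text: Optional[str] = None      # normalized display text
    resource: Optional[str] = None  # URI of a standardized value, if any


@dataclass
class FieldOption1:
    """Option 1: all three slots are FormalValues."""
    original: Optional[FormalValue] = None     # came directly from the source
    interpreted: Optional[FormalValue] = None  # inferred by a user
    processed: Optional[FormalValue] = None    # inferred by the system


@dataclass
class FieldOption2:
    """Option 2: original/interpreted are strings; only formal is structured."""
    original: Optional[str] = None        # text written on the record
    interpreted: Optional[str] = None     # user's reading of that text
    formal: Optional[FormalValue] = None  # normalized, inferred, or contextual


FEMALE = FormalValue(resource="http://gedcomx.org/Female")

# Gender of the bride, known from context rather than from written text.
bride_gender_opt1 = FieldOption1(original=FEMALE)
bride_gender_opt2 = FieldOption2(formal=FEMALE)
```

Under option 1 the contextual value sits in `original`; under option 2 it can only go in `formal`, with `original` and `interpreted` left empty.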

stoicflame commented 12 years ago

Thank you for opening up this thread. I'm interested to hear comments from @dkohlert and @jeffph about this.

ianstiles commented 12 years ago

The Original value should be a String and only represent what you can actually see on the image.

Any "inferred" fields are created separately with the Original string equal to null and the FormalValue filled out accordingly.

For Interpreted field values, they should just be Strings also. If it will result in an enumerated type in the FormalValue, then that String value should be recognized as something that will result in the proper enum.

So, let's keep the Field structure as:

String orig;
String interpreted;
FormalValue formalValue;

No inferred flag necessary. The System will only build this and so no other contributor info is necessary.

carpentermp commented 12 years ago

@ianstiles, you haven't addressed this:

Remember my example--the "gender" of the "bride". There is no "String" for this value....What string can be used? "Female"? Suppose it's a Cyrillic collection? Would we put the Cyrillic string for "female"? This is really not satisfactory. It leaves room for interpretation or standardization, but the "original" value is fully standardized: "http://gedcomx.org/Female". We currently have no way of expressing a FormalValue as "original".

Your answer to "put a string that the system will formalize correctly" does not satisfy. It makes it appear to users and to the system that there was a place on the form for the information. By your own first statement:

The Original value should be a String and only represent what you can actually see on the image.

The gender cannot be seen on the image.

ianstiles commented 12 years ago

I'm saying that "The Original value should be a String and only represent what you can actually see on the image." Therefore you cannot get a field with an Original value if it is not on the image, so for an inferred field like Gender, only the FormalValue is filled (orig and interpreted are null). For your Cyrillic case, there is no string for "female", but only an enum for Female.

The Interpreted value is something that the indexing-user would put, which if Cyrillic, would be the Cyrillic string for "female". This would be understood correctly by the system and the corresponding FormalValue would be auto-created correctly with the enum Female. Please note that the Interpreted value is only filled if there is an Original value AND it is visible on the image.

The gender cannot be seen on the image.

Exactly. If there is no gender on the image, there should be no field with an original value for gender, only the FormalValue.

carpentermp commented 12 years ago

OK, then it seems you are advocating what I called "option 2", i.e.:

Original (String) - came right off the image/source and was "entered/filled-in information"--not something known from context. Stuff known from context would be supplied as "normalized".

Interpreted (String) - a value that is the direct interpretation of the "original" value. Makes no sense if "original" is blank. Would generally be provided by a user, but it could conceivably be filled in by the System (or we may decide to stipulate that only users supply these values. Let the system supply only normalized values).

Formal (FormalValue) - most normal/standard/prettiest value. The reason for its existence may be:

The system (or a user) "normalized" and/or "standardized" an "original" or "interpreted" value, making it "prettier" and/or calculating "what it means" e.g. by supplying an enum value

The system (or a user) created the entire field by inference from other data in the record. In this case, the "original" and "interpreted" values will be empty.

The value may have been known by context e.g. gender of Bride. In this case, as in the previous, the "original" and "interpreted" values will be empty.

If this is the approach, I proposed these model changes:

Rename "processed" to "formal" (calling it "processed" doesn't make sense if it's going to have direct-from-the-source values in it)

Add an "inferred" flag to "FormalValue" (Not ideal, but probably the least weird place for this.)

Add an Attribution to "FormalValue" (This has issues, but may be the least of all evils.)

Decide how to indicate the "System contributor." (special value?)

Would you care to comment on how you feel about these changes?

jeffph commented 12 years ago

I'm also an advocate for option 2. Here's my take on the proposed model changes:

Rename "processed" to "formal" (calling it "processed" doesn't make sense if it's going to have direct-from-the-source values in it)

I'm good with this. The term "formal" is more consistent with the "FormalValue" class name anyway.

Add an "inferred" flag to "FormalValue" (Not ideal, but probably the least weird place for this.)

I agree we need an indicator like this. We may also (or only?) need a confidence level.

Add an Attribution to "FormalValue" (This has issues, but may be the least of all evils.)

I'm not sure why this is necessary. It seems the Attribution at the Field level would be sufficient to track the last edit on the Field. For example, we don't have separate Attributions for each member of Relationship (e.g. type, persona1, persona2, etc.) or Persona (name, gender, etc.); we only have one.

My understanding is that the Attribution indicates who/what last edited this object, regardless of whether it's a user or process. Do we need to track user and process changes separately for attributable objects?

Decide how to indicate the "System contributor." (special value?)

Yes, I believe this would be just like any other user reference, using the ResourceReference class. We would just have a special URI for that process.
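A minimal sketch of that idea, assuming a reserved URI identifies the system process. The URI below is a placeholder invented for this example (the thread does not define one), and the classes are simplified stand-ins for ResourceReference and Attribution.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical URI for the system process; the thread does not define one.
SYSTEM_CONTRIBUTOR_URI = "https://example.org/contributors/system"


@dataclass(frozen=True)
class ResourceReference:
    """Minimal stand-in: a reference to a contributor by URI."""
    resource: str


@dataclass
class Attribution:
    contributor: ResourceReference
    change_message: Optional[str] = None  # the "why", if given


def is_system(attribution: Attribution) -> bool:
    """True when the attribution points at the reserved system contributor."""
    return attribution.contributor.resource == SYSTEM_CONTRIBUTOR_URI


by_user = Attribution(ResourceReference("https://example.org/contributors/bob"))
by_system = Attribution(ResourceReference(SYSTEM_CONTRIBUTOR_URI))
print(is_system(by_user), is_system(by_system))  # False True
```

The appeal of this approach is that no new field is needed: the system is just another contributor, and consumers compare the reference against a known value.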

carpentermp commented 12 years ago

Add an Attribution to "FormalValue" (This has issues, but may be the least of all evils.)

I'm not sure why this is necessary. It seems the Attribution at the Field level would be sufficient to track the last edit on the Field. For example, we don't have separate Attributions for each member of Relationship (e.g. type, persona1, persona2, etc.) or Persona (name, gender, etc.); we only have one.

My understanding is that the Attribution indicates who/what last edited this object, regardless of whether it's a user or process. Do we need to track user and process changes separately for attributable objects?

I made these arguments in my post on issue #86. I defined the 3 main parts of Field this way:

Original (String) - came right off the image/source and was "entered/filled-in information"--not something known from context. Stuff known from context would be supplied as "formal".

Interpreted (String) - a value that is the direct interpretation of the "original" value. Makes no sense if "original" is blank. Would generally be provided by a user, but it could conceivably be filled in by the System (or we may decide to stipulate that only users supply these values. Let the system supply only formal values).

Formal (FormalValue) - most normal/prettiest value. The reason for its existence may be:

The system (or a user) "normalized" an "original" or "interpreted" value, making it "prettier" and/or calculating "what it means" e.g. by supplying an enum value

The system (or a user) created the entire field by inference from other data in the record. In this case, the "original" and "interpreted" values will be empty.

The value may have been known by context e.g. gender of Bride. In this case, as in the previous, the "original" and "interpreted" values will be empty.

Then, after explaining why we need an "inferred" flag, I dealt with why a single Attribution doesn't suffice:

Assuming we iron all this out, we are left with an additional problem related to the fact that we have only one Attribution on a Field. The upshot of that fact is that when any change to a field is made, the "changer" is forced to take responsibility for everything in the Field. For example, suppose a user in an "ad hoc" extraction provides a new "original" value. Suppose another user "interprets" the original value. The original Attribution on the field is replaced by an Attribution to the "interpreter"--thus he has effectively taken credit for the "original" value as well. That's o.k.; it's not a stretch to say that whoever takes the effort to "interpret" a value is also re-asserting the original.

The problem really comes in when the system normalizes the value. If it changes the Attribution, the system ends up taking responsibility for the original value and its interpretation. We lose the real "who", "why", and "confidence" for the interpreted value. Not good. If we stipulated that only users can add interpretations, then at least we would remember that it was a user that gave us the interpretation, but we would lose why they thought it was right, and how confident they were. So do we leave the Attribution unchanged when the system normalizes a value? We can't. Because users can supply normalized values, that would make it appear that the user normalized the value. The "confidence level" (which really belongs to the "interpretation") would also appear to be the confidence of the normalization. Bummer.

Thus, with this approach, it appears there is no escaping the need for separate Attributions for original/interpreted, and normalized values.
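The loss described above can be shown with a toy model. This is a sketch under assumptions: the `Field` and `Attribution` classes are simplified stand-ins, and the values ("Wm", "William", the contributor names) are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    contributor: str                      # who
    change_message: Optional[str] = None  # why
    confidence: Optional[str] = None      # how sure they were


@dataclass
class Field:
    original: Optional[str] = None
    interpreted: Optional[str] = None
    formal: Optional[str] = None
    attribution: Optional[Attribution] = None  # one attribution for the whole field


# A user in an "ad hoc" extraction supplies the original value.
field = Field(original="Wm", attribution=Attribution("user_a"))

# A second user interprets it and becomes the attributed contributor.
field.interpreted = "William"
field.attribution = Attribution("user_b", "common abbreviation", "high")

# The system normalizes the value and replaces the attribution with its own.
field.formal = "William"
field.attribution = Attribution("system")

# user_b's "who", "why", and confidence can no longer be read from the field.
print(field.attribution)
```

Leaving the attribution untouched in the last step avoids this loss, but then the formal value appears to come from `user_b`, which is the other half of the problem.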

stoicflame commented 12 years ago

(I'm pretty sure @carpentermp didn't intend to close the issue yet, so I'm reopening it.)

@jeffph said:

I'm good with this. The term "formal" is more consistent with the "FormalValue" class name anyway.

I'm fine with this, too, but let me get this straight: you're proposing a backwards-incompatible change to the model, right?

Add an "inferred" flag to "FormalValue" (Not ideal, but probably the least weird place for this.)

I'd suggest that since the notion of "inferred" is unique to the record model, we create another object, perhaps InferrableFormalValue that extends FormalValue to provide the inferred flag. We could also add attribution here if it's decided it's needed (still awaiting @jeffph's response to @carpentermp).
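Roughly, the subclass could look like the sketch below. Only the name InferrableFormalValue comes from this comment; the attributes on both classes are assumptions for illustration, and the birth year is a made-up value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormalValue:
    """Simplified stand-in for the shared FormalValue."""
    text: Optional[str] = None
    resource: Optional[str] = None


@dataclass
class InferrableFormalValue(FormalValue):
    """Record-model extension that adds the inferred flag."""
    inferred: bool = False


# A birth year derived from an age on the record rather than written on it.
birth_year = InferrableFormalValue(text="1850", inferred=True)
print(birth_year)
```

Keeping the flag on a subclass means the base FormalValue used elsewhere in the model stays unchanged.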

stoicflame commented 12 years ago

Hey, @carpentermp, I'm still not convinced about the need for an attribution on the formal value. If a system formalizes a value, then I think the attribution should be left unchanged.

To which you would respond:

We can't. Because users can supply normalized values, that would make it appear that the user normalized the value. The "confidence level" (which really belongs to the "interpretation") would also appear to be the confidence of the normalization. Bummer.

Why should those things be assumed? To me it just says that the user extracted this field with the specified "5 W's". The field, as a whole, is rightly attributed to the user.

carpentermp commented 12 years ago

The user didn't apply the formalValue, but the Attribution says that they did. How can you tell the difference between when the user applies a formalValue, and when the system does it? (If you say that only the system can apply the formalValue, then you are essentially arguing for option 1.)

stoicflame commented 12 years ago

How can you tell the difference between when the user applies a formalValue, and when the system does it?

From the change history. And you don't need to point out to me that the change history isn't modeled yet. :-)

I'm saying I don't think there is an appreciable value in being able to tell the difference when working at the level of records and fields. The field is attributed to the user. Good enough.

I don't think a well-behaved system should run background processes that infringe upon the attribution of the field.

carpentermp commented 12 years ago

Then I still say you are arguing for option 1 (which I prefer myself). If you are going to have the system putting data into a Record without leaving behind an Attribution, then let it be in a place where only the system is allowed to put data so that there will be no ambiguity. Option 1 provides this.

Your suggestion that you can check the change history to find out the 5 W's for the formalValue does not satisfy. In every other case, Attribution serves this purpose, without the need to resort to the change history. To say, "look in the change history to tell if this formalValue was given by a user or by the system" is tantamount to saying, "why don't we just put all Attributions in the change history." Attributions are genealogically significant. They help people make inferences about the reliability of the data so they can make other genealogical conclusions. This is why Attribution is in the Record and not in the change history.

ranbo commented 12 years ago

I'd be careful about depending on change history to take the place of attribution. They have two different goals, which are a little tricky to keep straight.

Attribution helps a genealogist decide how much to trust the data. If they see only a standardized place ("Georgia, United States"), they may wonder whether that's right and if the record really said that. Seeing that the record said "Geo" and that a user interpreted that as "Georgia" probably helps raise their confidence. Knowing whether it was a user who picked "Georgia, United States" from a list vs. a system process that decided that also helps them decide how much to trust that interpretation of the place. So one bummer with the current model is that you can't tell the difference.

A change log, on the other hand, is not meant to help a genealogist decide how much they trust the data. It is meant to protect data against being purged by someone who is intentionally or accidentally making it worse. The latest thinking on why the current data is right and should be believed belongs in the attribution.

stoicflame commented 12 years ago

I think we're all in alignment with the nature and significance of attribution. What I'm looking for is an answer to the following question:

Why is it so important to be able to tell whether the formal value was provided by the system?

carpentermp commented 12 years ago

Attribution is where the "who" question is answered. If you ask, "why do we need to know who?" it is essentially the same as asking, "why is Attribution important?", which is why we went into a discussion about the importance of Attribution.

Let me try out a scenario. Suppose a simple Field:

originalValue: "geo"

User "Bob" decides "geo" means "Georgia", so the field is changed to:

originalValue: "geo"
userValue: "Georgia"
who: Bob

The system, using Place Authorities, decides Georgia means "Georgia, United States" and so adds a formalValue but leaves the Attribution unchanged:

originalValue: "geo"
userValue: "Georgia"
formalValue: "Georgia, United States"
who: Bob

Anyone looking at this Field would believe that Bob gave us "Georgia, United States." That lends undue credibility to the judgement call that "Georgia" meant the one in the United States, vs. Georgia the country. Someone evaluating the value for correctness might think, "well, a person thought this was true, so I guess it is true." Whereas, if they could see that the decision was made by the system they might think, "ah, the system got this one wrong. I'll fix it."
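To make the ambiguity concrete, here is a small sketch comparing a field-level attribution with a per-value one. The class names, the `contributor` attribute on the formal value, and the `apparent_formalizer` helper are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributedFormalValue:
    text: str
    contributor: Optional[str] = None  # hypothetical per-value attribution


@dataclass
class Field:
    original_value: Optional[str] = None
    user_value: Optional[str] = None
    formal_value: Optional[AttributedFormalValue] = None
    who: Optional[str] = None  # field-level attribution


def apparent_formalizer(field: Field) -> Optional[str]:
    """Who a reader would credit with the formal value."""
    if field.formal_value is None:
        return None
    # Fall back to the field-level attribution when the formal value has none.
    return field.formal_value.contributor or field.who


# System adds the formal value and leaves the field attribution unchanged.
shared = Field("geo", "Georgia",
               AttributedFormalValue("Georgia, United States"), who="Bob")

# Same data, but the formal value carries its own attribution.
separate = Field("geo", "Georgia",
                 AttributedFormalValue("Georgia, United States", contributor="system"),
                 who="Bob")

print(apparent_formalizer(shared))    # Bob
print(apparent_formalizer(separate))  # system
```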

Attribution is all about staying clear about who did it, when they did it, and why they did it. A system that lies or is ambiguous about this is genealogically broken.

So far, most of the comments have been, "I prefer option 2" but there has been a lot of resistance to making the changes that are needed to make the option genealogically sound. I agree that these changes are ugly. Because of this, I strongly prefer option 1 (as a refresher, option 1 was to make "original" and "user" of type FormalValue instead of String, and to rename "formal" to "processed"). I proposed option 2 only because everyone seemed to be thinking that way and to show the ugliness of the changes that would be necessary to really make that work. What no one has done yet, is make any arguments against option 1. As deep as it has gotten is to say, "I prefer option 2."

I would like to suggest that everyone take another look at option 1 and offer some real feedback as to its pros and cons.

stoicflame commented 12 years ago

What no one has done yet, is make any arguments against option 1. As deep as it has gotten is to say, "I prefer option 2."

Fair enough. I can only speak for myself (I'd hope the others would respond with their own concerns), but I like option 1 less than option 2 because of the significant complexity that it adds to the model. I also believe that option 1 adds confusion to the nature of the original and interpreted by allowing values that are not strings. The only thing "ugly" that I see about option 2 is adding attribution to the formal value, so I'm poking specifically at that right now.

I just don't buy your use case; I just can't bring myself to imagine that a well-behaved system would do what you're suggesting. If a system decides that "Georgia" means "Georgia, United States", then that seems like a significant enough change that the system is taking responsibility for the field value and the attribution should change to the system contributor. But I thought we were talking about normalization processes that had enough confidence to apply a formal value as just a decoration of the original and interpreted values and therefore wouldn't mean that the system is taking responsibility for the value of the field.

carpentermp commented 12 years ago

More ugliness about option 2:

Inferred flag. The inferred flag is made even more ugly by the fact that it is only needed when original and user are null. When not null, inferred is always the case. Of course, it will seem a contradiction to have original not null, and normal not indicate that it is inferred, so we have created a situation where people will feel compelled to set the flag for the sake of consistency and we have complicated everyone's life.
Not all original values fit in original; not all user-supplied values fit in user. In option 2, since original is of type String, it can only be used for direct transcriptions of human-written values. Other original values known from context must use the formal value. Because user is also a String, it can only be used as a String interpretation of a human-written value, not a full interpretation that might produce a FormalValue. These kinds of interpretations will need to be placed in formal. Because of this, the names original and user are misleading.

stoicflame commented 12 years ago

(BTW, it's not original and user, it's original and interpreted.)

When not null, inferred is always the case. Of course, it will seem a contradiction to have original not null, and normal not indicate that it is inferred, so we have created a situation where people will feel compelled to set the flag for the sake of consistency and we have complicated everyone's life.

I think your idea of the inferred is different from mine, because this doesn't make sense to me. I understand inferred to be only applicable when there is no original; i.e. when there's data on the record from which a value can be inferred, but isn't explicit. E.g. a gender of a persona identified as the "groom" by the record, but the record never explicitly states the persona is male.

In option 2, since original is of type String, it can only be used for direct transcriptions of human-written values.

Exactly. That makes the use of that property clear and explicit.

Other original values known from context must use the formal value.

Umm... there's no such thing as "other original values known from context". If they're known from context (and only from context) there is no original value. You can only infer formal values.

... (user) can only be used as a String interpretation of a human-written value.

Indeed. That's its purpose.

the names original and user are misleading.

Well, as I mentioned above, it's not original and user, it's original and interpreted. But anyway, what's misleading about them?

carpentermp commented 12 years ago

I think your idea of the inferred is different from mine, because this doesn't make sense to me. I understand inferred to be only applicable when there is no original.

That's exactly what I said. But the "inferred" flag is always present in the model, even when there is an original value. So what does it mean (whether set or clear) when an original value is present? Nothing, right? That's ugly.

Umm... there's no such thing as "other original values known from context". If they're known from context (and only from context) there is no original value. You can only infer formal values.

Please tell me I don't have to back up and make this argument again. I have already made long posts on this topic. There is nothing "inferred" about the gender of the bride in a marriage record. It's every bit as "original" as anything written on the form by a human. In fact, the purpose of the "inferred" flag is to distinguish between these "original" values, and those that are truly inferred (like a birth year from an age).

We have been dealing with a lot of concepts without formalizing the terminology. Here are some definitions that should help us to talk about the issues more easily:

original value. A value that comes directly from the Record. May be an original manuscript value or an original value known from context.
original manuscript value. A value transcribed from a hand-written entry on the Record.
original value known from context. A value known from the Record without being written e.g. a "bride name" field on a marriage record indicates the bride's gender as "female" without it being written on the form.
user interpretation. A value supplied by a user that provides the "meaning" of an original manuscript value.
user formalization. A kind of user interpretation where the user supplies a FormalValue as the interpretation e.g. "v" interpreted as "http://gedcomx.org/Male".
partial user interpretation. A kind of user interpretation where the user supplies a String as the interpretation e.g. "geo" interpreted as "Georgia". It is considered "partial" because the System may still supply a FormalValue that further interprets the meaning of the original value.
system interpretation or system formalization. A value supplied by the system that provides the "meaning" of an original manuscript value or partial user interpretation.
system decoration. A system formalization where the Attribution of the Field is unaffected.

With this terminology, I can respond to the issues you raised.

First, why are original and interpreted misleading names in option 2? The name original is misleading because not all original values go there--only original manuscript values. original values known from context must go in formal. The name interpreted is misleading because not all interpretations go there--only partial user interpretations. user formalizations (as well as system interpretations) must go in formal.

If a system decides that "Georgia" means "Georgia, United States", then that seems like a significant enough change that the system is taking responsibility for the field value and the attribution should change to the system contributor.

The system wants to formalize a value, and take credit for that formalization, without taking credit for the partial user interpretation. This is why more than 1 attribution is needed.

I thought we were talking about normalization processes that had enough confidence to apply a formal value as just a decoration of the original and interpreted values and therefore wouldn't mean that the system is taking responsibility for the value of the field.

It seems that the only kind of system formalization you believe in is system decoration and you seem to think that solves the problem. It doesn't. Remember that, in option 2, system decorations and user formalizations both go in formal. As you have described system-decorations, no change is made to the Attribution. Because of this, there is no way to distinguish between system decorations and user formalizations.

stoicflame commented 12 years ago

Well, 0.8.0 is due today, and I was hoping to bring this issue to a close, but it appears that instead I just managed to make it worse. So this issue will stay open for further discussion with more changes to be applied to future releases.

The one change that will be applied to 0.8.0 as proposed by @carpentermp and agreed to by @jeffph and @ianstiles is to rename the processed property to formal. Other pieces of this issue are still open for discussion.

carpentermp commented 12 years ago

It seems a little unfortunate that you did the rename while we're still up in the air since that was an option 2 change and if we go with option 1 (hope springs eternal) we will have to rename it back.

stoicflame commented 12 years ago

It seems a little unfortunate that you did the rename while we're still up in the air since that was an option 2 change and if we go with option 1 (hope springs eternal) we will have to rename it back.

I had forgotten that your preference would be to keep it named processed.

Fair enough. I'll leave it out of 0.8.0 to give you some more time to push your case. Personally, I'm going to choose to draw back from the discussion a bit to allow some of the other implementors to chime in. If you want this rolling, I'd suggest getting people like @jeffph, @ianstiles, and @dkohlert to sing your song.

ranbo commented 12 years ago

If a system decides that "Georgia" means "Georgia, United States", then that seems like a significant enough change that the system is taking responsibility for the field value and the attribution should change to the system contributor. But I thought we were talking about normalization processes that had enough confidence to apply a formal value as just a decoration of the original and interpreted values and therefore wouldn't mean that the system is taking responsibility for the value of the field.

But this is exactly the sort of thing our treatments would do: take a common string like "Georgia" or "GA" and decide that means "Georgia, United States". Whenever we normalize, we run the risk of making a mistake, but get the benefit of usually making the data better-looking and more consistent and searchable. Since a mistake may have been made by the system, an attribution there helps a later user know how much to trust what the record "says", i.e., they may trust a record more (or less) if a system guessed something than if a user picked it from a drop-down list. That's why a separate attribution for the processed value is important.

The original SoRD option here is to have attribution on original, interpreted and processed, which seems easiest to me to explain to someone. I can see requiring the user who creates an interpreted value to claim ownership for the original, too, though this won't always be accurate, since they may or may not be looking at the image when they do it.

dkohlert commented 12 years ago

IMHO, a user should not be able to edit a FormalValue. A FormalValue is the only value that is machine understandable, and thus we must ensure that it remains that way so no user can place a string value there. Instead, the user should only edit the "original" or "interpreted" values that are used by the system to come up with the FormalValue. So if a record states "Paris" as a birth place, then the original should stay "Paris". However, if from the context of the record it is clear that this is "Paris, ID", then the user should change the "interpreted" value to state such. The system would then update the FormalValue to reflect that change. If there is a purely implied FormalValue, then the user really should edit the field that was used to calculate that FormalValue, such as "Age". If we go with this approach, then you always know that the FormalValue is attributed to a process and not a human.

carpentermp commented 12 years ago

For the sake of clarity, are you saying you don't believe in "user formalizations"?

dkohlert commented 12 years ago

I am not sure what you mean by "user formalizations." However, the three members of a Field were originally named: original, interpreted, and normalized. The normalized member was to be the system understandable value for either the original value or the interpreted, with preference going to the interpreted if present. In general, I believe when an interpreted value is needed, it is because the original value is potentially so messed up that a system could not "normalize" it. Anyway, the intent of the new formal member of Field (the only one that is a FormalValue), IMO, is to hold the system provided value for either the original or interpreted value.

ranbo commented 12 years ago

Values don't have to be "so messed up" to be ambiguous. "Georgia" is a pretty clean value, but has two very different interpretations. We can have an algorithm attempt to guess what it means, but users can often bring context to bear that a system can get wrong. When the structure of a form tells us a gender, it is odd to come up with a string that we hope will be interpreted correctly by the system. When a user picks a place or gender or date from a list in order to disambiguate it, it is similarly risky to try to come up with a string that we hope will allow the system to automatically normalize it correctly, when we know what the answer is.

carpentermp commented 12 years ago

I defined user formalization in a previous post along with several other definitions. I would like to suggest that everyone review those definitions as a way to help us communicate better.

With regard to @dkohlert's response, you seem to be saying that users may only supply original and interpreted values, and only the system may supply formal values. Option 1 stipulates exactly this, but this requires that we change original and interpreted to be of type FormalValue for the reasons listed by @ranbo, and to properly satisfy the original-value-known-by-context case.

dkohlert commented 12 years ago

I believe that the system (UI) allowing a user to add either an original or interpreted value may indeed present users with the text portion of a FormalValue; after all, that is all it could present. I doubt it would present the resourceURI. That text would then be stored in original or interpreted as appropriate. This text could then be easily turned into the FormalValue that is stored in the formal member by the system doing the normalization.

So I suppose what I am saying is: no, I do not believe in user formalizations if that means that a user can store a FormalValue in the formal member of a field.

carpentermp commented 12 years ago

system (UI) [...] may indeed present users with the text portion of a FormalValue, after all, that is all it could present

What about the UI presenting users with a combo-box of genders or marital statuses? In that case the user is literally choosing the enum of the FormalValue, not any String.

jeffph commented 12 years ago

In that case, the string value of the enum would then easily be convertible to a FormalValue by the formalization process. This is the most consistent way to handle drop-downs since most will actually be combo-boxes (i.e. a drop-down list combined with a text box that allows a user to extend/override values in the list).

carpentermp commented 12 years ago

Also, you have still not dealt with the original-value-known-from-context case e.g. the bride's gender in a marriage record. The formal value (http://gedcomx.org/Female) is known from context. What String should be used in original to indicate this? "Female"? Suppose it is a Chinese collection. Should the string be the Chinese character for "female"? Doesn't putting a String value in this case seem like a kludge to you?

Why not just accept that original and interpreted can be FormalValues and all of these problems are solved?

jeffph commented 12 years ago

If they're choosing from enumerated values in a combo-box, then the enumerated values are easily localizable and convertible to/from a string. In this case, working within the context of a Chinese locale, when the user selects the enumerated value for "female", a Chinese translation of "female" would be automatically entered into the original or interpreted member, which would later be easily formalized into a FormalValue.

carpentermp commented 12 years ago

In that case, the string value of the enum would then easily be convertible to a FormalValue by the formalization process. This is the most consistent way to handle drop-downs since most will actually be combo-boxes (i.e. a drop-down list combined with a text box that allows a user to extend/override values in the list).

This ignores the fact that the user told us which enum value they wanted--yes, with a combo-box they could type something not in the list, but suppose for a moment that they just chose something from the list. Your suggestion that we toss this information and depend upon the system to get it back is problematic and genealogically unsound. The user could be seeing options in any language--hopefully their own language, which may or may not be the language of the collection. Recording the String for what the user saw will be jarring to subsequent users when the user's language is not the language of the collection e.g. imagine a U.S. census where the user interpretation of the marital status is a Chinese character.

In addition, such String values can only be reliably converted when:

the language is known--but the language is not recorded anywhere.
the system has the right mappings for the language in question. When only 1 system is involved, this isn't hard. But suppose I send this Record to another system (or perhaps a later edition of the same system). The new system considers all formal values to be system inferred and so they toss them and do their own inferences. If they don't have the same mappings, they don't get the same result.

These complications arise because we aren't recording what actually happened. We are making it look like the user provided a String value and the system inferred what that String meant. In reality, the user chose the meaning. In our model, FormalValue represents the fullest possible expression of what a value means. Insisting that the system, but not users, can tell us the full meaning of a value is arbitrary. Users can often determine the meaning better than the system can. Let them.

Oh and once again, you still haven't dealt with the original-values-known-from-context case.

dkohlert commented 12 years ago

Unfortunately, not all of the problems are solved if the type of "original" and "interpreted" is changed to FormalValue. Yes, it may handle the very minority case of being able to absolutely pick a single correct enumerated value. However, 95% plus of indexed data is not an enumeration, so there never would be a pick list, and most user provided FormalValues would only contain text. We would have to write special code to ask whether the text value of this FormalValue is something that we can tie to a resource (enumeration value) or not and, in the case where there is no enum value, whether we need to normalize the text in the FormalValue. You so easily discount the case where there is not a valid enum value for the user to pick. I believe that happens a lot more than you think.

If we ONLY use FormalValue for the "formal" member of Field and this is ONLY created by a system, then we can conclude that no further processing needs to take place on this FormalValue. If either the "original" or "interpreted" types are FormalValue, the system would have to analyze the "original" or "interpreted" FormalValues to see whether they need to be normalized. Being normalized is exactly what the FormalValue class was intended to mean: it is the normalized value. If a user can provide the text for a FormalValue and that text is not in the proper normalized format, that is no longer the case. And if that is what we want, then we really don't need a FormalValue class anymore and all three fields "original", "interpreted" and "formal" could be of type string.

We have a trade-off. If we use FormalValues for "original" and "interpreted", then the benefit is that we handle the bride gender cases easily. However, we do not handle the case when there is not a valid choice available for the user to pick, and for the 95% of fields that are not enumeration based, we will have to normalize the text within a FormalValue.

If we use FormalValues only for the formal member and strings for "original" and "interpreted", the system would never have to modify a FormalValue and we could know that the value came from a system. We would ALWAYS have to convert the "original" and "interpreted" strings to a FormalValue, but since we have to do this anyway for 95% of the fields, why complicate the code to handle a 5% use case? Why not convert some text to the FormalValue? As for your bride known gender use case, I would argue that the system allowing the user to index this record would know that it is dealing with a marriage and a bride and that the system would provide the FormalValue and place it in the formal member based on the user telling the system this person is a bride, not that the user state explicitly the gender of the bride. So it is really much more of a UI issue. I believe that every enumerated FormalValue will have a "default" text value. It is this default text value that would be stored, not the language specific version of the value. So the algorithm for formalizing an "original" or "interpreted" string would be:

If (field is enumerated) Then
    If (field.text is a valid enumerated text value) Then
        Store the appropriate FormalValue in the formal member
    Else
        Try to map the text to a valid FormalValue but keep the text in the original or interpreted fields.
Else
    Normalize the text and store it in a FormalValue in the formal member.
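A minimal Python sketch of the algorithm above, for illustration only. The names `FormalValue`, `GENDER_DEFAULTS`, `GENDER_ALIASES` and `formalize` are hypothetical and are not part of the GEDCOM X model; the "default text" table, the alias table, and the whitespace-collapsing "normalization" are stand-in assumptions for whatever a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FormalValue:
    """Hypothetical stand-in: display text plus an optional resource URI."""
    text: Optional[str] = None
    resource: Optional[str] = None


# Assumed "default text" for each enumerated gender value.
GENDER_DEFAULTS: Dict[str, str] = {
    "Male": "http://gedcomx.org/Male",
    "Female": "http://gedcomx.org/Female",
}

# Assumed looser mappings tried when the text is not a default value.
GENDER_ALIASES: Dict[str, str] = {
    "m": "http://gedcomx.org/Male",
    "f": "http://gedcomx.org/Female",
}


def formalize(text: str,
              enum_defaults: Optional[Dict[str, str]] = None,
              aliases: Optional[Dict[str, str]] = None) -> Optional[FormalValue]:
    """Return the value for the formal member, or None if nothing can be stored.

    The original/interpreted string is never modified by this function.
    """
    if enum_defaults is not None:  # the field is enumerated
        if text in enum_defaults:  # the text is a valid enumerated text value
            return FormalValue(text=text, resource=enum_defaults[text])
        # Otherwise try to map the text to a valid enumerated value.
        resource = (aliases or {}).get(text.strip().lower())
        return FormalValue(text=text, resource=resource) if resource else None
    # Not enumerated: "normalize" the text (here: just collapse whitespace).
    return FormalValue(text=" ".join(text.split()))


picked = formalize("Female", GENDER_DEFAULTS, GENDER_ALIASES)
unknown = formalize("blue", GENDER_DEFAULTS, GENDER_ALIASES)
place = formalize("  Paris,   ID ")
```

In this sketch an unmappable enumerated string such as "blue" leaves the formal member empty, which is one possible reading of "try to map"; the thread does not say what should happen when the mapping fails.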

I wish you would go over both cases of how a FormalValue would be used for "original" and "interpreted". The two cases are: 1) enumerated values including the user being able to specify a new value that is currently not part of the enumeration, such as "blue" for gender. 2) FormalValues for everything else. How does the system know when it needs to normalize or not a FormalValue?

dkohlert commented 12 years ago

I did some more thinking on this last night and I believe I have a solution that allows for user formalizations without requiring "original" and "interpreted" to be of type FormalValue, which we don't want because 95%+ of the time it is not a FormalValue. When a UI (system) allows a user to pick a valid enumeration, the SYSTEM will put a FormalValue in the "formal" member and the text portion of the formal member in either "original" or "interpreted". This is basically what we were saying all along, except that instead of formalizing the text in "original" or "interpreted" later and storing it in the "formal" member, the UI just does it up front. That will eliminate the language issue that was brought up before.

I believe this solution satisfies the needs of both supporting user formalizations and the case where the user may often need to specify a value that is not one of the available enum values. In that case the UI would simply store the text in "original" or "interpreted" and would not store anything in the "formal" member.

ianstiles commented 12 years ago

+1 I agree with Doug and would further clarify to store this enumerated drop down string value in the interpreted, but when the user hand-types a value, that goes in original.

carpentermp commented 12 years ago

I'm sorry but this does not solve the problem. You focused on one of the problems I mentioned to the exclusion of the others. Yes, your solution makes it so that the system can safely put the correct value in formal. But as I said in my previous post:

When only 1 system is involved, this isn't hard.

However, the core problem remains: to anyone looking at the resultant Field it appears that the system inferred the gender from a String value. Putting the correct value in formal doesn't change this.

I am looking for a solution that records the truth about the field, a solution that not only records what the field said and any inferences about what that means, but also who/what made those inferences, when, and why. Options 1 and 2 both do this.

In answer to questions you raised in your post:

I wish you would go over both cases of how a FormalValue would be used for "original" and "interpreted". The two cases are: 1) enumerated values including the user being able to specify a new value that is currently not part of the enumeration, such as "blue" for gender. 2) FormalValues for everything else. How does the system know when it needs to normalize or not a FormalValue?

Use case: user is presented with a marital status combo-box. Two options:

User chooses "Married" in the list. interpreted is a FormalValue so: interpreted.text = null and interpreted.resource = "http://gedcomx.org/Married". The system knows that no further formalization is necessary because marital status formalizes to an enum and resource is not null.
User types "Blue". interpreted.text = "Blue", interpreted.resource = null. The system knows that further formalization is needed because resource is null.
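The two marital-status options above can be sketched in Python as follows. This is only an illustration under assumptions: `FormalValue` here is a hypothetical two-member class (`text`, `resource`) mirroring the members named in the use case, and `needs_formalization` is an invented helper that applies the "resource is null" test to an enumerated field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormalValue:
    """Hypothetical shape of an interpreted value under Option 1."""
    text: Optional[str] = None
    resource: Optional[str] = None


def needs_formalization(value: FormalValue) -> bool:
    """For an enumerated field, a value is fully formalized once it has a resource."""
    return value.resource is None


# The user chooses "Married" from the list: only the enum resource is recorded.
chosen = FormalValue(resource="http://gedcomx.org/Married")

# The user types "Blue": only the text is recorded, so the system still has work to do.
typed = FormalValue(text="Blue")
```

The check is specific to fields that formalize to an enum; names, dates, and places would each need their own notion of "fully formalized".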

Use case: user interprets "Wm" to mean "William".

interpreted.text = "William" and interpreted.resource = null. Names don't formalize to enums. Instead, when treating names we just "make them prettier." The system is probably not as good at doing this as a human is, so perhaps the system declines to do further formalization. But really, it's up to the system. It could decide to change the case, or even the spelling if it thinks it has a better idea than the user had. If the system gets it wrong, at least anyone looking at the record can see what the user thought, and what the system thought.

From these two examples, we can see that what "fully formalized" means depends on the type of data. Names, dates, places, and controlled vocabularies will all behave a little differently, but the system attempting to formalize the value of a field knows what it's trying to achieve--what the fully formalized state of that data is. When the data achieves that state, no further formalization is necessary.

Finally, I would like to point out for the third time that your arguments up to now have only dealt with the user formalization case. You still need to address original-values-known-from-context case.

dkohlert commented 12 years ago

However, the core problem remains: to anyone looking at the resultant Field it appears that the system inferred the gender from a String value. Putting the correct value in formal doesn't change this.

Exactly!!! The "formal" member is always inferred from a string value of either "original" or "interpreted". It is just in a minority number of cases where an enum is involved that the system letting the user provide the string value helps the user by providing the string value that can definitely be formalized.

User chooses "Married" in the list. interpreted is a FormalValue so: interpreted.text = null and interpreted.resource = "http://gedcomx.org/Married". The system knows that no further formalization is necessary because marital status formalizes to an enum and resource is not null.
User types "Blue". interpreted.text = "Blue", interpreted.resource = null. The system knows that further formalization is needed because resource is null.

This is exactly what I am trying to avoid. IMO, a FormalValue once created should not have to be modified and is ALWAYS fully normalized. Otherwise, every single time a piece of code encounters it, that piece of code CANNOT assume that it has been "fully normalized."

Finally, I would like to point out for the third time that your arguments up to now have only dealt with the user formalization case. You still need to address original-values-known-from-context case.

I think I did cover this case.

I would argue that the system allowing the user to index this record would know that it is dealing with a marriage and a bride and that the system would provide the FormalValue and place it in the formal member based on the user telling the system this person is a bride, not that the user state explicitly the gender of the bride.

In this case there would be no "original" nor "interpreted" values. So this is the only exception where the "formal" member is not inferred from a string in "original" or "interpreted".

carpentermp commented 12 years ago

You say that this answers the original-value-known-from-context case:

I would argue that the system allowing the user to index this record would know that it is dealing with a marriage and a bride and that the system would provide the FormalValue and place it in the formal member based on the user telling the system this person is a bride, not that the user state explicitly the gender of the bride.

In this case there would be no "original" nor "interpreted" values. So this is the only exception where the "formal" member is not inferred from a string in "original" or "interpreted".

It does not. Since you don't seem to want to express yourself in terms of the definitions I gave, let me try to translate. I believe you are saying that original-values-known-from-context are placed in formal. At the same time you say that only the system can put values in formal. Unfortunately, users can supply original-values-known-from-context. For example, in an ad hoc extracted record all original-values are user supplied--there is no template from which a fixed set of fields can be mapped into a Record.

Seeing the strong aversion you have to changing the datatype of original and interpreted to FormalValue, I believe you really would prefer option 2. In option 2, original-values-known-from-context are placed in formal as well as user-formalizations. It has the advantage you treasure so highly that there can only be 1 formal value so there is no ambiguity about when a value is "fully formalized." By allowing true user-formalizations (not this kludgy "string value that can definitely be formalized") it eliminates one of my big issues with what you propose.

Unfortunately, with option 2, a single Attribution just doesn't get the job done. As a refresher, I include this description of option 2 from a previous post:

Option 2

original (String) - came right off the image/source and was "entered/filled-in information"--not something known from context. Stuff known from context would be supplied as "formal".

interpreted (String) - a value that is the direct interpretation of the original value. Makes no sense if original is blank. Would generally be provided by a user, but it could conceivably be filled in by the System (or we may decide to stipulate that only users supply these values. Let the system supply only formal values).

formal (FormalValue) - most normal/standard/prettiest value. The reason for its existence may be:

The system (or a user) "normalized" and/or "standardized" an "original" or "interpreted" value, making it "prettier" and/or calculating "what it means" e.g. by supplying an enum value

The system (or a user) created the entire field by inference from other data in the record. In this case, the original and interpreted values will be empty.

The value may have been known by context e.g. gender of Bride. In this case, as in the previous, the original and interpreted values will be empty.

I also gave these as necessary model changes:

Rename "processed" to "formal" (calling it "processed" doesn't make sense if it's going to have direct-from-the-source values in it)

Add an "inferred" flag to "FormalValue" (Not ideal, but probably the least weird place for this.)

Add an Attribution to "FormalValue" (This has issues, but may be the least of all evils.)

It may be possible for the Attribution to indicate that the value is "inferred" (via "confidence"?), thus eliminating the need for the explicit inferred-flag (which, as I have explained, has some ickiness.)
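As a rough illustration of the Option 2 shape described above, here is a Python sketch. It is not the GEDCOM X model: the class and member names (`Attribution.contributor`, `Attribution.confidence`, `FormalValue.inferred`, `FormalValue.attribution`) are assumptions chosen to mirror the listed changes, and the contributor strings are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attribution:
    """Hypothetical: who asserted a value and how confident they were."""
    contributor: str
    confidence: Optional[str] = None


@dataclass(frozen=True)
class FormalValue:
    text: Optional[str] = None
    resource: Optional[str] = None
    inferred: bool = False                      # proposed "inferred" flag
    attribution: Optional[Attribution] = None   # proposed second Attribution


@dataclass(frozen=True)
class Field:
    original: Optional[str] = None      # entered information read off the source
    interpreted: Optional[str] = None   # direct interpretation of original
    formal: Optional[FormalValue] = None

    def __post_init__(self) -> None:
        # Option 2: interpreted "makes no sense if original is blank".
        if self.interpreted is not None and self.original is None:
            raise ValueError("interpreted requires an original value")


# Gender of the bride, known by context: original and interpreted stay empty.
bride_gender = Field(
    formal=FormalValue(resource="http://gedcomx.org/Female",
                       attribution=Attribution("user:indexer")))

# "Wm" interpreted by a user as "William", then formalized by the system.
given_name = Field(
    original="Wm",
    interpreted="William",
    formal=FormalValue(text="William",
                       attribution=Attribution("system:normalizer")))
```

Whether a known-by-context value should set `inferred` is left open here; the sketch treats it as not inferred because it comes directly from the source.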

Also as a refresher, I will post the reason I gave for why another Attribution is needed:

There is an additional problem related to the fact that we have only one Attribution on a Field. The upshot of that fact is that when any change to a field is made, the "changer" is forced to take responsibility for everything in the Field. For example, suppose a user in an "ad hoc" extraction provides a new "original" value. Suppose another user "interprets" the original value. The original Attribution on the field is replaced by an Attribution to the "interpreter"--thus he has effectively taken credit for the "original" value as well. That's o.k.; it's not a stretch to say that whoever takes the effort to "interpret" a value is also re-asserting the original.

Randy points out that it's also not entirely correct either, because the "interpreter" doesn't necessarily have access to the image, so he may not be able to verify the original value. Anyway, going on...

The problem really comes in when the system formalizes the value. If it changes the Attribution, the system ends up taking responsibility for the original value and its interpretation. We lose the real "who", "why", and "confidence" for the interpreted value. Not good. If we stipulated that only users can add interpretations, then at least we would remember that it was a user that gave us the interpretation, but we would lose why they thought it was right, and how confident they were. So do we leave the Attribution unchanged when the system formalizes a value? We can't. Because users can supply formal values, that would make it appear that the user formalized the value. The "confidence level" (which really belongs to the "interpretation") would also appear to be the confidence of the formalization. Bummer.

Thus, with this approach, it appears there is no escaping the need for separate Attributions for original/interpreted, and formal values.
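The loss described above can be shown with a small Python sketch. Everything in it is hypothetical: the two field classes, the `Attribution` members, and the contributor names are invented to contrast a Field carrying one Attribution with a Field carrying separate Attributions for original/interpreted and for formal.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Attribution:
    contributor: str
    confidence: Optional[str] = None
    reason: Optional[str] = None


@dataclass(frozen=True)
class SingleAttributionField:
    original: Optional[str]
    interpreted: Optional[str]
    formal: Optional[str]
    attribution: Attribution  # one Attribution covers everything in the field


@dataclass(frozen=True)
class SplitAttributionField:
    original: Optional[str]
    interpreted: Optional[str]
    formal: Optional[str]
    value_attribution: Attribution                    # original/interpreted
    formal_attribution: Optional[Attribution] = None  # formal value only


user = Attribution("user:alice", confidence="high",
                   reason="Wm is a common abbreviation of William")
system = Attribution("system:normalizer")

# One Attribution: when the system formalizes, it must overwrite the user's.
single = SingleAttributionField("Wm", "William", None, user)
single_after = replace(single, formal="William", attribution=system)

# Two Attributions: the system records its own without touching the user's.
split = SplitAttributionField("Wm", "William", None, user)
split_after = replace(split, formal="William", formal_attribution=system)
```

After formalization, `single_after` no longer carries the user's reason or confidence, while `split_after` keeps both alongside the system's attribution.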

Will you agree to adding a second Attribution to Field? Randy would actually suggest that original, interpreted and formal each have their own Attribution, but I suppose this is too much to hope for?

dkohlert commented 12 years ago

I think the problem we are having here is we are trying to solve more than one issue at a time. There are at least two issues here, attribution and original-values-known-from-context and maybe "user formalizations" is a third. So for this post I am only going to address attribution.

IMO, for just the record model, I am not sure we care which user or which system a value came from. I think all we need to know is 1) did it come from a user or from a system, 2) is this a value that was actually on the record or inferred from context and 3) what is the confidence of the value. We can determine 1) if we say that only users provide "original" or "interpreted". The inferred flag can be stored on the field itself. The difficult one is the confidence level. I would argue that confidence level is only needed when a system infers a value. I say that because if you allow a user to specify their confidence level, you will always have to take any "high" level of confidence with a grain of salt because the user MAY think he/she is 100% right when in fact they are not. Plus I am not sure we would ever get a user to specify their confidence level. So I would let other users determine if a value is accurate or not by looking at the image itself. A confidence level makes much more sense on a system inferred value where the algorithm can hard-code the value. The problem with some inferred values is that they are often dependent on other fields that were not inferred and I don't think we want to refer to all of the fields that may have contributed to an inferred field.

Given all of that, it seems like the current model is overkill so making it even more overkill is not needed as far as attribution goes. I would actually prefer to simplify.

I understand that all of this is needed for the conclusion model, so I once again wonder if we are making the record model more complex than it needs to be just for the sake of trying to make it look like the conclusion model.

jeffph commented 12 years ago

I also believe attribution is absolutely required in the conclusion model, but is very low-value/high-cost in the record model.

ranbo commented 12 years ago

We are proposing a genealogical data standard for the industry to use. Attribution is essential in a genealogically robust data model, including in the record model. This is especially important in a collaborative system where it is not always clear where the information is coming from. When a user looks at source data to decide whether to draw conclusions based on it, they need to be able to evaluate how much they trust it. Things that influence how much one user can trust the information in a record include (a) whether it came from a user or a process; (b) whether it was what the record said, or what the record strongly implied, or even what the user knew already beyond what the record said.

This will become more important as users are allowed to correct or update records.

I would say this is an essential feature, and that the cost comes mostly from the time spent battling, not from the difficulty or impact of actually implementing either of the proposals above.

stoicflame commented 12 years ago

cost comes mostly from the time spent battling, not from the difficulty or impact of actually implementing either of the proposals above.

Well if you wouldn't be so obstinate then maybe we could stop battling! (That's a joke. I hope that's clear. I couldn't resist.)

So in an attempt to contribute some perspective, let me see if I can point out the two ends of the spectrum of opinion so we can focus on the compromise in the middle.

On one end, we've got the proposal to keep things as is with no inherent support for user-supplied formal values or for being able to identify values derived from context. (I might have even heard a proposal for no attribution at the field level, but that seems like an outlier to me, so I'm going to ignore that).

On the other end, we've got the proposal to add detailed structure to support user-supplied formal values, values derived from context, and to be able to assign attribution at each of those field sub-values.

I propose that an appropriate compromise would be to support user-supplied formal values and values derived from context by changing the name of the processed property to formal and adding a flag for identifying whether it was derived from context. And how do you determine whether a user or a process submitted the formal value? You can't. Yet. At least not with the standard. If your application requires that level of detail, then you'd have to use the extension mechanism.

carpentermp commented 12 years ago

Thanks @ranbo for your comments. I was waiting for your response to respond myself. Thanks to @dkohlert and @jeffph for openly confessing what I have long suspected--you really don't care about Attribution. Forgive me for saying it, but in my opinion, anyone who doesn't care about Attribution doesn't really understand genealogy. Genealogical research is, at its core, an iterative process of conclusion-making, with lower order conclusions forming the basis for higher order conclusions. I like to think of it as a "tower" of conclusions. With a tower, upper levels are only as secure as levels below. If there is a fault in a lower level, it puts the entire tower at risk. Because of this, researchers are forced to regularly evaluate, and reevaluate, the soundness of the tower at all levels. To properly support this process, I believe any genealogically sound model should have this as an axiom:

For each piece of genealogical data modeled, the model should provide a way to faithfully capture the 5 W's--what, who, when, why, where (source).

While there is room for difference of opinion about the appropriate level of granularity for capturing Attribution, the model should never be ambiguous or misleading.

It was this that motivated me to create this issue. The current model simply does not provide a way to faithfully "attribute" the data in all cases. I proposed a couple of small changes that I thought would be the minimum needed to eliminate the ambiguities and give cleaner, more understandable, definitions for the three values of Field. When that was rejected I proposed Option 2--again, what I felt would be the absolute minimum needed. I have really tried not to be disruptive to the current model, but I had one goal in mind: to properly attribute all the data.

@dkohlert wrote:

I think the problem we are having here is we are trying to solve more than one issue at a time. There are at least two issues here, attribution and original-values-known-from-context and maybe "user formalizations" is a third.

This issue is all about Attribution. original-values-known-from-context and user formalizations are two scenarios (among others) that require proper attribution. I have been looking for a model that properly attributes all scenarios.

@dkohlert also wrote:

I understand that all of this is needed for the conclusion model, so I once again wonder if we are making the record model more complex than it needs to be just for the sake of trying to make it look like the conclusion model.

I have heard you voice this concern many times and each time I find it to be without basis. Where the concerns are the same between the Record and Conclusion profiles of GedcomX, it is not only right, but very desirable that the two profiles should have the same model. It makes the whole much simpler and easier to understand. Attribution is just such a cross-cutting concern.

@stoicflame wrote:

And how do you determine whether a user or a process submitted the formal value? You can't... If your application requires that level of detail, then you'd have to use the extension mechanism.

It's not a question of the application needing that level of detail. Users need that level of detail in order to evaluate the data. The extension mechanism is not the appropriate place for a core model concept.

We have probably already taken too much time on this. I realize that after gathering input from all interested parties, ultimately, @stoicflame decides. It is also pretty clear that I have not been able to persuade him (or anyone else).

@stoicflame, your proposed compromise, if I understand it, would be to do two of the three things I suggested in Option 2. The remaining change, which you do not agree to, would be to add a second Attribution. Unfortunately, this "compromise" doesn't solve the fundamental ambiguities inherent in the current model and so leaves me entirely dissatisfied. You are free, as always, to dictate on this, but know that it was done over my strong objections.

EssyGreen commented 12 years ago

"Original" is a misnomer if it is a string since by definition it must have been interpreted by the transcriber. The only true value for "Original" would be the uri of a photographic and authenticated copy of the real artifact.

carpentermp commented 12 years ago

"Original," as we have defined it in the model, means that the user will supply exactly what they read from the image (or document). For example, if in a given name field they read "Wm" then that is exactly what they would put. If they also want to say that, in this case, "Wm" is really an abbreviation for "William", they would put that interpretation in the "interpreted" value.
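To make the "Wm" example concrete, here is a minimal sketch of the two values side by side. It assumes both are plain strings, as they are today; `FieldSketch` and `display_value` are hypothetical names made up for illustration and are not part of the GEDCOM X model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSketch:
    """Hypothetical stand-in for a record field; not the GEDCOM X Field class."""

    original: str                      # exactly what was read off the image
    interpreted: Optional[str] = None  # what the reader believes it means

    def display_value(self) -> str:
        # Prefer the interpretation when one exists, else fall back to the reading.
        return self.interpreted if self.interpreted is not None else self.original


# "Wm" is what the document says; "William" is what the user thinks it means.
given_name = FieldSketch(original="Wm", interpreted="William")

# A user who does not want to expand the abbreviation leaves "interpreted" empty.
unexpanded = FieldSketch(original="Wm")
```

The reading is never overwritten by the interpretation, so a later researcher can still see what was transcribed and judge the interpretation for themselves.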

Your comment seems to suggest that reading anything off of the image is an "interpretation". In one sense that is true, but deciding "what it says" is not the same as deciding "what it means". We needed a model that allows for both, but keeps both straight. We chose "original" and "interpreted" as the names for these two meanings. Would you like to suggest an alternative naming scheme?

EssyGreen commented 12 years ago

My point is that it can be very difficult to make out what the original actually said due to handwriting, creases in old documents, bad resolution, etc. The only way a researcher can be absolutely sure what the "original" was is to look at it.

I would suggest "Transcribed" (instead of "Original") and leave "Interpreted" as is.

stoicflame commented 12 years ago

processed has been renamed to formal at f4c35ed and will be included in 0.10.0.