FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Do Fields and Notes really need to be GenealogicalResources? #135

Closed EssyGreen closed 12 years ago

EssyGreen commented 12 years ago

One of the problems I found with the old GEDCOM model was the vagueness yet complexity of the Notes ... ie there was no indication what the NOTE might be used for, whether it was a description, a research note, public or private etc etc. 99% of the time it is used as pure text ... yet as a developer it had to be treated as a full blown record to allow for the fact that there might just be citations or REFNs or a RIN or whatever. GEDCOM 5.5.1.5 gave me some hope by deprecating citations against embedded NOTEs and thereby allowing simple text but GEDCOMX seems to have reverted to a complex heavy-weight object again. Whilst I appreciate that this makes for simpler modelling I think it's a cop-out and will add unnecessary complexity ... How many users will enter an attribution for every note they write? (Or every Field they enter come to that - since the humble Field also inherits from the GenealogicalResource!)

It seems that the model is trying to take inheritance to the Nth degree but in doing so you lose meaning and context (especially when you try to implement the data as a relational model) ... Ultimately virtually any data could inherit from an array of strings but it doesn't make it very useful.

stoicflame commented 12 years ago

I guess I'm having a hard time understanding the nature of your lament. There are only three properties of GenealogicalResource, and all three of them seem to make sense on both Field and Note. They have ids, you should be able to identify who made the note or who captured the field, and they should be extensible so applications can apply their own custom data to them.

How would you suggest simplifying in such a way to meet those needs?
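As a rough illustration of the three properties described above, the inheritance under discussion could be sketched as follows. This is only a sketch: the class and attribute names (`Attribution.contributor`, `extensions`, `text`, `value`) are assumptions made for the example and are not taken from the GEDCOM X specification.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Attribution:
    # Hypothetical shape: records who contributed the data.
    contributor: Optional[str] = None


@dataclass
class GenealogicalResource:
    # The three properties named in the thread: an id, an attribution,
    # and an open-ended bag of application-specific extensions.
    id: Optional[str] = None
    attribution: Optional[Attribution] = None
    extensions: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Note(GenealogicalResource):
    text: str = ""


@dataclass
class Field(GenealogicalResource):
    value: str = ""


# A run-of-the-mill note: pure text, with every inherited property unused.
note = Note(text="Baptism register page is water-damaged.")
print(note.id, note.attribution, note.extensions)
```

In this sketch a plain-text note still carries all three inherited slots even when none of them is populated, which is the overhead being questioned.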

EssyGreen commented 12 years ago

I guess I'm having a hard time understanding the nature of your lament.

LOL! How beautifully put @stoicflame :)

I think that one of my problems (as mentioned elsewhere) is that the object model is cluttered with the syntax stuff (e.g. RDF, FOAF, URIs etc) but taking that aside ...

They have ids

This implies that they can be cross-referenced and since we have custom extension elements they could be cross-referenced from anywhere and in any context yet without any understanding of what the context might be. Extremely flexible but no use whatsoever when importing/exporting.

you should be able to identify who made the note or who captured the field

I would have thought that anything where we need to know who created it, is either a source or contained within a source (where the implication is that the creator of the source is also responsible for the data within it). Could you give me some examples where this would not apply (ie where two (or more) fields in a source are created by different people/organisations, and yet are not able to be split into derived/sub-sources)?

they should be extensible so applications can apply their own custom data to them

Is it really necessary to encourage extensibility for every single field? (and hence for every field within the custom field ad infinitum)

To me what you are describing is more like a wiki ... and whilst I can see that an application may want to implement genealogy in that way, I don't think it should be essential.

When reading from a GEDCOMX file the application has to choose whether to retain (but ignore) unknown data or whether to reject it (and warn of the resultant data loss). Since genealogical information is key I have always preferred the retention route since this allows users the ability to use multiple applications with the same data. But unless the standard limits the customisations/unknowns it is impossible to retain data integrity (e.g. the user can inadvertently corrupt non-standard links between entities because the application does not understand what they mean and/or hides them from the user). So if every Field has the potential to be cross-referenced in an unknown way and to have limitless unknown components then the application has no option but to go down the reject route and lose the unknowns .... If the unknowns are necessarily going to be lost what was the point of having them there in the first place? The only time I can see it being possible to retain the unknowns is if the application was transferring data to a sister application which shared the same (unknown) standard .... in which case they wouldn't need GEDCOMX anyway 'cos they'd have their own standard.
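The retain-versus-reject choice described above can be sketched roughly as below. Everything here is hypothetical: the `KNOWN_NOTE_KEYS` set, the `import_note` function and the `x:custom` key are invented for illustration and do not correspond to any GEDCOM X API or file layout.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical: the only note properties this application understands.
KNOWN_NOTE_KEYS = {"id", "text"}


def import_note(
    raw: Dict[str, Any], policy: str = "retain"
) -> Tuple[Dict[str, Any], Dict[str, Any], List[str]]:
    """Split a raw note into understood data, retained unknowns and warnings."""
    known = {k: v for k, v in raw.items() if k in KNOWN_NOTE_KEYS}
    unknown = {k: v for k, v in raw.items() if k not in KNOWN_NOTE_KEYS}

    if policy == "retain":
        # Keep unknown elements untouched so they can be written back out.
        return known, unknown, []
    if policy == "reject":
        # Drop unknown elements and warn the user about the data loss.
        warnings = [f"dropped unknown element: {k}" for k in sorted(unknown)]
        return known, {}, warnings
    raise ValueError(f"unknown policy: {policy!r}")


raw_note = {"id": "n1", "text": "See parish register.", "x:custom": {"ref": "#p7"}}
print(import_note(raw_note, "retain"))
print(import_note(raw_note, "reject"))
```

Under the retain policy the `x:custom` reference to `#p7` survives the import, but the application cannot tell whether a later edit or deletion of `p7` silently breaks it. That is the integrity problem raised above.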

EssyGreen commented 12 years ago

If we are intent on having Notes as objects then could we please have a "Subject Heading" property so that we can at least summarise them meaningfully in a list and give the reader an indication of the subject without having to read through the whole note to see what it's getting at

thomast73 commented 12 years ago

@jralls stated (asked) the following on #181 (which he closed as a duplicate of this issue):

As currently defined Note extends GenealogicalResource, which provides an ID for referencing it from other objects and an Attribution, which adds an author (OK), a timestamp (OK), a confidence level (what?) and a proof argument (double-what?).

As I have been working on the model changes being discussed on #123 and #144, I have been discussing this question(s) with @stoicflame. I have similar questions.

The GenealogicalResource "proof argument" is maybe not well named. The intent is to capture the reason the "contributor" is contributing that particular piece of information (i.e., note, conclusion). If this were version control, it would be the equivalent of a "commit message". Some also look at it as being the "justification" for the contribution.

The "confidence level" is supposed to be about how confident the contributor is about the piece of information they are providing.

That being said, how ubiquitous should these two fields be?

@EssyGreen has expressed that for many a run-of-the-mill "note", this would be overkill, and I can see a strong argument for this.

For the EvidenceAnalysis object (see discussion in #144) the statement of confidence could be useful individually, but it is always subjective and therefore of limited value as a shared data item (in my opinion); and "commit message" would be useful if the resulting EvidenceAnalysis was developed collaboratively.

For conclusions (persons, facts, relationships, etc.), I can see an argument for stating a confidence. But again, I am bothered by the subjective nature of this statement. When I mix someone else's statements of confidence into my own data, it dilutes the meaning of my own statements. But it might be helpful in helping me evaluate someone else's statements? Also, the "commit message" aspect could be useful, but in most cases, the reason for the "commit" is "I found it to be so in such-and-such source", and the link to the source is sufficient to describe my "reason" for my contribution; so while it is applicable, for most cases it feels like a no-op. In a collaborative environment, however, it could be standard CONOPS -- required -- and therefore important in data exchange.

I can see places where this data would be useful and used, but I also see that for most existing data, these fields will be meaningless and that these fields will be unused/unpopulated. Not sure where to land on this!?!?

EssyGreen commented 12 years ago

When I mix someone else's statements of confidence into my own data, it dilutes the meaning of my own statements. But it might be helpful in helping me evaluate someone else's statements?

Methinks not a lot :)

For the EvidenceAnalysis object (see discussion in #144) the statement of confidence could be useful individually, but it is always subjective and therefore of limited value

Ack I nearly mis-read this so forget anything that flew by via email :)

I should have said ... "Yes I agree it has limited value" - but it is the weight/rating/confidence level of each of the individual pieces of evidence which is still of value (providing they have been provided by the same researcher) since these give a quick indication of the +ve or -ve contribution towards the conclusion.

For example:

Joe Bloggs Birth - 1811 London

E1 1861 Census +5 - Joe Blogs Birth - Calc 1811 Middlesex
E2 1841 Census -5 - J Bloggs Birth - Calc 1805 Surrey
E3 1812 Baptism +3 - Joseph Bloggs Baptism age 1yr Westminster

Rationale: The 1861 Census shows Joseph was born in Middlesex in 1811. This is confirmed by the 1812 baptism record in Westminster. However, there are 10 other Joseph Bloggs baptised between 1810 and 1815 in Middlesex and these have not yet been investigated. The 1841 Census appears to contradict this but it was common practice to round ages by up to 5 years at this time and the boundaries of Surrey and Middlesex moved in the 1850s.
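To make the example above concrete, here is a small sketch that holds the three weighted pieces of evidence and separates the supporting items from the contradicting ones. The `Evidence` type and its field names are hypothetical. The net total is an assumption added for illustration: the comment above only says the weights give a quick indication of positive or negative contribution, not that they should be summed.

```python
from typing import List, NamedTuple


class Evidence(NamedTuple):
    label: str    # E1, E2, E3 in the example
    source: str   # where the evidence was extracted from
    weight: int   # researcher's own +ve / -ve rating
    extract: str  # what the source says


evidence: List[Evidence] = [
    Evidence("E1", "1861 Census", 5, "Joe Blogs Birth - Calc 1811 Middlesex"),
    Evidence("E2", "1841 Census", -5, "J Bloggs Birth - Calc 1805 Surrey"),
    Evidence("E3", "1812 Baptism", 3, "Joseph Bloggs Baptism age 1yr Westminster"),
]

supporting = [e.label for e in evidence if e.weight > 0]
contradicting = [e.label for e in evidence if e.weight < 0]

# Assumption: a simple sum as a rough overall indicator.
net_weight = sum(e.weight for e in evidence)

print(supporting, contradicting, net_weight)  # ['E1', 'E3'] ['E2'] 3
```

The numbers are only meaningful relative to one another and to the researcher who assigned them. The written rationale still carries the actual argument.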

thomast73 commented 12 years ago

...it is the weight/rating/confidence level of each of the individual pieces of evidence which is still of value (providing they have been provided by the same researcher) since these give a quick indication of the +ve or -ve contribution towards the conclusion.

So rather than rate conclusions, rate evidence extracted from sources? Seems interesting! But "confidence" might be the wrong scale for rating evidence?

EssyGreen commented 12 years ago

So rather than rate conclusions, rate evidence extracted from sources? Seems interesting! But "confidence" might be the wrong scale for rating evidence?

Exactly so and I agree that "confidence" is not appropriate ... I would phrase it as "How much evidence does this source provide towards my theory?" - obviously need something snappier!

jralls commented 12 years ago

There are a bunch of different things that need to be evaluated: The quality of the source itself (contemporaneous, informant, purpose, legibility, etc.), the quality of the extracted evidence, the applicability of evidence to the hypothesis, and the state of "proved-ness" (to borrow from another issue) of each hypothesis. I'm not a big fan of numerical or enumerated ratings for this sort of thing; it applies a veneer of objectivity to what is a very subjective task. I think it makes a stronger proof argument to write out as a narrative argument what evidence you find more compelling and why.

I do recognize that most extant software includes those sorts of numeric or enumerated ratings to sources and to conclusions, and while I don't know that any of them make very effective use of it, I imagine that they'd like to be able to export the ratings somehow.

EssyGreen commented 12 years ago

There are a bunch of different things that need to be evaluated: The quality of the source itself (contemporaneous, informant, purpose, legibility, etc.), the quality of the extracted evidence, the applicability of evidence to the hypothesis, and the state of "proved-ness"

All true and I agree we don't have to make everything numeric but I think in this situation it is helpful because the numbers indicate the relative importance which cannot be done by words. This can be extremely useful to the researcher when in the process of reviewing and analysing and writing up their conclusions. It's a research tool intended to help the researcher reach their conclusions not an attempt to replace their write-up.

thomast73 commented 12 years ago

@EssyGreen has often bemoaned the complexity of making Note reference-able by many objects and she has not been alone in wanting a simple "in-line" version of Note. In reviewing model changes with @stoicflame yesterday, we had the following idea that we would like to propose:

We would like to change things such that Note is always "in-line" -- contained by a single model entity, and not reference-able by any other entities.

The new EvidenceAnalysis will continue to be reference-able by multiple entities.

It is my feeling that the reason for having "notes" reference-able by multiple entities is mostly for the "evidence analysis document" case. If we just have "evidence analysis" reference-able and make "notes" simple and in-line, would that be sufficient?
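A minimal sketch of the proposal above, assuming hypothetical class and attribute names (`Fact`, `notes`, `analysis_refs`) that are not taken from the specification: notes are owned by exactly one entity and carry no id, while an evidence analysis has an id and can be referenced from several entities.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Note:
    # In-line only: no id, so nothing else can point at it.
    text: str


@dataclass
class EvidenceAnalysis:
    # Reference-able: other entities refer to it by id.
    id: str
    text: str


@dataclass
class Fact:
    type: str
    value: str
    notes: List[Note] = field(default_factory=list)         # owned by this fact
    analysis_refs: List[str] = field(default_factory=list)  # EvidenceAnalysis ids


analysis = EvidenceAnalysis(id="ea1", text="Census and baptism agree on 1811.")

birth = Fact("Birth", "1811 London",
             notes=[Note("Census age may be rounded.")],
             analysis_refs=[analysis.id])
baptism = Fact("Baptism", "1812 Westminster",
               analysis_refs=[analysis.id])

print(birth.analysis_refs == baptism.analysis_refs)  # True
```

Both facts share the one analysis by id, whereas the note about rounding belongs to the birth fact alone and disappears with it.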

EssyGreen commented 12 years ago

Yay! (fainting in shock here! then dancing for joy! ... will run away quick before the avalanche of nay-sayers arrive)

jralls commented 12 years ago

It is my feeling that the reason for having "notes" reference-able by multiple entities is mostly for the "evidence analysis document" case.

Yes, I think that that's probably true. If other use cases come up we can consider them for a future revision.

thomast73 commented 12 years ago

Where do we stand on this issue?