FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

In need of a Source object #144

Closed EssyGreen closed 11 years ago

EssyGreen commented 12 years ago

I realise the infinite flexibility of inheriting everything from a Resource and hence allowing it to be considered a source but this also makes for infinite complexity and infinite nonsense!

To allow, for example, a DatePart to be used as evidence is nonsensical. Theoretically we could cite millions of documents with a DatePart of "Day=1" but it means absolutely nothing without the wider context of the Date, which means nothing without the wider context of the Fact.

I think we need a definite Source object (probably equating, or similar, to the Record) and it is this (not the Resource) which should be referenced in Citations and used in Evidence.

stoicflame commented 12 years ago

There already is a Source object, although we already recognize it's inadequate.

But I think what you want is the requirement for all source references to resolve to an instance of that thing?

jralls commented 12 years ago

Just a different facet of #146

EssyGreen commented 12 years ago

There already is a Source object

Forgive me for being dim but the link just shows the "Description" with all the DC meta data .... I get that the DC meta data is effectively attributes/properties of the "Description" but where in the model is the "Description" object (which you say equates to a Source)?

I think what you want is the requirement for all source references to resolve to an instance of that thing?

Er, yes. How can they not? Surely all references to the same source should be links to the same Source object?

stoicflame commented 12 years ago

Just a different facet of #146

Yeah, those were my thoughts, too. I'm trying to figure out the difference.

but where in the model is the "Description" object

Umm... sorry... I don't understand the question. Where in the model? It's part of the model...

How can they not?

You could refer to an image as a source. Or a multimedia file. Or a web page. Or anything else that can be identified with a URI.

Surely all references to the same source should be links to the same Source object?

Sure. But not everything cited as a source needs to be an instance of that same type.

jralls commented 12 years ago

Or anything else that can be identified with a URI.

And that is the problem. As the model is presently expressed every reference reduces to a URI. I could enter a standard place, or some persona, or some random Slashdot article as a source. OK, a Slashdot article might be a valid source. Unlikely but possible. The real problem would be an internal object, an easy error to create, and which might cause actual trouble (think a is a source of b is a source of c is a source of a = crash).
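The loop described here (a is a source of b is a source of c is a source of a) can be caught before it causes trouble. A minimal sketch, assuming the source references have already been collected into a plain dict mapping each object id to the ids it cites; the `find_source_cycle` name and the dict shape are hypothetical and not part of GEDCOM X:

```python
from typing import Dict, List, Optional


def find_source_cycle(cites: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of source references as a list of ids, or None.

    `cites` maps an object id to the ids it names as sources (assumed shape).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in cites}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for target in cites.get(node, []):
            if state.get(target, WHITE) == GREY:
                # target is already on the current path, so the loop is closed
                return path[path.index(target):] + [target]
            if state.get(target, WHITE) == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(cites):
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


loop = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(find_source_cycle(loop))  # ['a', 'b', 'c', 'a']
```

An application could run a check like this on import and report the offending ids rather than crash.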

EssyGreen commented 12 years ago

@stoicflame

There already is a Source object, [...]

I asked where and you gave me a link to a "Description" definition (http://www.gedcomx.org/model/dcterms_Description.html). Maybe it's just that the .Net version is broken or something, but there RDFDescription is never referenced anywhere, so I can only assume it's just the meta data for the whole GEDCOMX file ... ergo the file itself is the one and only source and all sources must be in separate GEDCOMX files referenced by their URIs.

If this is the case then, as @jralls says, any URI can be used as a Source, and if that is so then there is no guarantee that there is any useful data whatsoever at the related URI, since there is no guarantee it actually is a GEDCOMX file (and not just a jpg, a PC application, a link to a virus, etc).

stoicflame commented 12 years ago

And that is the problem. As the model is presently expressed every reference reduces to a URI.

Okay. How else would you suggest the serialization format (de)reference other objects? simple string?

The real problem would be an internal object, an easy error to create, and which might cause actual trouble (think a is a source of b is a source of c is a source of a = crash).

So that's an error in the data. Agreed.

How is that any different from any other data error cases that will need to be intelligently handled by the application? I can write an "I'm my own grandpa" loop, too.

stoicflame commented 12 years ago

ergo the file itself is the one and only source and all sources must be in separate GEDCOMX files referenced by their URIs.

No... that's not the intent of that at all. The intent of that object was to be akin to a Source object like you're asking for here in this thread. Obviously, that needs to be clarified, so I'll use this issue to track that work.

jralls commented 12 years ago

And that is the problem. As the model is presently expressed every reference reduces to a URI.

Okay. How else would you suggest the serialization format (de)reference other objects? simple string?

See #146. But:

The serialization isn't the model. The model describes the internal data structures that are serialized and that result when the stream is deserialized. So in the model references to other objects in the serial stream should be references to the class of those other objects, not the URI that is the proxy for the reference in the stream. Deserialization will have to be two passes: The first to construct all of the objects in the stream and the second to resolve the URIs into references and validate them.
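The two-pass idea can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dicts with `type`, `id`, `sources`), the `#id` style of local reference, and the `Source`, `Fact` and `load` names are all hypothetical, not the GEDCOM X serialization.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    id: str
    title: str


@dataclass
class Fact:
    id: str
    source_uris: List[str]                               # as read from the stream
    sources: List[Source] = field(default_factory=list)  # filled in by pass two


def load(records: List[dict]) -> Dict[str, object]:
    # Pass one: construct every object, leaving references as raw URIs.
    objects: Dict[str, object] = {}
    for rec in records:
        if rec["type"] == "Source":
            objects[rec["id"]] = Source(rec["id"], rec["title"])
        elif rec["type"] == "Fact":
            objects[rec["id"]] = Fact(rec["id"], list(rec["sources"]))
        else:
            raise ValueError(f"unknown record type: {rec['type']}")

    # Pass two: resolve each URI to an object and check that it is a Source.
    for obj in objects.values():
        if isinstance(obj, Fact):
            for uri in obj.source_uris:
                target = objects.get(uri.lstrip("#"))
                if not isinstance(target, Source):
                    raise ValueError(f"{obj.id}: {uri} does not resolve to a Source")
                obj.sources.append(target)
    return objects


records = [
    {"type": "Fact", "id": "f1", "sources": ["#s1"]},  # forward reference
    {"type": "Source", "id": "s1", "title": "Marriage certificate"},
]
objects = load(records)
```

After loading, the in-memory `Fact` holds a typed reference to a `Source`, and a reference that resolves to anything else is rejected instead of being carried around as an opaque URI.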

EssyGreen commented 12 years ago

The serialization isn't the model. The model describes the internal data structures that are serialized and that result when the stream is deserialized. So in the model references to other objects in the serial stream should be references to the class of those other objects, not the URI that is the proxy for the reference in the stream.

+++++++++1

jralls commented 12 years ago

Deserialization will have to be two passes: The first to construct all of the objects in the stream and the second to resolve the URIs into references and validate them.

If the file format includes an index of the objects with their types as @nealmcb is asking for in #140 the first pass wouldn't be necessary.

(Yeah, I'm talking to myself. ;-) )

EssyGreen commented 12 years ago

LOL!

stoicflame commented 12 years ago

(I know I'm attempting to revive a stale thread here.)

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

EssyGreen commented 12 years ago

Just catching up ... may take a while .... bear with me!

jralls commented 12 years ago

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

See #164 & #165

EssyGreen commented 12 years ago

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

I'm still wading through the amended spec having been AWOL for some months but as far as I can see there is still no source object - just a "SourceReference" with an id, type, description and attribution. If the same source is referenced multiple times then I presume these will also be replicated throughout the file creating a fragmented but inadequate source "record".

jralls commented 12 years ago

Over in #134 Sarah and I got going on Source analysis and its importance to good genealogical work.

In #156, Ryan expressed the mission of GedcomX:

The purpose of GEDCOM X has been stated as:

To define an open data model and an open serialization format for exchanging the components of the genealogical proof standard.

When we talk about "the components of the genealogical proof standard," we mean these:

  • Search Reliable Sources
  • Cite Each Source
  • Analyze Sources, Information, and Evidence
  • Resolve Conflicts
  • Make a Soundly-Reasoned Conclusion

Which also recognizes the importance of source analysis. How then should GedcomX record the source analysis? The logical answer to me is to have a proper Source object with a citation property (what's currently called a "Description") and an analysis property (which can be just a long string).
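As a rough sketch of that shape, assuming nothing beyond what is proposed in this comment: a `Source` with a citation property and a free-text analysis property. The class names and the individual citation fields are placeholders for illustration, not taken from the GEDCOM X spec.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Placeholder fields; a real citation would carry whatever the spec settles on.
    title: str
    creator: str = ""
    repository: str = ""


@dataclass
class Source:
    id: str
    citation: Citation  # what the spec under discussion calls a "Description"
    analysis: str = ""  # the source analysis, kept as just a long string
```

Keeping the analysis as a plain string makes the object trivial to serialize; structuring it further is a separate design decision.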

EssyGreen commented 12 years ago

Can you clarify what you mean by "source analysis"? Is this what I call an "interpretation", i.e. working out what is explicitly and implicitly detailed in a single source?

jralls commented 12 years ago

Source analysis has three phases:

It's important to not try to make connections to evidence from other sources while you're doing this analysis so that you don't read in something that isn't there -- or miss something that is -- because "pieces of the puzzle" seem to fit.

Yes, you could call it interpretation if you like, but "working out what is explicitly and implicitly detailed" leaves out the first part.

EssyGreen commented 12 years ago

Yes that's what I call interpretation :) I prefer to model this in the same way as in the Conclusion model rather than a single text description ... the only difference between them is the number of sources being analysed.

jralls commented 12 years ago

OK. How would you structure the Source class and how would you tie the elements into the conclusional Persons, Relationships, and Events?

EssyGreen commented 12 years ago

Briefly - simplistic syntax:

Source is top level entity with properties:

Quick example: Source 1: Marriage Cert for Fred Bloggs & Freda Jacobs, GRO ref 1234ABC etc etc

Source 2: Birth Cert for Fred Bloggs, GRO ref 1234XYZ etc etc

Source 3: Birth Cert for Fred Bloggs, GRO ref 1234ABC etc etc

Source 4: My Family Tree, author: Sarah Green other attributes of source etc etc

NB: Persons are contained within each source, not pointers to somewhere else

EssyGreen commented 12 years ago

Hmm actually that's not quite right ... The proof of Source 4 should actually be the proof for Fred (or if you like for Fred's birth event) not the proof of the whole tree! Apologies.

jralls commented 12 years ago

OK. Are the Persons, Relationships, and Events objects in their own right or are they just strings? Yes, I saw the note at the bottom about them being contained, but they could still be structures with elements like Name, Age, Sex, etc. for the Persons. Is "Evidence" the source-citation data (what the present spec calls the Description)? Should Sources 1 - 3 have an Event, since that's what they seem to describe? Shouldn't the relationships be "Person1 married Person2" (Source 1) and "Person3 child of Person1, Person3 child of Person2" (Sources 2 and 3)?

Isn't "Source 4" really a set of conclusion-model objects (Two Events, the marriage and the birth, 3 Persons, Fred, Freda, and Frederick, and 3 Relationships), each of which has the appropriate SourceReferences pointing back to Sources 1 - 3, along with the appropriate proof statements? (Here I'm assuming that you're not using your "My Family Tree" database as a source for some other database.)

Where would you put "Facts" (for example, the ages of the bride and groom from the marriage certificate)?

EssyGreen commented 12 years ago

Are the Persons, Relationships, and Events objects in their own right or are they just strings?

Objects - I just used strings to simplify syntax here

they could still be structures with elements like Name, Age, Sex, etc. for the Persons.

Yup - again simplified for brevity of example

Is "Evidence" the source-citation data (what the present spec calls the Description)?

Probably similar - tho' I don't really understand what the GEDCOM X Description is or what it's trying to do

Should Sources 1 - 3 have an Event, since that's what they seem to describe?

Yup they can have events but the evaluation in the example is being done against the whole set not just any one single event so as to retain the context (i.e. Fred's birth place etc is intrinsically linked to the Father's name and occupation within the source so I wouldn't want to be able to cite one and quietly drop the other).

Shouldn't the relationships be "Person1 married Person2" (Source 1) and "Person3 child of Person1, Person3 child of Person2" (Sources 2 and 3)?

I didn't actually detail the relationships for simplicity but in my world they would be Source 1 (man/wife with marriage event & roles etc), Source 2/3: ditto plus child/mother and child/father

Isn't "Source 4" really a set of conclusion-model objects

If you're happy for the definition of a Source to be "a set of conclusion-model objects" then yes

I'm assuming that you're not using your "My Family Tree" database as a source for some other database

Why would you assume that? That's sort of my point that it is a source (albeit one being changed as the research progresses)

Where would you put "Facts" (for example, the ages of the bride and groom from the marriage certificate)?

  • either as role of event or characteristics of person whichever you prefer :)

jralls commented 12 years ago

Is "Evidence" the source-citation data (what the present spec calls the Description)?

Probably similar - tho' I don't really understand what the GEDCOM X Description is or what it's trying to do

It's trying to use DC/RDF to construct a citation.

I think I see where you're trying to go: To collect several sources, extract the direct evidence for each of them, then treat them together in a proof argument to generate a second-level source (meta-source?) which you would then reference in the conclusion bits (e.g., the birth and marriage events).

I find the concept attractive. Is that what you meant?

EssyGreen commented 12 years ago

It's trying to use DC/RDF to construct a citation

Indeed but personally I don't give a fig about DC/RDF :)

To collect several sources, extract the direct evidence for each of them, then treat them together in a proof argument to generate a second-level source

Yes effectively ... The tree is a source and can be used elsewhere as one, say in another tree .. which is also a source ... ad infinitum

jralls commented 12 years ago

It's trying to use DC/RDF to construct a citation

Indeed but personally I don't give a fig about DC/RDF :)

Right, but you do care about good citations... at least you've said that you do. So, is a citation (title, creator, publication data or repository, date, etc.) what goes into your "Evidence" field?

The tree is a source and can be used elsewhere as one, say in another tree .. which is also a source ... ad infinitum

Well, the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

EssyGreen commented 12 years ago

Right, but you do care about good citations... at least you've said that you do. So, is a citation (title, creator, publication data or repository, date, etc.) what goes into your "Evidence" field?

The evidence object would be a pointer to the Source record (=GEDCOM X source "Description"?) which would contain the title, creator, publication etc (otherwise I find this becomes massively duplicated throughout a file). Hence the citation/evidence item itself only needs an optional equivalent to the GEDCOM "Where in Source" where the source is sufficiently large to warrant it.

the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

I would agree :) Though not sure how you would code up that constraint.

jralls commented 12 years ago

The evidence object would be a pointer to the Source record (=GEDCOM X source "Description"?) which would contain the title, creator, publication etc (otherwise I find this becomes massively duplicated throughout a file). Hence the citation/evidence item itself only needs an optional equivalent to the GEDCOM "Where in Source" where the source is sufficiently large to warrant it.

Oh, then it's a SourceReference. OK. I'd do that differently, embedding the citation in the Source object and use the SourceReference to point to the whole thing. There need be only one citation, one note analyzing the source's quality, provenance, and so on, and one extraction of persons, places, events, characteristics, etc. Might as well keep the whole thing together; everything that refers to it uses a reference.

the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

I would agree :) Though not sure how you would code up that constraint.

Nor am I. I suspect that it's not worth the effort, as anyone who published something containing self-references would be a laughingstock, and there's only so far one can go to protect novices from themselves.

EssyGreen commented 12 years ago

There need be only one citation, one note analyzing the source's quality, provenance, and so on, and one extraction of persons, places, events, characteristics, etc. Might as well keep the whole thing together; everything that refers to it uses a reference.

Not necessarily one citation since there could be different interpretations/hypotheses of the same evidence set. Hence each source could be used in multiple places and hence the need for a fat source record and a slim-line citation.
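The "fat source record, slim-line citation" split might look something like this. It is a sketch only: `SourceRecord`, `SlimCitation` and `where_in_source` are made-up names, the last one standing in for the GEDCOM "Where in Source" idea mentioned earlier in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    """The fat record: title, creator, publication etc. are stored once."""
    id: str
    title: str
    creator: str = ""
    publication: str = ""


@dataclass
class SlimCitation:
    """The slim citation: a pointer to the record plus an optional locator."""
    source: SourceRecord
    where_in_source: Optional[str] = None  # only needed for large sources


register = SourceRecord(id="S1", title="Example register")  # illustrative data
first_use = SlimCitation(register, where_in_source="p. 12")
second_use = SlimCitation(register)  # same record, cited again elsewhere
```

Both citations point at the same record, so the descriptive details are not duplicated however many times the source is used.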

jralls commented 12 years ago

Not necessarily one citation since there could be different interpretations/hypotheses of the same evidence set. Hence each source could be used in multiple places and hence the need for a fat source record and a slim-line citation.

Hmm. Sounds muddled. Try modeling your approach with different classes for first-level sources (#1 - 3 in your examples) and second-level sources (#4). I think the hypotheses will wind up in the second-level class, which won't have a citation but will have one-to-many SourceReferences (in the example, pointing back to #1 - 3).

EssyGreen commented 12 years ago

Try modeling your approach with different classes for first-level sources (#1 - 3 in your examples) and second-level sources (#4)

I'm not clear on your links here ... Are you saying that a "1st level source" is a Conclusion (#1)? and what in @carpentermp's post (#4) are you saying is a "2nd level source"?

jralls commented 12 years ago

I'm not clear on your links here ... Are you saying that a "1st level source" is a Conclusion (#1)? and what in @carpentermp's post (#4) are you saying is a "2nd level source"?

Sorry, that was a markdown misfire. I didn't mean to link to other issues, I was referring to source numbers 1 - 4 in your earlier comment. So source numbers 1 - 3 I'm calling "1st-level" because they each cite an external source document (a marriage certificate and two birth certificates) and source number 4 I'm calling "2nd-level" because it cites several 1st-level sources and analyzes them collectively.

EssyGreen commented 12 years ago

Thank goodness for that! I was getting really lost :)

I think the hypotheses will wind up in the second-level class, which won't have a citation but will have one-to-many SourceReferences

So still trying to follow you .. you're saying that the hypothesis (4) will wind up as a second-level class ( er ... well you've defined 4 as 2nd class so I guess I can't argue with that)

... and it won't have a citation ... er but it has 3 in the Evidence bag

... but it will have one-to-many SourceReferences ... er yes but these are the citations

Sorry but I'm just not following your argument here, could you re-phrase?

jralls commented 12 years ago

OK, try this:

Source

Synthesis

Conclusion

Persons, Relationships, Events, Places, and Dates extend Conclusion with additional properties.

I prefer "Synthesis" to "Hypothesis" because at some point one becomes satisfied that one has completed the requirements of the GPS and that those points are "proved". It also emphasizes that one is pulling together evidence from a range of sources.

The last bit, Conclusion, doesn't fit the present GedcomX model exactly because Persons and Relationships are separate classes from Conclusions, but for the purposes of this discussion I don't think the distinction is important.

Going back to your original example, I'd categorize the first three "sources" with the Source class and the fourth with the Synthesis class.

EssyGreen commented 12 years ago

Terminology is getting confusing ... here's my definitions (leaving aside code for a moment):

Source - something which holds data of interest to the researcher. Needs to contain details which describe the origin/provenance, author, publisher, owner's reference numbers, scanned images etc etc

Interpretation - a transformation (by someone) of the raw data in a single Source into meaningful information represented by one or more Persons, Relationships, Events and/or Characteristics. A Source may have none/many Interpretations (either made by different people or by the same person if, for example, the data is ambiguous). An Interpretation has a single Source (and hence is contained within it). An Interpretation is a "Derived" Source (because it transforms the original data) - if the interpreter is the researcher then this is somewhat superfluous but if not this is critical to identifying the source as secondary.

Hypothesis - a theory about one or more Persons, Relationships, Events and/or Characteristics made by a researcher as a result of their analysis of the Interpretations of multiple Sources. A Hypothesis with no Sources could theoretically exist but is just a fantasy of the author. If a Source has no explicit "Interpretation" then one must have been made implicitly by the researcher. A Hypothesis is a "Derived" Source (because it amalgamates or "synthesises" bits and pieces from a variety of sources).

Evidence - a collection of references to Sources (each of which contributes towards the probability of a Hypothesis being true/false) plus a verbatim Analysis of the logic/reasoning/assumptions/anomalies. A Hypothesis has one "Evidence bag". Evidence must be contained within a Hypothesis.
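These four definitions could be written down roughly as below. This is a sketch under assumptions: plain strings stand in for Person and Event objects, and the field names and the `is_fantasy` helper are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    interpreter: str
    persons: List[str] = field(default_factory=list)  # strings stand in for Person objects
    events: List[str] = field(default_factory=list)   # likewise for Event objects


@dataclass
class Source:
    title: str
    # Contained within the Source; there may be none or many.
    interpretations: List[Interpretation] = field(default_factory=list)


@dataclass
class Evidence:
    sources: List[Source]  # references to the Sources that bear on the hypothesis
    analysis: str = ""     # the reasoning, assumptions and anomalies


@dataclass
class Hypothesis:
    statement: str
    evidence: Evidence     # exactly one "Evidence bag", contained in the Hypothesis

    def is_fantasy(self) -> bool:
        # A Hypothesis with no Sources can exist but has nothing behind it.
        return not self.evidence.sources
```

Containment is modelled by holding the object directly (Interpretation inside Source, Evidence inside Hypothesis), while Evidence only refers to Sources that live elsewhere.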

jralls commented 12 years ago

OK, rewriting my set of structs above to use your terms:

Source

Interpretation

Hypothesis

Conclusion

EssyGreen commented 12 years ago

Thanks :) I'm with you now I think ...

I agree with your Source and largely with your Interpretation (tho I see an Interpretation as embedded within a Source rather than referencing it). I think your Hypothesis is different to mine and I can't see that your Conclusion/Synthesis is doing anything (it seems to have no data).

We disagree on the following points:

jralls commented 12 years ago

I separated out Interpretation to highlight your definitions from yesterday. It could just as easily be Source

My reasoning for separating direct and inferred evidence is that the original source may not always be available, and it's an important distinction. That's especially true in a collaborative situation, where one researcher might consult a source and, using GedcomX, report back to the team. I disagree with you that it's a matter of interpretation: If the evidence is clearly stated in the document (so-and-so is the HoH's stepfather in your example), that's direct evidence. Anything else is inferred (the HoH's natural father's death or divorce in your example). A careful researcher will document her reasoning for inferred evidence.

The Conclusion element was meant to stand in for the rest of GedcomX: the Persons, Relationships, Events, Places, and Dates that the program needs to have in structured form in order to generate charts and reports. If you're not going to use that part, you might just as well use Evernote and your favorite word processor -- a lot of professional genealogists do.

EssyGreen commented 12 years ago

My reasoning for separating direct and inferred evidence is that the original source may not always be available, and it's an important distinction

Yes I had considered that problem ... my approach would be to have a transcription field (or possibly a derived source which is the transcription) which would provide the "original".

A careful researcher will document her reasoning for inferred evidence.

I totally agree ... it's just a matter of how and where that is done. I would do it by providing either an image copy or a transcription as part of the source; and having a hypothesis for the deceased father with a statement quoting the original e.g. something like "Death of Bob: before 1851" - "In the census of 1851, Joey is referred to as the step-son of Fred Bloggs. Since divorce was rare and expensive at this time, it is probable that Bob had died before this date." I might add another hypothesis to cover the possibility of divorce if I felt it was worth further research or I might leave this until/unless further evidence was found which swayed the hypothesis one way or another. I believe this is clearer than a flag/code of "Direct/Indirect"

The Conclusion element was meant to stand in for the rest of GedcomX

Ah I see! I didn't realise that - wasn't attempting to negate the rest of GEDCOM X - just didn't understand you. ... So your Synthesis id = my Hypothesis id?

jralls commented 12 years ago

transcription field

The GDM had that, they generalized it to a "representation", which could be an image, a transcript, or an abstract, and a source could have more than one.
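A sketch of that generalization, assuming only what is said here (three kinds of representation, any number per source); the class and field names are illustrative and are not copied from the GDM:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RepresentationKind(Enum):
    IMAGE = "image"
    TRANSCRIPT = "transcript"
    ABSTRACT = "abstract"


@dataclass
class Representation:
    kind: RepresentationKind
    content: str  # the text itself, or a path/URI when the kind is an image


@dataclass
class Source:
    title: str
    representations: List[Representation] = field(default_factory=list)

    def of_kind(self, kind: RepresentationKind) -> List[Representation]:
        # A source may carry several representations of the same kind.
        return [r for r in self.representations if r.kind == kind]
```

A transcription field then becomes one `Representation` among several instead of a special case on the source.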

a hypothesis for the deceased father with a statement quoting the original

That really gets to the heart of it, I think. Inferred evidence needs some sort of a statement along with it. One could even go so far as to say that it doesn't belong in the source at all.

So your Synthesis id = my Hypothesis id?

Yes. While I agree with you about the permanence of genealogical conclusions (that they're not always conclusive), I take the view that one forms a hypothesis from the first round of research, then designs new research to try to confirm or refute it. Once one has completed the requirements of the GPS, it isn't a hypothesis anymore. No matter, so long as we can agree about what goes into each class then the class names don't matter a bit.

EssyGreen commented 12 years ago

I think we're pretty much in alignment

Inferred evidence needs some sort of a statement along with it. One could even go so far as to say that it doesn't belong in the source at all.

I'm happy for it to be in the Hypothesis which is where I see the bulk of the important analysis but to be honest it could go in any old note if the researcher wanted to annotate elsewhere.

Once one has completed the requirements of the GPS, it isn't a hypothesis anymore

I believe they differentiate between a Hypothesis and a Theory but it is just a matter of how much evidence there is so I don't see the need for separate structure.

jralls commented 12 years ago

I believe they differentiate between a Hypothesis and a Theory but it is just a matter of how much evidence there is so I don't see the need for separate structure.

No, the GPS doesn't use either term. Elizabeth Mills started using "hypothesis" in her research process lectures last year in the same way we've been using it here, but I've never heard any of the top lecturers use "theory" in any formal (as opposed to conversational) sense.

I'm not suggesting separating them, either. It's just the reason why I prefer the name "Synthesis" over "Hypothesis".

EssyGreen commented 12 years ago

ESM uses the two separately:

Hypothesis - a proposition based on an analysis of evidence at hand [...] Theory - a tentative conclusion reached after a hypothesis has been extensively researched [...]

(see EE p17)

Will have to agree to disagree over the term Synthesis :)

jralls commented 12 years ago

OK, always nice when Ms. Mills agrees with me! ;-) My copy of EE is at home in California, and I'm in Ireland, so it will be a couple of weeks before I can look at the reference.

Anyway, it's not important enough to disagree about. We agree about what goes where, now we need to get Ryan to do something with it.

thomast73 commented 12 years ago

Lots of good discussion here!

In GedcomX, there is a class in the model called Description. This issue is largely about what the Description class ought to look like.

We would like to modify the Description class to address some of the issues being raised here. In the coming posts, I hope to describe some modifications that are planned and get feedback relative to the discussion here and otherwise.

thomast73 commented 12 years ago

First, since some effort has been put toward definitions, I am going to give some of my own definitions in hopes you will have a better chance at being able to interpret what I am saying:

These may need to be refined a bit, but let's start there.

EssyGreen commented 12 years ago

@thomast73 - excellent summary :)

One tiny picky thing - "Extractions" would be better called "Abstractions" when represented as Persons etc (to distinguish from verbatim extractions which are usually referred to as "extracts")

jralls commented 12 years ago

The current "Description" (paragraph 3.1) isn't a class. It incorporates the RDF specification by reference, and RDF is (unfortunately) used extensively in GedcomX. You can't replace that with what we're discussing here without breaking the rest of GedcomX.

I don't like that you're conflating "conclusion" with "hypothesis". Conclusion is already a class in GedcomX, as are Person and Relationship (and, if Ryan ever gets around to committing #134, Event). Hypothesis is a separate step which combines multiple sources and which will provide a proof argument for one or more conclusion/relationship/event objects. It should be represented as a separate class, and Conclusion, Relationship, and Event objects should be able to use it as a source instead of a Source object.
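In outline, that separation might look like the following. It is a hedged sketch: the field names are invented, and the real GedcomX `Conclusion` class has more to it than is shown here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Source:
    title: str


@dataclass
class Hypothesis:
    proof_argument: str
    sources: List[Source]  # combines multiple sources


@dataclass
class Conclusion:
    statement: str
    # A conclusion may rest on a Source directly or on a Hypothesis built from several.
    supported_by: List[Union[Source, Hypothesis]] = field(default_factory=list)
```

The point of the union is that a Conclusion can cite either kind of support without the two being collapsed into one class.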

Yes, Extraction/Abstraction/Evidence/Interpretation can get sticky, and it's made stickier by re-using the top-level object names Person, Relationship, and Event. The GDM addressed that stickiness in part with the "Persona" concept, which is used in the Record Model.

These may need to be refined a bit, but let's start there.

That is rude. Why should we re-start a 5-month-old discussion just because you've finally decided to join in?