FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Mapping GEDCOMX to the process model #141

Open EssyGreen opened 12 years ago

EssyGreen commented 12 years ago

As a sanity check it is worth checking that GEDCOMX fulfills the needs for the research process. In an attempt to prevent too much debate on the definition of the research process I am citing the model certified by BCG & ESM (see this link http://www.olliatauburn.org/download/genealogy_research_map.pdf)

How well does GEDCOMX support the data described?

  1. Research Goals: Plan, Log, Statement, Question, Hypothesis
    • Not at all. Assumed to be beyond the scope?
  2. Sources: Originals (including direct copies) and Derivatives (transcripts, extracts and abstracts)
    • No way of specifying original vs derivative - see #136 - (or the particular types of either) beyond DC's generic terminology, and hence weakening of provenance.
    • Need a non-abstract class for a Source with specific attributes/properties beyond DC
  3. Information (primary & secondary)
    • =Personas/Facts/Relationships etc in the Record Model? No support for Primary vs Secondary beyond DC's generic terminology
    • Need specific attributes/properties for Primary/Secondary
  4. Citation (link between information and source(s))
    • =ResourceReference? In development (see #126). Seems to be confused with Evidence.
  5. Evidence (direct, indirect, negative)
    • =Persons/Facts/Relationships etc in the Conclusion Model? No Evidence object defined - all are considered to be Conclusions. No concept of hypotheses to enable conflict resolution.
  6. Proof argument
    • =Attribution. Minimally covered (as a text field + Confidence level) but is applied to anything (rather than referring to a particular set of evidence in a particular context) so ambiguous.
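The original/derivative and primary/secondary distinctions called for in points 2 and 3 could be modelled as explicit attributes. A minimal sketch, assuming hypothetical class and field names (none of these exist in GEDCOM X):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SourceForm(Enum):
    """Hypothetical: original vs. derivative, per the research process model."""
    ORIGINAL = "original"
    DERIVATIVE = "derivative"   # transcript, extract, abstract, etc.

class InformationKind(Enum):
    """Hypothetical: primary vs. secondary information."""
    PRIMARY = "primary"
    SECONDARY = "secondary"

@dataclass
class Source:
    title: str
    form: SourceForm
    derived_from: Optional["Source"] = None  # provenance chain for derivatives

@dataclass
class InformationItem:
    text: str
    kind: InformationKind
    source: Source  # the citation: the link from information back to its source
```

The point of the sketch is that `derived_from` preserves the provenance chain that a generic DC vocabulary loses.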
lkessler commented 12 years ago

I think GEDCOMX should also support all aspects of the Genealogical Workflow as presented by Ron Tanner at RootsTech 2012: http://s3.amazonaws.com/rootstech/original/Ron%20Tanner_Report%20Card.pdf?1326143168 http://s3.amazonaws.com/rootstech/original/GeneaWorkFlow_public.pdf?1328546566

There's probably a lot in common between the Genealogy Research Process and this, but I'm sure the two nicely augment each other.

Louis

EssyGreen commented 12 years ago

@lkessler - both links are based on the GRP so yes I agree should be supported but ...

@stoicflame - could we have some clarity on whether GEDCOMX is intending to provide a minimal or best practice model? This was touched on in #138 - the two goals are quite different I think - either a minimum which must be adhered to (tho' without a regulatory body I'm not sure how the 'must' is ever enforced so I guess it has to be 'should') or the "best in class" which applications should strive to achieve.

Also, you said in #138 that the goal was:

to define a model and serialization format for exchanging the components of the proof standard as specified by the genealogical research process [...] in a standard way.

The proof standard is not quite the same thing as the process model (tho' the two are obviously complementary). The GPS consists of five elements: a reasonably exhaustive search; complete and accurate source citations; analysis and correlation of the collected information; resolution of conflicting evidence; and a soundly reasoned, coherently written conclusion.

Sorry if I'm sounding picky here but the Proof Standard in its simplest form could just be represented by tacking a ProofStatement and a Bibliography onto the Record Model, whilst the Process Model is more specific and covers the whole range of research activities not just the proof at the end.

To put it another way, should we focus on exchanging/standardising the publication of genealogical data (conclusions at the end of the process) or should we focus on exchanging/standardising the sharing/transfer of genealogical data (all data during and throughout the process).

I'm just trying to understand the scope - sorry if it's tedious.

EssyGreen commented 12 years ago

Some use cases may help illustrate:

  1. A researcher has just completed a genealogical project on behalf of a client. A full PDF report with all the data has been given to the client but the client would also like an electronic version so that they can continue their own research using a software package of their own choosing.
  2. A researcher who has been working alone on a desktop software package would like to work collaboratively on-line with other relatives. He/she wants to transfer all the existing data to the new system where it can be shared with the other researchers (but not made public).
  3. Someone who has completed a piece of research wants to publish it on-line with multiple genealogy repositories so that it is available to the general public as a source in its own right. To avoid breaches of copyright, privacy and plagiarism, the details of notes, sources, research goals and copyright material etc must not be published - only the bibliographic references to the sources, footnotes etc.
  4. Someone trying to trace their ancestors wants to get in touch with others who may share common roots. They want to upload their data to as many on-line repositories as possible to get maximum coverage but they want to make sure people contact him/her rather than just copying his/her data.
  5. A person is new to genealogy and not sure how to go about it. They want to be sure that the software they choose will help them establish good habits founded on best practice.
  6. An established genealogy software provider goes bust. The customers want to switch to alternative applications without losing any data.
  7. A researcher receives a GEDCOMX file from a potential relative. He/she wants to examine the details of the file to see if they really are connected.
  8. A researcher wants to selectively publish their findings as GEDCOMX Records as/when they reach certain conclusions whilst keeping their ongoing stuff private.

Which of the above can/should GEDCOMX Conclusion Model be trying to address?

jralls commented 12 years ago

An excellent set of use cases. Good work!

EssyGreen commented 12 years ago

Many thanks :) I wondered if I was just getting too tedious!

EssyGreen commented 12 years ago

OK here's my take on answering my own questions:

I hasten to add that this is my head speaking ... my heart longs for a way to miraculously pump my cherished data through a magic machine and get it into whatever software I fancy with all the data, links and context intact (and preferably transformed in the "better" way supported by the new software). Sadly this pipe dream just leads to disappointment when I wake up in the real world.

To summarise, I would conclude that the Conclusion Model should focus on a clear and simple data structure which can be interpreted either end of the transfer as unambiguously as possible.

(In concluding this I should go back and retract many of my posts since I have tended to focus on the 'best practice' rather than minimalistic! Yes, I'm shooting myself in the foot here!)

EssyGreen commented 12 years ago

PS: Just to shoot myself in the other foot ... I suspect the "minimalist" model = the Record Model and hence reverses my vote for #138

stoicflame commented 12 years ago

@EssyGreen you've done a great job putting together these thoughts and use cases. I don't think you're being tedious at all.

The issues you bring up are really tough to answer, but in the end I think I arrive at the same place that you articulated:

I would conclude that the Conclusion Model should focus on a clear and simple data structure which can be interpreted either end of the transfer as unambiguously as possible.

Which seems to imply a "minimalist" approach for this first version. But it still needs to be flexible enough to provide for future standards that will fill in more aspects of that "magic machine" with "all the data, links and context intact".

In addition to addressing extensibility concerns, we know that the "minimal" standard needs to address more than what legacy GEDCOM does today. Our task is to identify and address what else is minimally needed and provide for it "as unambiguously as possible".

I suspect the "minimalist" model = the Record Model and hence reverses my vote for #138

Actually, I sincerely think the conclusion model is a better fit for this. The record model as it's defined today attempts to deal with some very narrowly-focused subtleties of field-based record extraction and hence has a bunch of stuff that doesn't really fit in this "minimalist" model. Date parts (see issue #130) are a great example of that.

EssyGreen commented 12 years ago

@stoicflame - many thanks for the positive feedback :) I have a couple of points related to your reply:

we know that the "minimal" standard needs to address more than what legacy GEDCOM does today

I'm not sure I agree with you there (tho' some examples may make me change my mind!) ... I think in some ways old GEDCOM attempted to achieve too much and hence ended up with aspects that applications wanted to treat differently but felt they couldn't because of the GEDCOM structure. A clear example of this I think is the PLACe structure ... by making it an embedded structure and including sub-elements it was difficult to convert this to/from a high level Place object without added complexity on import and data loss on export. We've solved this one in GEDCOMX (I think) by making it a record-level element but could fall into the same trap elsewhere. A similar problem happened with the little-used ROMN and FONE sub-elements which were quickly outdated by more advanced phonetic techniques and yet hung around in the sub-structures making the GEDCOM PLACe and NAME structures unnecessarily unwieldy. Conversely I would argue that over-use of the NOTE record links (e.g. alongside CALN call numbers) created an unnecessarily "stringy" structure.

In summary, I think that the flatter the structure (within reason) the more flexible it is ... long trails of sub-elements are more likely to be problematic, especially in relational data scenarios.
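The embedded-vs-referenced distinction being argued here can be shown with two plain data shapes (illustrative dicts only; the keys are made up for the example, not GEDCOM X names):

```python
# Embedded, GEDCOM 5.5 style: the place data is nested inside every event,
# dragging along sub-elements (ROMN/FONE) whether or not they are wanted.
event_embedded = {
    "type": "Birth",
    "place": {
        "name": "Bethnal Green, London, England",
        "phonetic": None,   # FONE-style baggage
        "romanized": None,  # ROMN-style baggage
    },
}

# Flatter, referenced style: events point at one top-level place record,
# so a high-level Place object round-trips without loss or duplication.
places = {"p1": {"name": "Bethnal Green, London, England"}}
event_referenced = {"type": "Birth", "place_ref": "p1"}
```

With the referenced shape, enriching the place (coordinates, jurisdiction history) touches one record instead of every event that mentions it.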

I sincerely think the conclusion model is a better fit

You may be right ... to be honest my .Net version of the model is a bit of a mess so it's really hard to see what's in what. I've been hoping for a pull request to get a clearer/new model? Have I missed one or is it still in limbo (or should I go back to using eclipse/java)?

lkessler commented 12 years ago

EssyGreen said:

To summarise, I would conclude that the Conclusion Model should focus on a clear and simple data structure which can be interpreted either end of the transfer as unambiguously as possible.

Sounds like GEDCOM with a few tweaks. :-)

stoicflame said:

In addition to addressing extensibility concerns, we know that the "minimal" standard needs to address more than what legacy GEDCOM does today. Our task is to identify and address what else is minimally needed and provide for it "as unambiguously as possible".

That works for me too.

Louis

EssyGreen commented 12 years ago

Sounds like GEDCOM with a few tweaks

Maybe ... @stoicflame - do you have a list of the good and the bad things about old GEDCOM so we can retain the good and get rid of the bad? If not, is it worth brainstorming?

stoicflame commented 12 years ago

Sounds like GEDCOM with a few tweaks.

It does kind of sound like that, huh? I guess it kind of depends on what you think legacy GEDCOM primarily was. If you think it was a definition of a model for evidence information and a way to encode it, then I agree that this project sounds a lot like GEDCOM with a few tweaks. But if you consider the syntax of a GEDCOM file as being a major part of the spec, then this project doesn't sound like "GEDCOM with a few tweaks".

In other words, I think one of the primary goals of this project is to overhaul the foundational technologies of GEnealogical Data COMmunications. This will enable the genealogical IT community to collaboratively, iteratively, and cleanly integrate the latest trends in application development.

So even though the conceptual scope of GEDCOM X 1.0 won't be a huge revolution, the remodel of the infrastructure will be a big step forward for the community.

In response to the original purpose of this thread, I think the initial scope of this project needs to be limited to the "Cite" and "Analyze" sections of the genealogy research map that @EssyGreen referenced. These are the sections that we're most familiar with sharing and exchanging via legacy GEDCOM, so the focus there has the biggest chance of success. As much as possible, the standard needs to supply well-defined integration points for the other sections of the process model that will be addressed by future efforts.

Right now, we're working on refactoring the project so that these concepts are clearly articulated at the project site. This effort includes the proposal outlined at #138. We hope this will be a big improvement to the project and we're anxious to get these changes applied for everybody to see.

stoicflame commented 12 years ago

do you have a list of the good and the bad things about old GEDCOM so we can retain the good and get rid of the bad? If not, is it worth brainstorming?

I don't have a definitive list, no. We should probably pull together that list from a lot of different sources, including this issue forum, the BetterGEDCOM wiki, etc. We should also proactively request community help to pull together that list. I think a brainstorm is a good idea, but I'm struggling with the best way to facilitate that. I worry that creating a new thread would get too noisy with everybody commenting on everybody else's comments. And that would stifle those who have something to say but don't want to be subject to community scrutiny.

What if I created a web form that people could fill out and submit? I'd broadcast its availability, gather all the comments, and post them somewhere so everybody could see the results without knowing who submitted them. There are some people that I consider legacy GEDCOM experts that I'd be especially anxious to see contribute....

Thoughts?

EssyGreen commented 12 years ago

What if I created a web form that people could fill out and submit? I'd broadcast its availability, gather all the comments, and post them somewhere so everybody could see the results without knowing who submitted them

Sounds like an excellent plan!

EssyGreen commented 12 years ago

I think the initial scope of this project needs to be limited to the "Cite" and "Analyze" sections of the genealogy research map

Initial scope maybe but I think the whole process needs to be covered albeit in a simple form. For example, a simplistic inclusion of "Goals" could be a "ToDo" (=Research Goal) object (top level entity) with:

Plus an (optional) "ToDo" list of links included in each Person (representing the subject of the goal)

A listing of all ToDos in CreationDate order represents the ResearchLog.

This seems pretty simple to me but maybe I'm falling back into the "best practice" rather than the "simplistic" approach again.
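The "ToDo = Research Goal, sorted list = Research Log" idea above can be sketched in a few lines (hypothetical entity and field names, purely illustrative):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ToDo:
    """Hypothetical top-level 'research goal' entity, as proposed above."""
    goal: str
    created: date
    person_refs: list = field(default_factory=list)  # subjects of the goal
    completed: bool = False

def research_log(todos):
    """A research log is simply every ToDo listed in creation-date order."""
    return sorted(todos, key=lambda t: t.created)
```

Each Person would then carry an optional list of links to its ToDos, and no separate log entity is needed at all.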

EssyGreen commented 12 years ago

Re the other end of the process (Proof/Resolve/Conclude) ... In my experience there has been a growing awareness of the need for evidence-based genealogy rather than just "citing" sources and I think some form of inclusion would add credibility to the model and get a greater chance of GEDCOMX's acceptance. But it's a complex area so needs to be pared down to a simple form.

lkessler commented 12 years ago

Sounds like GEDCOM with a few tweaks.

It does kind of sound like that, huh? I guess it kind of depends on what you think legacy GEDCOM primarily was. If you think it was a definition of a model for evidence information and a way to encode it, then I agree that this project sounds a lot like GEDCOM with a few tweaks.

Current GEDCOM is a way to store and transfer genealogical conclusions. It also has inclusion of sources and source detail data, but only when used as evidence from the point of view of the conclusions.

But if you consider the syntax of a GEDCOM file as being a major part of the spec, then this project doesn't sound like "GEDCOM with a few tweaks".

No, I don't see the syntax being a major part of the spec. We could take the existing GEDCOM and transfer it mechanically into XML, JSON, or whatever. We could also take the GEDCOM X spec and translate it into the GEDCOM syntax.

The content is all important. The syntax is not. Using a standard syntax potentially gives programmers and users more tools to use. Simple translators would be easy to write to convert GEDCOM X in one syntax to another.

But simple translators to convert to and from GEDCOM 5.5.1 will be essential. If the conclusion data model of GEDCOM X is only "tweaked" from GEDCOM 5.5.1, then the transfer of the data that GEDCOM 5.5.1 can accept will be possible. However, if the conclusion data model of GEDCOM X is rebuilt, then the transfer will not be possible and the genealogical community will have a problem.

Louis
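The "mechanical" syntax translation Louis describes really is straightforward: the level numbers alone determine nesting, so GEDCOM lines fold into a JSON-ready tree. A minimal sketch (helper name is made up; xref ids are treated as ordinary tags for simplicity):

```python
def gedcom_to_tree(lines):
    """Mechanically fold 'LEVEL TAG [value]' GEDCOM lines into a nested
    dict tree that could be dumped as JSON or XML unchanged."""
    root = {"tag": "ROOT", "value": None, "children": []}
    stack = [root]  # stack[n] is the current node at level n-1
    for line in lines:
        parts = line.strip().split(" ", 2)
        level = int(parts[0])
        node = {"tag": parts[1],
                "value": parts[2] if len(parts) > 2 else None,
                "children": []}
        stack[level + 1:] = []                 # pop back to the parent level
        stack[level]["children"].append(node)  # attach under that parent
        stack.append(node)
    return root
```

The reverse direction (tree back to level-numbered lines) is an equally mechanical walk, which is exactly why the syntax itself was never the hard part.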

jralls commented 12 years ago

However, if the conclusion data model of GEDCOM X is rebuilt, then the transfer will not be possible and the genealogical community will have a problem.

That's rather overstating the case. If the conclusion data model is substantially different from GEDCOM's, the translation may be more complicated and lossy, particularly going from GedcomX to GEDCOM. It won't be impossible.

The genealogy software community (not the genealogy community, most of which doesn't actually care about the details but is utterly frustrated with the present lack of interoperability between mainstream programs) already has this problem: Few mainstream programs have internal data models that map well to GEDCOM, and their inadequate translation efforts are one of the main sources of that user frustration. The greater problem for GedcomX isn't what should or shouldn't be in its data model, it's that none of the mainstream program vendors are participating.

EssyGreen commented 12 years ago

@jralls - excellent points!

The greater problem for GedcomX isn't what should or shouldn't be in its data model, it's that none of the mainstream program vendors are participating.

I have to say this is something that's bothered me ... hands up anyone here from Family Tree Maker, RootsMagic, Master Genealogist, ReUnion, FamilyHistorian etc etc? Are you lurking or absent?

EssyGreen commented 12 years ago

@lkessler

But if you consider the syntax of a GEDCOM file as being a major part of the spec, then this project doesn't sound like "GEDCOM with a few tweaks".

No, I don't see the syntax being a major part of the spec. [...] The content is all important. The syntax is not.

I totally agree with you on this point ...

simple translators to convert to and from GEDCOM 5.5.1 will be essential. If the conclusion data model of GEDCOM X is only "tweaked" from GEDCOM 5.5.1, then the transfer of the data that GEDCOM 5.5.1 can accept will be possible. However, if the conclusion data model of GEDCOM X is rebuilt, then the transfer will not be possible and the genealogical community will have a problem.

... however, here I disagree ... to limit the scope of GEDCOMX to GEDCOM 5 with a few tweaks would be worthless. The problem with GEDCOM has never been the syntax (it's about as simple as you can get), it's the content (as you say above). Yes, we will need to provide a migration path from 5 to X but this should not be the goal of GEDCOMX. The goal should be to improve the data content and structure to be more in-line with the needs of the user community (which in turn should be more in-line with the needs of the software industry). Ergo map the process model but do it in a simple way that can be implemented in different ways by different software vendors.

PrestonEstep commented 12 years ago

@EssyGreen "I have to say this is something that's bothered me ... hands up anyone here from Family Tree Maker, RootsMagic, Master Genealogist, ReUnion, FamilyHistorian etc etc? Are you lurking or absent?"

Some are lurking, some are absent, some just flat don't care.

EssyGreen commented 12 years ago

@stoicflame - I noticed you put up that web-link for peeps to comment on GEDCOM strong/weak points .... any feedback yet?

stoicflame commented 12 years ago

any feedback yet?

Yes, thanks for reminding me. I need to get that posted.

stoicflame commented 11 years ago

@EssyGreen I finally got around to compiling the responses we got from the little poll we took:

GEDCOM 5.5 Deficiencies

EssyGreen commented 11 years ago

Brilliant! So now we have something to judge GEDCOM X against ... has it resolved these problems/addressed the deficiencies? Which areas do we need to tweak/adjust?

stoicflame commented 11 years ago

has it resolved these problems/addressed the deficiencies?

A lot of them, yes.

Which areas do we need to tweak/adjust?

Maybe that's the next step here? How do you think we should publish that information? Maybe add to that page a table with notes on how (or whether) GEDCOM X intends to address those issues?

alex-anders commented 11 years ago

@stoicflame 'GEDCOM strong/weak points'

So where is the strong point list??

EssyGreen commented 11 years ago

@alex-anders - good point!

@stoicflame

Maybe that's the next step here? How do you think we should publish that information? Maybe add to that page a table with notes on how (or whether) GEDCOM X intends to address those issues?

Yes - definitely the next step and agree with your suggestion

stoicflame commented 11 years ago

So where is the strong point list??

Good point. We didn't gather those. My apologies. What should we do to remedy that?

EssyGreen commented 11 years ago

Add 'em on to the same "Deficiencies" page?

stoicflame commented 11 years ago

Add 'em on to the same "Deficiencies" page?

That implies we've got 'em.

I'll have to set up another request for feedback....

EssyGreen commented 11 years ago

Ah! oh! Ooops!

EssyGreen commented 11 years ago

Here's my take on how GEDCOM X rates against the GEDCOM 5 deficiencies:

lkessler commented 11 years ago

The GEDCOM Deficiencies are a little unfair to GEDCOM. Here are my comments:

• Can't separate conclusions from evidence - I believe GEDCOM X has done this with the separation of the Conclusion model from the Record model. And I like that!

• No support for independent place entities - Yes, a deficiency of GEDCOM. But I don't see how that is done in GEDCOM X which does not have place as a top-level record.

• No support for multi-role event entities. - Do we really want to complicate our lives with this?

• Lack of support for many of the most common data items. - Hard to define what they are, but then it's just as easy to include them in GEDCOM as it is to include them in GEDCOM X.

• No support for negative evidence. - Very simple to fix in GEDCOM or in GEDCOM X. In GEDCOM X, the CONFIDENCE_LEVEL currently is: [ Certainly | Probably | Possibly | Likely | Apparently | Perhaps | OTHER ]. Just add: [ Certainly Not | Probably Not | Possibly Not | Likely Not | Apparently Not | Perhaps Not ]

• Lack of support for multimedia; for formally-structured citations. There's nothing in GEDCOM preventing these from being added.

• No formal policy for managing, processing, and specifying vendor extensions. - Allowing extensions is VERY dangerous. True, there's no formal policy for managing the extensions. But is that supposed to be part of the standard? If so, that should be done ASAP for GEDCOM and we should set an authority to manage it. Maybe that authority could also get vendors to implement their GEDCOM correctly. I'm curious if this is going to be a part of GEDCOM X and if so, will FamilySearch staff be the police?

• Requires sequential processing; the file must be processed entry-by-entry, one at a time. - I don't know why Essy called this DONE. XML is no different than GEDCOM. It is a flat file. It is only if you add indexing of records that you avoid sequential processing. This requires a prior pass of the data and can be done just as easily with XML as with GEDCOM. Is it even valid to call this a deficiency of GEDCOM? GEDCOM was actually designed with INDI and FAM records so that developers could (in the old days with 32 KB memory block limits) randomly access the data.

• Requires inefficient processing; the entire file must be processed altogether and you can't process the file in pieces. - Again, like the last point, I don't see this as a deficiency of GEDCOM. GEDCOM can be processed in pieces. Being simple text, it can be scanned quickly to index the records, and then only the records needed need be processed. That's no different than an XML file that is in pieces but zipped together as GEDCOM X is.

• Lack of reference examples and recipe books - I agree the examples in GEDCOM are poor, and some don't even follow the standard. :-(

• Lack of shared processing code. - Not a deficiency of GEDCOM. There are many GEDCOM libraries. How about Dallan Quass' library, which is being used by the GEDCOM X conversion tool.

• Lack of validation and conformance tools. - Not a deficiency of GEDCOM. There are many GEDCOM validators out there.

• No support for formal data types and templates for processing text (e.g. names) - GEDCOM sort of "defined" the data types back then. 15 years later, there's a whole new set of "authorities" out there. This is a trade-off between simplicity and standardization. Using formal data types for everything is a drastic change and will cause existing genealogy software vendors much pain.

• Too narrow modeling constraints (e.g. same-gender couple relationships). - There are a hundred of these sorts of things that can be listed as problems with GEDCOM. But almost all are almost trivial to fix and certainly don't require a complete rewrite to do so. For the same-gender example, simply take off the requirement in the GEDCOM standard that they be of opposite sexes and allow a SPOU tag, rather than the HUSB and WIFE tags. Use the SEX on the individual to determine if it is a husband or wife or same-sex couple. However, I like the idea of getting rid of FAMily in GEDCOM and replacing it with Relationships in GEDCOM X.

• Not enough emphasis on non-family relationships and associations - Need to add a GROUP record. Then the Relationship can be between an individual and a group. GEDCOM X doesn't have a GROUP yet, but needs one.

• Fragmented vendor adoption leading to poor interoperability. - What?? Vendor adoption was nearly 100%. That isn't the problem. Poor interoperability arose because vendors didn't implement the standard correctly.

• No standard way to indicate there were no children in a marriage. - Not a deficiency of GEDCOM. GEDCOM has the NCHI tag under the FAM record.

• Over-specification, overuse of rarely-used fields. - Actually, I think GEDCOM was specified very well. GEDCOM X is NOT specified well yet and is very difficult to grasp. If anything, underuse of rarely-used fields is the GEDCOM problem, not overuse.

• Lack of support for referential integrity (inter-entity links are disjointed and ambiguous). - That's not been a problem in GEDCOM. Pretty well all programs correctly maintain and export the links. It's only manually edited GEDCOMs that seem to have problems.

• Poor balance between inline vs. referenced data? If repositories are to use the GEDCOM X Record model, then we will need some way to reference that.

If you take the list of GEDCOM deficiencies, and strike off the few I think are wrong, and discount any that would be easy to fix, there's not much left.

Louis
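Louis's point about a prior indexing pass is easy to demonstrate: one scan over the file recording the byte offset of each level-0 record gives random access afterwards. A sketch with made-up helper names (not part of any GEDCOM tooling):

```python
def index_records(path):
    """One pass over a GEDCOM file, recording the byte offset of every
    level-0 record keyed by its xref id (e.g. @I1@)."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            parts = line.decode("utf-8", "replace").split()
            if len(parts) > 1 and parts[0] == "0" and parts[1].startswith("@"):
                index[parts[1]] = offset

    return index

def read_record(path, index, xref):
    """Seek straight to one record without re-reading the whole file."""
    lines = []
    with open(path, "rb") as f:
        f.seek(index[xref])
        lines.append(f.readline().decode().rstrip())
        for raw in f:  # collect lines until the next level-0 record starts
            text = raw.decode().rstrip()
            if text.startswith("0 "):
                break
            lines.append(text)
    return lines
```

After the one indexing pass, any record is a single `seek` away, which is the "random access" the INDI/FAM record design was built for.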

EssyGreen commented 11 years ago

The GEDCOM Deficiencies are a little unfair to GEDCOM

Maybe, but it's real feedback from real people whose opinion Ryan values (or at least that's what it was intended for). Our intention here was to ensure GEDCOM X was going to correct these rather than to re-evaluate GEDCOM 5.

No support for independent place entities - Yes, a deficiency of GEDCOM. But I don't see how that is done in GEDCOM X which does not have place as a top-level record.

Ack you are right! I thought we'd resolved that one ages ago

Not enough emphasis on non-family relationships and associations - Need to add a GROUP record. Then the Relationship can be between an individual and a group. GEDCOM X doesn't have a GROUP yet, but needs one.

I disagree with this approach ... I don't see how this helps. All we need is for the Relationship entity to allow any/"Other" types of relationship between two people.

Poor interoperability is because vendors didn't implement it correctly and that is what caused poor interoperability

OK, so why do you think that was?

If anything, underuse of rarely-used fields in the GEDCOM problem, not overuse.

I disagree. If something is rare then it doesn't need to be in the base standard but can be tweaked later as demand arises (as you pointed out above) or (in my opinion) omitted and left for vendors to specify

Lack of support for referential integrity (inter-entity links are disjointed and ambiguous). - That's not been a problem in GEDCOM. Pretty well all programs correctly maintain and export the links. It's only manually edited GEDCOMs that seem to have problems.

FAMS, ASSOs and ALIAs pointers were all problematic in GEDCOM 5 ... how do you match up Person A's ASSO to Person B with Person B's ASSO to Person A? A relationship is two-way and I think GEDCOM X has solved that with the Relationship entity.

If you take the list of GEDCOM deficiencies, and strike off the few I think are wrong, and discount any that would be easy to fix, there's not much left.

Yes but the devil is in the detail :) It is helpful to know which ones we still need to focus on and which ones need tweaking etc

lkessler commented 11 years ago

Essy,

For why I think a Group Record is important, see: http://www.beholdgenealogy.com/blog/?p=1097

Vendors didn't implement GEDCOM correctly for many reasons. Anything and everything, from interpreting the standard incorrectly, to being lazy, to not caring, to not knowing, to simply making mistakes. I don't know. Ask them. It's nearly impossible to implement anything perfectly. Simpler standards have better chances of getting implemented correctly. GEDCOM is not a simple standard. GEDCOM X is even more complex. It will be even more difficult than GEDCOM to do - even if codebases are provided, since translation of the GEDCOM X data to the program's internal data structure must still happen.

By underuse of rarely-used fields, I meant some of the GEDCOM structures and tags that really would have been useful if developers had known about them. Such as the ALIAs tag (if used the correct way) and the ASSOciation tag. The GEDCOM Source_Record is quite powerful with EVEN, DATE, PLAC, AGNC, AUTH, TITL, PUBL, TEXT, REFN and RIN tags, but very few programs use them to their potential and instead made their own custom citation tags.

A two-way relationship needs to be defined two ways. Parent-Child implies direction, and the type of link and the order of Person1 and Person2 provides that in GEDCOM X. But what do you do in GEDCOM X for other types of relationships, e.g. Barney attended the birth of Pebbles? That is the relationship one way. The other way it is: Pebbles' birth event had Barney attending. How do you write the event so that it is unambiguous in a relationship with Person1 and Person2? It is clearer to write the two one-way relationships in this case.

Well what is it really that GEDCOM X is trying to do that GEDCOM doesn't do? Does everything have to change as radically as GEDCOM X is changing it? It certainly doesn't seem so from that relatively small list of deficiencies, of which you noted that GEDCOM X still had work to do on most of them.

Louis

EssyGreen commented 11 years ago

@lkessler I believe that your "GROUP" requirement is already satisfied by the allowance of multiple roles in an event.

Simpler standards have better chances of getting implemented correctly.

I agree. I also agree that GEDCOM X is too complex atm.

some of the GEDCOM structures and tags that really would have been useful if developers would have known about them

Not sure why developers wouldn't have known about them - they were in the spec. In my opinion they just didn't provide a useful structure (for a variety of reasons, e.g. ambiguity, lack of referential integrity, not wanted/used by the user base, etc.)

The GEDCOM Source_Record is quite powerful [...], but very few programs use them to their potential

A good illustration ... I have frequently used the details you describe, but there is a major flaw: there was no way to describe the Persons or Relationships in that context, so the usefulness was extremely limited. Sadly, GEDCOM X seems to be disposing of this element rather than improving it.

A two-way relationship needs to be defined two ways

Indeed, but this must be done in a way that couples them together ... If Person A has multiple ASSOs with Person B (say Uncle/Nephew and also Step-Father/Step-Son), then it is pretty much impossible to link them together in current GEDCOM - the app can't tell whether Uncle goes with Nephew or with Step-Son, and vice versa. GEDCOM X has fixed that with the Relationship, which binds them together in a particular context.

what do you do in GEDCOM X for other types of relationships, e.g. Barney attended the birth of Pebbles

  • Event: Birth
  • Role 1: Child - Pebbles
  • Role 2: Witness - Barney

How do you write the event so that it is unambiguous in a relationship with Person1 and Person2

I agree this is too specific atm - only parent/child and couple seem to be supported. I think it should be similar to the ASSO but in one entity:

  • Relationship:
  • Person 1: Fred
  • Role: Uncle
  • Person 2: Joey
  • Role: Nephew
  • Sources, Notes etc.

Well what is it really that GEDCOM X is trying to do that GEDCOM doesn't do? Does everything have to change as radically as GEDCOM X is changing it?

Valid questions which only Ryan can answer :)

jralls commented 11 years ago

agree this is too specific atm - only parent/child and couple seem to be supported. I think it should be similar to the ASSO but in one entity:

  • Relationship:
  • Person 1:Fred
  • Role: Uncle
  • Person 2: Joey
  • Role: Nephew

+1, but I'd abstract the Person-Role pair into a class, as is done with EventRole. Perhaps it should be three parts: Person, Role, and Detail, with enumerated Roles Conjugal Partner, Bio Parent, Bio Child, Adopt Parent, Adopt Child, and Other. That captures the relationships a program needs to construct family, ancestry, and descendancy. Detail is a free string so that any other relationship the researcher wants to capture can be specified. Sources, Notes etc.
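A minimal Python sketch of this proposal, purely for illustration - the class names, enum values, and fields below are assumptions, not part of the GEDCOM X specification:

```python
# Hypothetical sketch: each participant in a Relationship is a
# (person, role, detail) triple, mirroring how GEDCOM X models
# EventRole. Binding both directions in one entity removes the
# ambiguity of paired GEDCOM ASSO tags.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Role(Enum):
    CONJUGAL_PARTNER = "ConjugalPartner"
    BIO_PARENT = "BioParent"
    BIO_CHILD = "BioChild"
    ADOPT_PARENT = "AdoptParent"
    ADOPT_CHILD = "AdoptChild"
    OTHER = "Other"

@dataclass
class RelationshipRole:
    person_id: str                 # reference to a Person record
    role: Role
    detail: Optional[str] = None   # free text when role is OTHER

@dataclass
class Relationship:
    roles: List[RelationshipRole]
    sources: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

# Fred is Joey's uncle; the inverse role lives in the same entity,
# so the pairing is unambiguous.
uncle_nephew = Relationship(roles=[
    RelationshipRole("fred", Role.OTHER, detail="Uncle"),
    RelationshipRole("joey", Role.OTHER, detail="Nephew"),
])
```

The enumerated roles cover what a program needs for family, ancestry, and descendancy views; everything else falls through to the free-text detail.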

jralls commented 11 years ago

Well what is it really that GEDCOM X is trying to do that GEDCOM doesn't do? Does everything have to change as radically as GEDCOM X is changing it?

Valid questions which only Ryan can answer :)

Which he did in #156.

jralls commented 11 years ago

Requires sequential processing; the file must be processed entry-by-entry, one at a time. - I don't know why Essy called this DONE. XML is no different than GEDCOM. It is a flat file.

It is not a single XML document, it is a ZIP file containing a bunch of XML documents and other files. See https://github.com/FamilySearch/gedcomx/blob/master/specifications/file-format-specification.md

EssyGreen commented 11 years ago

I'd abstract the Person-Role pair into a class as is done with EventRole

I wouldn't disagree - I was just using simple syntax to illustrate the concept.

lkessler commented 11 years ago

Essy said: "@lkessler I believe that your "GROUP" requirement is already satisfied by the allowance of multiple roles in an event."

But I want groups that can have events of their own, just as I want places that can have events of their own. Because of that, I feel groups and places need to be top-level records.

p.s. I like GEDCOM X merging "events" into "facts".

Essy said: "Not sure why developer's wouldn't have known about them - they were in the spec".

I'm glad you think we developers have perfect interpretation and total recall. :-) I'm sure you and I both know about every little detail that is in GEDCOM X already ... NOT!

John: Thanks for pointing out #156 - Clarify What GedcomX Is. Of course, stoicflame refers to the issue we are in, #141, as the one that will articulate how GEDCOM X will do this. And this one now refers back to that one. And that one refers to this one ...

John: Yes, I know they've physically packaged it into a ZIP containing thousands of files. So change my statement to: "ZIP is no different than GEDCOM. It is still a single file that must be read and the contents extracted for processing." The point was that it is not a database with indexed retrieval. I don't believe you can read a single file from a zip without unzipping it first.

Louis

EssyGreen commented 11 years ago

I want groups that can have events of their own

What are the benefits of doing it this way? You lose flexibility because there would be a very small number of events with exactly the same people in them. For example, if you define a group of Fred, Freda and Joey Bloggs for Joey's birth, the same "Group" might be appropriate for a baptism, but I can't think of many other events you could put under the same group. Similarly, if you define a "Group" for Fred and Freda Bloggs as a married couple, then you will have a marriage and possibly a Residence or two, but chances are one will die before the other, and so the last Residence events would have to be split to cater for the date differences. I can't see that you will ever have more than a couple of events per group, so it seems rather redundant.

Not sure why developer's wouldn't have known about them - they were in the spec

I'm glad you think we developers have perfect interpretation and total recall

I don't understand your sarcasm ... the GEDCOM 5 spec is publicly and widely available. Developers don't need total recall - they just need to be able to read! I think the reason some things didn't get adopted was more to do with the fact that they didn't provide a clear benefit (e.g. the SOUR details would have been really useful if there were some way to specify the people and not just the types of event - without this, the data is just an additional data entry/management burden).

jralls commented 11 years ago

Louis,

And that one refers to this one ...

You're a programmer. You're supposed to like recursion! ;-)

I don't believe you can read a single file from a zip without unzipping it first.

You believe incorrectly. From the Wikipedia zip (file format) article:

Compressing files separately, as is done in zip files, allows for random access: individual files can be retrieved without reading through other data.
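As a quick illustration of that claim, Python's standard-library `zipfile` module reads one member via the archive's central directory without extracting the others (the member names below are made up for the example):

```python
# Demonstrates random access within a ZIP archive: a single entry
# can be read without decompressing or extracting the rest, because
# the reader seeks via the archive's central directory.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("persons/p1.xml", "<person id='p1'/>")
    zf.writestr("persons/p2.xml", "<person id='p2'/>")

with zipfile.ZipFile(buf) as zf:
    # Only p2 is decompressed; p1 is never touched.
    data = zf.read("persons/p2.xml")

print(data.decode())  # prints: <person id='p2'/>
```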

ttwetmore commented 11 years ago

Just picking up on the zip thread for a moment.

I have no objection to zip as a way to package GEDCOMX data. My objection came from keeping each top level object as a separate file. But not because each is a separate file per se, but because the GEDCOMX standard now uses XML with many namespaces and long URIs, so each separate file must therefore contain a truly incredible amount of redundant information that is included anew in every file.

Doesn't it seem to be the pinnacle of irony to use compression on a set of files with so much redundant information? Doesn't it seem to be about the most anti-common sense thing you've heard of in the past couple weeks?

As John pointed out, the zip file contains a directory of its contents, so each internal file can be read separately without unzipping the whole file. So assuming that each top level will be a separate file there are some conclusions that need to be made.

Note that to read an individual file out of a zip you must have a reason to do so, which means an id or key must first be supplied to identify the file. Where would that id or key come from? This points out how important it will be to add an index file to the zip, to be extracted first and then used as an index for everything else in it - for example, the unique ids, the persons' names, and the ids of all other internal files that each internal file refers to, and why. Just imagine the problem of finding someone with a given name in the zip file, and then extracting their pedigree, without first reading the entire zip into an auxiliary database. The only practical way to deal with these zip files is to think of them as mini-databases and to supply an index file to that database in the zip - which is exactly what Java does with the meta file it adds to zip files to turn them into jar files.

So if GEDCOMX sticks with the idea of zip files with individual files for each top-level object, then the standard must also include a definition of the aforementioned index file.
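A minimal sketch of the index-file idea in Python - the manifest path, its JSON format, and the member names are all assumptions for illustration, not anything the GEDCOM X spec defines:

```python
# Hypothetical manifest written as an archive entry: it maps ids and
# names to member paths, so a reader can locate one person without
# scanning every file in the zip.
import io
import json
import zipfile

index = {
    "persons": {
        "p1": {"name": "Fred Flintstone", "file": "persons/p1.xml"},
        "p2": {"name": "Pebbles Flintstone", "file": "persons/p2.xml"},
    }
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("META-INF/index.json", json.dumps(index))
    zf.writestr("persons/p1.xml", "<person id='p1'/>")
    zf.writestr("persons/p2.xml", "<person id='p2'/>")

# A reader extracts the index first, then fetches only what it needs.
with zipfile.ZipFile(buf) as zf:
    idx = json.loads(zf.read("META-INF/index.json"))
    target = next(p for p in idx["persons"].values()
                  if p["name"].startswith("Pebbles"))
    person_xml = zf.read(target["file"]).decode()
```

This is the same pattern as Java's `META-INF/MANIFEST.MF` in a jar: one well-known entry turns the archive into a small queryable database.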

jralls commented 11 years ago

Which is exactly what Java does with the meta file that it adds to zip files to turn them into jar files.

And exactly why the GedcomX spec does too.

now uses XML with many namespaces and long URIs, so each separate file must therefore contain a truly incredible amount of redundant information that is included anew in every file.

Nope. The ZIP can contain a DTD which provides all of the namespaces and their URIs.

ttwetmore commented 11 years ago

And exactly why the GedcomX spec does too.

I'll believe you, but I can't get this from the specifications. I hope you realize I was not talking about a simple directory listing file, but a rich index file. The GEDCOMX specifications are singularly obtuse and almost unintelligible. Maybe the answer is in the header set concept, but if so, the authors have not explained it.

now uses XML with many namespaces and long URIs, so each separate file must therefore contain a truly incredible amount of redundant information that is included anew in every file.

Nope. The ZIP can contain a DTD which provides all of the namespaces and their URIs.

I guess I'll believe you again, but these were not the results that Tamura reported. He expanded GEDCOMX files and inspected the contents. There were no DTDs, and every "file" contained redundant definitions of everything. On converting GEDCOM files to GEDCOMX zip files using the GEDCOMX-provided tool, he saw more than a 35-times increase in file size. From that I can't come to any conclusion other than that the current GEDCOMX file format is a disaster. Maybe a DTD will make it workable. And I have already expressed my strong opinion that the archival format should be all simple tags, with no namespaces and no long URIs.

EssyGreen commented 11 years ago

@ttwetmore - I share your concerns here and would also vote for simple tags.

jralls commented 11 years ago

The GEDCOMX specifications are singularly obtuse and almost unintelligible.

I don't find them to be either. Woefully incomplete, but neither obtuse nor unintelligible.

That said, if you can't understand the specs, then you're arguing about some straw man.

guess I'll believe you again, but this was not the results that Tamura reported. He expanded GEDCOMX files and inspected the contents.

In any case, I said can contain. That's not required by the GedcomX spec (though I think it would be a good idea), but it's not prohibited either, and the XML recommendations allow it. Tamura Jones seems to enjoy throwing rocks without actually doing anything useful. Where did he get these GedcomX files, considering that the only code is Ryan's backwards JAXB mess that he uses to produce the documentation and which doesn't actually do anything?

Moreover, who cares these days about a couple of K of URIs? Text is tiny. Are you writing DeadEnds for the Arduino?

And I have already expressed my strong opinion that the archival format should be all simple tags with no namespaces and no long URIs.

Yup. The RDF discussion is in #165. No need to bring it up here. Anyway, unless FamilySearch can be persuaded to separate the static data-exchange solution from the web services solution, RDF is necessary, so unless we can get that split to happen we should work on making the RDF aspect as painless as possible.

ttwetmore commented 11 years ago

The GEDCOMX specifications are singularly obtuse and almost unintelligible.

I don't find them to be either. Woefully incomplete, but neither obtuse nor unintelligible.

You're clearly a lot smarter than me; I can graciously accept that.

Moreover, who cares these days about a couple of K of URIs? Text is tiny. Are you writing DeadEnds for the Arduino?

I do. A lot. And it isn't a couple of K. When you are using between one and two orders of magnitude too much resource to encode something very simple, you are being profligate to the point of stupidity, no matter how cheap the resource. Our world still has such things as appropriateness, elegance, and rightness in it.

Tamura Jones seems to enjoy throwing rocks without actually doing anything useful. From where did he get these GedcomX files

He built the GEDCOMX files himself using the tool announced by GEDCOMX - a tool that converts GEDCOM files to GEDCOMX files and is available on the GEDCOMX GitHub somewhere. He used it on a number of his test GEDCOM files and published the results. You can check his blog for details.

Anyway, unless FamilySearch can be persuaded to separate the static data-exchange solution from the web services solution, RDF is necessary, so unless we can get that split to happen we should work on making the RDF aspect as painless as possible.

I have suggested an excellent solution to the RDF conundrum that allows the archive format to contain no namespaces and no RDF URIs, but with an easy capability to generate the full form for those who feel they need it. But as you said, this isn't the thread for it.