FamilySearch / gedcomx-citation

GEDCOM X extensions for providing citation fields, citation templates, and their associated enumerated values.
Apache License 2.0

Doesn't MARC do what we need? #2

Open jralls opened 11 years ago

jralls commented 11 years ago

The US Library of Congress has developed a widely used standard for storing bibliographic information called MARC, including an XML serialization. This standard is used by, among many others, OCLC, the parent of WorldCat.

Doesn't that meet the needs of GedcomX for specifying the citationField information? Why reinvent the wheel?

fleep commented 11 years ago

I'll review the MARC specification a little bit more today. The only gripe I have upon initial review is that the semantics of it are somewhat abstract and fall outside the general design elegance that's being baked into GedcomX. Take this XML serialization as an example:

http://www.loc.gov/standards/marcxml/xml/collection.xml

GedcomX is generally human-readable as well as machine-readable. MARC opted for tags where the meaning of the element is tied up in a tag ID which needs to be referenced.

Perhaps an extension of / parallel vocabulary to MARC? While I do appreciate that it is an open standard that is widely used, we have the opportunity to update a 13-14 year old standard with a (in my opinion) somewhat-unwieldy XML schema.
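
To make the tag-ID concern concrete, here is a minimal Python sketch (not taken from the MARC documentation) that reads a hand-written MARCXML record and has to consult a lookup table before it can say what each field means. The sample record and the two-entry `TAG_LABELS` table are illustrative assumptions; a real application would need the full MARC 21 field list.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARCXML "slim" schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record for illustration; not copied from collection.xml.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example family history /</subfield>
    <subfield code="c">Jane Doe.</subfield>
  </datafield>
</record>"""

# Deliberately tiny lookup table: only two of the many MARC 21 field tags.
TAG_LABELS = {
    "100": "main entry (personal name)",
    "245": "title statement",
}


def describe(record_xml):
    """Return (label, text) pairs for each datafield in one MARCXML record."""
    root = ET.fromstring(record_xml)
    described = []
    for field in root.findall("marc:datafield", MARC_NS):
        tag = field.get("tag")
        subfields = field.findall("marc:subfield", MARC_NS)
        text = " ".join(sf.text for sf in subfields)
        # Without the lookup table, the numeric tag alone says nothing to a reader.
        described.append((TAG_LABELS.get(tag, "unknown tag " + tag), text))
    return described


for label, text in describe(SAMPLE):
    print(label + ": " + text)
```

The point of the sketch is only that the element names (`datafield`, `subfield`) carry no meaning by themselves; the meaning lives in the `tag` and `code` attributes.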

jralls commented 11 years ago

MARC opted for tags where the meaning of the element is tied up in a tag ID which needs to be referenced.

Granted. Its roots are ancient, and it shows. For example, MARC expands to "MAchine-Readable Cataloging", which has got to be the most unimaginative government acronym in history.

we have the opportunity to update a 13-14 year old standard with a (in my opinion) somewhat-unwieldy XML schema.

Are you referring to GEDCOM?

Over in the main gedcomx project you wrote:

I think adhering to existing semantic markup standards would be preferable to an entirely new vocabulary where possible.

Did you have such a standard in mind, or were you just stating a general preference? If the latter, I agree in general, but MARC is the only such standard I know of.

fleep commented 11 years ago

we have the opportunity to update a 13-14 year old standard with a (in my opinion) somewhat-unwieldy XML schema.

Are you referring to GEDCOM?

No, referring to the XML serialization for MARC. I like the prior art, but not the implementation. Of course, I come from a background of managing web software teams, and I am always trying to juggle that line between human-readable formats (for ease of development) and machine readable formats, so I have a tendency to be particular.

I think adhering to existing semantic markup standards would be preferable to an entirely new vocabulary where possible.

Did you have such a standard in mind, or were you just stating a general preference? If the latter, I agree in general, but MARC is the only such standard I know of.

General preference - but in the case of the previous discussion that was specifically referring to markup within citation text, not the structure of citations as a whole. Since the styling was geared around denoting which parts of citation text were things like article titles, I was suggesting that we use established X/HTML5 tags for inline tagging ( over ).

My concern with MARC would be that its relative complexity will be a hindrance to adoption of the standard, and requiring that developers learn this additional schema in addition to GEDCOMX would make this worse.

I'm relatively new to the GEDCOMX project, so excuse me for catching up. How would incorporating MARC as the standard for bibliographical citations jibe with the mission of the project?

jralls commented 11 years ago

How would incorporating MARC as the standard for bibliographical citations jibe with the mission of the project?

The mission of the project is to exchange genealogical data between applications. One of the most important facets of genealogical data is the bibliographical description of sources, and MARC is a widely-adopted standard for bibliographical data.

The FHL has announced that they will soon make their catalog available over WorldCat, joining Allen County, MidContinent, the Godfrey, the New York Public Library, most state and university libraries, along with the Library of Congress, the British Library, and Libraries Canada and thousands of other libraries and archives world-wide. Since WorldCat uses MARC, it seems like an obvious choice.

fleep commented 11 years ago

Upon reading the MARC spec much more closely, I found that one of the higher-numbered field tags is used to represent an Electronic Location. For social data and data that originates in applications like FamilySearch/Geni/etc. (e.g., it existed as an oral tradition in a family until submitted by a user to an application), this was one of my primary concerns. There's been a little written out there about storing internet data in MARC, but it seems like it's a fine standard.

Where do you see MARC as fitting into GEDCOMX? Does the schema of a SourceCitation become a MARC document?

jralls commented 11 years ago

Where do you see MARC as fitting into GEDCOMX? Does the schema of a SourceCitation become a MARC document?

No, that wouldn't be backwards compatible.

I think the way forward with MARC would be that most of this spec goes away, leaving only the "extension to SourceDescription" part, which changes to a single entry, structuredSourceDescription which is a MARC21 document in the appropriate serialization (XML or JSON).

More flexibility could be offered by leaving the URI field currently called citationTemplate so that other bibliographic data standards could be accommodated as well.
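
One way to picture this proposal, purely as a sketch: only the names `citationTemplate` and `structuredSourceDescription` come from the discussion above. The JSON layout, the helper names, the use of the MARCXML namespace as an identifier, and the `example.org` URI are all assumptions for illustration.

```python
import json

# Assumption: the MARCXML namespace URI doubles as the identifier for MARC21.
MARC21_URI = "http://www.loc.gov/MARC21/slim"
# Hypothetical identifier standing in for some other bibliographic standard.
OTHER_URI = "http://example.org/other-bibliographic-standard"

KNOWN_STANDARDS = {MARC21_URI: "MARC21", OTHER_URI: "other"}


def wrap(standard_uri, document):
    """Build the (hypothetical) extension entry for a SourceDescription."""
    return {
        "citationTemplate": standard_uri,          # which standard the payload uses
        "structuredSourceDescription": document,   # the serialized record itself
    }


def standard_of(description):
    """Tell a consumer which parser it would need for the payload."""
    return KNOWN_STANDARDS.get(description["citationTemplate"], "unknown")


desc = wrap(MARC21_URI, "<record xmlns='http://www.loc.gov/MARC21/slim'/>")
print(json.dumps(desc, indent=2))
print(standard_of(desc))
```

A consumer that does not recognize the URI can still carry the payload through untouched, which is the flexibility the URI field is meant to buy.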

jralls commented 11 years ago

I was just reminded via a search for something else of MODS, a simplified and somewhat more modern reinterpretation of MARC21, also maintained by the Library of Congress. Being simplified, it is intentionally less detailed, but if MARC21 is too complex, MODS might be more suitable.

gthorud commented 11 years ago

I have not checked everything that may be in development wrt standards, but my impression is that the world of meta data is at best very diverse. I would be surprised if you find an existing standard that satisfies the diverse requirements of the international family history community - both wrt meta data and/or citation style. It should be possible long term to develop a meta data standard based on a reasonably sized set of Citation Elements (variables, fields) - but as long as everyone is trying to find a simple solution without doing any work, it may take a long time. Also, the rest of the world (libraries, archives etc) may not be ready to transform their data into the "meta data standard for family history"; you will therefore have to handle different solutions and/or interface with different existing solutions/data (incl. current data in users' programs) - for a long time.

Please tell me I am wrong.

GeneJ commented 11 years ago

One of the more interesting initiatives in process is BibFrame (http://bibframe.org/), sponsored by the Library of Congress to "better accommodate future needs of the library community" (ie, a move away from MARC21). A "Community Draft" was published 2 May 2013.

To put Geir's comment in some perspective ("the world of meta data is at best very diverse") see the Wikipedia entry for "Standardized Metadata."

It's been a while since I dove into the details and corresponded with others about the different standards, but know that I simplified (perhaps oversimplified) the metadata world by recognizing three overly broad categories of materials, or categories of source types:

  1. Published works--the heart of librarianship metadata. Stress testing for this group of source types would focus on journal articles and serialized works.
  2. Archival materials--far more interesting as archives tend to organize materials using hierarchical systems and "collections." In the Wikipedia article referenced above, see EAD ("Encoded Archival Description").

DACS is not in the Wikipedia entry for Standardized Metadata, but there is a separate entry there for it. See the entry, "Describing Archives: A Content Standard." That article explains that, "DACS specifies only the type of content, not the structural or encoding requirements or the actual verbiage to be used; it is therefore suitable for use in conjunction with structural and encoding standards such as MARC and EAD and with controlled vocabularies such as MeSH, LCSH, AAT, and so on."

Also, national archives tend to develop their own systems. Not exclusive to meta data, see, for example:

Citing Records in the National Archives of the United States http://www.archives.gov/publications/general-info-leaflets/17-citing-records.html

Citing documents in The National Archives [(TNA)] http://www.nationalarchives.gov.uk/records/citing-documents.htm

Citing archival records – Fact sheet 7 http://www.naa.gov.au/collection/fact-sheets/fs07.aspx

  3. Privately held materials and artifacts, aka the bankers box. This overly broad set of materials frequently contains correspondence and photographs, as well as many untitled, undated, sometimes annotated this and that.

Hope this helps.

GeneJ--not an expert, just pulling for all of you to work together on a solution ... including one that might come in stages.

jralls commented 11 years ago

BibFrame (http://bibframe.org/), sponsored by the Library of Congress to "better accommodate future needs of the library community" (ie, a move away from MARC21).

You mis-characterize BibFrame. From the first paragraph on that website: "A major focus of the initiative will be to determine a transition path for the MARC 21 exchange format to more Web based, Linked Data standards." So not a move away from MARC 21 but the future of MARC 21.

As for DACS, it's not bibliographic. As your quoted paragraph says "... it is therefore suitable for use in conjunction with structural and encoding standards such as MARC ..."

All roads lead to MARC. ;-)

Also, national archives tend to develop their own systems. Not exclusive to meta data, see, for example:

Those references are for the output format -- they're style guides, like CMS, APA, MLA, etc., and are the province of the application. See #1. All of those institutions are known to use MARC 21 for their catalog databases.

While it's also true that not everyone uses MARC 21 -- or if they do, they might not expose MARC 21 records to direct public access -- that's not particularly germane to GedcomX. What's important is that MARC 21 is a widely deployed standard schema that is capable of encoding both library and archival citation metadata. The questions that this issue poses are:

  • Does MARC in fact cover everything that genealogists need for recording their source metadata?
  • Is there another already-developed standard that would do the job better?

I frankly don't think that the expertise exists in the genealogical software community to design a metadata standard that's better than one designed by a sizable team of librarians, archivists, and programmers, so I think it would be unwise for us to attempt to design our own standard.

jralls commented 11 years ago

Please tell me I am wrong.

OK. You're wrong. ;-)

First of all, see #1 for why GedcomX need concern itself only with the metadata.

Second, you seem to think that this standard -- gedcomx-citation -- encompasses all possible metadata. That's absurd. It concerns only bibliographic metadata: The metadata necessary to construct a citation. There are indeed other such standards besides MARC, but all of the others I have found so far are limited to documenting published works; only MARC (and its expression in MODS) can document both published and archival materials.

jralls commented 11 years ago

I re-found another MODS resource: BibUtils is a FOSS program that uses MODS as an intermediate format to translate between a variety of citation database formats such as BibTeX, EndNote, and PubMed. The authors have provided some high-level overview material on MODS that might be helpful in understanding its capabilities without wading through the specs at www.loc.gov.

gthorud commented 11 years ago

@jralls: Second, you seem to think that this standard -- gedcomx-citation -- encompasses all possible metadata. That's absurd. It concerns only bibliographic metadata: The metadata necessary to construct a citation.

I don't know how you come to the conclusion that I have written that we should encompass "all possible metadata"? After all, the context of this discussion is about citations.

jralls commented 11 years ago

I don't know how you come to the conclusion that I have written that we should encompass "all possible metadata"? After all, the context of this discussion is about citations.

Simple. You said:

but my impression is that the world of meta data is at best very diverse

I doubt that you'd have gotten that impression had you confined your survey to bibliographic metadata. There aren't that many standards or file formats which address it, and when you further specify that a schema has to cover both published and archival material you're pretty much left with MARC/MODS.

thomast73 commented 10 years ago

Just trying to list all of the projects that have been identified as possibly useful/adoptable:

| Project | Notes |
| --- | --- |
| MARC21 | Developed by the Network Development and MARC Standards Office of the Library of Congress. |
| MODS | Metadata Object Description Schema. Related to MARC21. XML. Has a defined RDF representation. Maintained by the Network Development and MARC Standards Office of the Library of Congress with community input. |
| BIBFRAME | Bibliographic Framework. Related to MARC21. RDF/XML. A collaborative effort of the Library of Congress and Zepheira with community input. |
| CSL | Citation Style Language. More about rendering bibliographic information than the exchange of bibliographic information. |
| ISBD | International Standard Bibliographic Description. Plain text. Human readable. |
| SourceTemplates | A model for genealogical source citations put forth by Real-Time Collaboration, Inc. and BetterGEDCOM. XML template that includes data definition and rendering information. |
| DCMI | Dublin Core Metadata Initiative. |

Another resource mentioned was the Sources and Citations page on the BetterGEDCOM wiki. It contains many ideas and discussion points, including proposals by Robert Raymond and G.Thorud (pdf).

What else should be in this list?

GeneJ commented 10 years ago

Good job, Thomas.

I would add, EAD. Wikipedia, "Encoded Archival Description," summarizes as, "an XML standard for encoding archival finding aids, maintained by the Technical Subcommittee for Encoded Archival Description of the Society of American Archivists, in partnership with the Library of Congress."

jralls commented 10 years ago

Nice recap; the one thing that's missing from the table is what source-types are supported, since we need to capture both published and archival types. MARC21 and MODS include formats for both. BIBFRAME isn't specified yet; it's the project to define the next generation of MARC. Neither ISBD nor DCMI supports archival records, and EAD doesn't support published work. OTOH, EAD could be used in conjunction with MARC21 or MODS, since it affords a richer vocabulary for describing archival materials, particularly when combined with EAC-CPF, along with its own ecosystem in ArchiveGrid to complement WorldCat's use of MARC21. Note as well that EAD is currently being updated, with a final standard expected at the end of next year.

thomast73 commented 10 years ago

...one thing that's missing from the table is what source-types are supported, since we need to capture both published and archival types.

@jralls, I agree that this is an important part of the needed evaluation. Would you be willing to create such a table? And perhaps summarize how you know that the desired support exists (or is missing).

Also, is it as simple as finding support for "published works" and "archival works"? Or are there other complexities that need attention?

Are "manuscripts" a subset of "archival works"?

jralls commented 10 years ago

Would you be willing to create such a table?

I sort of did, but not in table form. I'll try with your table:

| Project | Source Types | Notes |
| --- | --- | --- |
| MARC21 | Both | Developed by the Network Development and MARC Standards Office of the Library of Congress. |
| MODS | Both | Metadata Object Description Schema. Related to MARC21. XML. Has a defined RDF representation. Maintained by the Network Development and MARC Standards Office of the Library of Congress with community input. |
| BIBFRAME | Both | Bibliographic Framework. "Next Generation" of MARC21. RDF/XML. A collaborative effort of the Library of Congress and Zepheira with community input. |
| CSL | Both | Citation Style Language. More about rendering bibliographic information than the exchange of bibliographic information, but a data model based on their standard variables would be simple and would allow direct input to applications which use CSL. |
| ISBD | Published only? | International Standard Bibliographic Description. Plain text. Human readable. Probably covers published materials only, but it's hard to be sure without buying their book. |
| SourceTemplates | [Nothing](http://sourcetemplates.org/details-SourceType.php) | A model for genealogical source citations put forth by Real-Time Collaboration, Inc. and BetterGEDCOM. XML template that includes data definition and rendering information. This is 100% vapor. |
| DCMI | Published only | Dublin Core Metadata Initiative. There may be extensions out there that support archival materials as well, but the link seems to be the only schema that DCMI.org stands behind. Beyond that, DCMI is a meta-spec for specifying schemas. |

jralls commented 10 years ago

Well, fail. How do you make a table in Markdown? The reference seems to be incorrect.

jralls commented 10 years ago

Also, is it as simple as finding support for "published works" and "archival works"? Or are there other complexities that need attention?

Probably. Even the formats that do handle archival materials are probably oriented towards actual archives like NARA. They may not work too well when the location of the document is some courthouse attic in West Virginia or that old suitcase full of papers that Aunt Suzy has.

Are "manuscripts" a subset of "archival works"?

Usually.

GeneJ commented 10 years ago

Hoping only to be helpful.

Traditionally, US systems are unique regarding application of the archival hierarchy--reference notes typically start with the smallest element and work up; bibliographic citations start with the largest element. Outside of the US, both reference notes and bibliographic citations typically start with the largest element. (Thus, one needs to concern themselves not only with the element, but where that element falls in the hierarchy.)

Use of the term, "bibliographic." Bibliographic metadata (think, "bibliography" or "source list") traditionally concerns itself with higher level cataloging elements; reference note citations (think "footnotes" or "endnotes") often require more detailed elements. The more granular your focus, the more so this is true.

Generally speaking, if it is online, it has been published (or re-published, as the case may be).

In my experience, the vast majority of privately held or privately exchanged material either (a) has not been previously cataloged and/or, (b) has become disassociated with same. It doesn't come with "metadata" per se.

mikkelee commented 10 years ago

Well, fail. How do you make a table in Markdown? The reference seems to be incorrect.

I think it broke because there needs to be spaces around. I pasted your post into mine below, and it looks correct:

| Project | Source Types | Notes |
| --- | --- | --- |
| MARC21 | Both | Developed by the Network Development and MARC Standards Office of the Library of Congress. |
| MODS | Both | Metadata Object Description Schema. Related to MARC21. XML. Has a defined RDF representation. Maintained by the Network Development and MARC Standards Office of the Library of Congress with community input. |
| BIBFRAME | Both | Bibliographic Framework. "Next Generation" of MARC21. RDF/XML. A collaborative effort of the Library of Congress and Zepheira with community input. |
| CSL | Both | Citation Style Language. More about rendering bibliographic information than the exchange of bibliographic information, but a data model based on their standard variables would be simple and would allow direct input to applications which use CSL. |
| ISBD | Published only? | International Standard Bibliographic Description. Plain text. Human readable. Probably covers published materials only, but it's hard to be sure without buying their book. |
| SourceTemplates | Nothing | A model for genealogical source citations put forth by Real-Time Collaboration, Inc. and BetterGEDCOM. XML template that includes data definition and rendering information. This is 100% vapor. |
| DCMI | Published only | Dublin Core Metadata Initiative. There may be extensions out there that support archival materials as well, but the link seems to be the only schema that DCMI.org stands behind. Beyond that, DCMI is a meta-spec for specifying schemas. |

mikkelee commented 10 years ago

FWIW, I'd prefer a MARC-variant or a simple semantic markup such as the CSL variables.

jralls commented 10 years ago

I think it broke because there needs to be spaces around. I pasted your post into mine below, and it looks correct:

Well, I tried adding spaces to mine and it still won't format, so everybody look at @mikkelee's.

thomast73 commented 10 years ago

I was hoping you would include information about how we know that a given project supports "Both" or only "Published" works? What information are we evaluating that says this is true? What features did you evaluate and find sufficient? When it did not support "Both", how did you confirm this? Help me see what you are seeing: the information that is causing you to say "Yeah, this works." or "No, this fails because...."

Perhaps we also need a closer look at the strengths/weaknesses of that support? It might be helpful to explore each with some specific use cases—actually express the bibliographic metadata for some common use cases? Perhaps I am not asking for this in the DCMI case. But it does seem like a relevant exercise for the other cases.

jralls commented 10 years ago

I was hoping you would include information about how we know that a given project supports "Both" or only "Published" works? What information are we evaluating that says this is true? What features did you evaluate and find sufficient? When it did not support "Both", how did you confirm this? Help me see what you are seeing: the information that is causing you to say "Yeah, this works." or "No, this fails because...."

Did you follow the links and read them? They got lost in @mikkelee's copy-n-paste exercise, you'll have to go back to the one that I still can't get to format correctly. Each is to the appropriate specification. If the specification includes a mechanism for recording information for non-published works, including fields that can be used for things like series, record group, box, etc., then I accepted that the spec supports recording archival information. In some cases I added to the notes with some additional comments, but perhaps they're not clear enough, so I'll amplify in follow-on posts.

Perhaps we also need a closer look at the strengths/weaknesses of that support? It might be helpful to explore each with some specific use cases—actually express the bibliographic metadata for some common use cases? Perhaps I am not asking for this in the DCMI case. But it does seem like a relevant exercise for the other cases.

Yes, absolutely. I haven't done that level of analysis, and I'm not at all convinced that any of them support non-institutional holdings (that suitcase at Aunt Suzy's) or unorganized holdings like the courthouse attic.
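
The acceptance rule described above (a spec counts as supporting archival material if it has fields for things like series, record group, or box) can be sketched in a few lines of Python. The field names and the example field lists are hypothetical placeholders, not taken from any of the specifications in the table.

```python
# Hypothetical field names, chosen only to illustrate the rule.
PUBLISHED_FIELDS = {"title", "publisher", "publication_date"}
ARCHIVAL_FIELDS = {"series", "record_group", "box"}


def supported_source_types(spec_fields):
    """Classify a spec by which kinds of source it has fields for."""
    fields = set(spec_fields)
    published = bool(fields & PUBLISHED_FIELDS)
    archival = bool(fields & ARCHIVAL_FIELDS)
    if published and archival:
        return "Both"
    if published:
        return "Published only"
    if archival:
        return "Archival only"
    return "Nothing"


# Made-up field lists standing in for real specifications.
print(supported_source_types(["title", "publisher", "record_group", "box"]))
print(supported_source_types(["title", "publication_date"]))
```

This only checks that a slot exists; it says nothing about whether the slot copes with a suitcase at Aunt Suzy's, which is the use-case analysis still to be done.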

jralls commented 10 years ago

Well, I tried adding spaces to mine and it still won't format, so everybody look at @mikkelee's.

Ah, finally got it. The problem is that it needed a blank line before the header row.
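
For anyone generating such tables programmatically, here is a small Python sketch of that fix: emit a blank line, then the header row, then the delimiter row, then the data rows. The function name and the sample rows are illustrative only.

```python
def markdown_table(header, rows):
    """Render a pipe table, preceded by the blank line the renderer needs."""
    lines = [
        "",  # blank line so the table is not glued to the preceding paragraph
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


table = markdown_table(
    ["Project", "Source Types"],
    [["MARC21", "Both"], ["MODS", "Both"]],
)
# Appending the table directly after a paragraph now keeps the blank line.
print("I'll try with your table:" + "\n" + table)
```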

jralls commented 10 years ago

Some further discussion of the standards. I'm going to spread this over several posts, and it may take a day or three to get them all done. I'll start with SourceTemplates.org, ISBD, and DCMI, mostly because their problems are sufficiently obvious that it won't take a lot of work.

SourceTemplates.org

The first time I encountered this I had high hopes. Then I read all 5 pages on the website. There's nothing there except three examples of what a template might look like, a UML class diagram with no explanations, and a plea to BetterGEDCOM to standardize the data field set into a "richer list". Utterly useless.

ISBD

As I noted briefly in the table, there's a price for admission. For those who didn't follow the link, it's $133. No thanks.

From what's available online, it appears that they cover the sorts of published materials found in libraries: Books, maps, musical recordings, and artwork. There is a section on monographs, but it isn't clear from the online information what distinguishes them from regular books. The ISBD example document has only published works, albeit in a variety of formats including book, videocassette (remember those?), CD, and microfilm.

DCMI

The Dublin Core is the grand-daddy of online metadata. Their 15 core elements were one of the first RDF and XML bibliographic specs released. Unfortunately, rather than expanding the vocabulary of those first 15 elements, they've opted instead to define "qualifications" which can narrow the meanings of the terms for greater specificity. I think it's instructive that OCLC was one of the original members of DCMI and that they've opted to use MARC21 for their actual cataloging needs.

thomast73 commented 10 years ago

Just want to record another metadata project for possible consideration:

Open Metadata Registry

Not sure if or how it might fit in, but it looks interesting nonetheless.

jralls commented 10 years ago

"The Metadata Registry provides services to developers and consumers of controlled vocabularies". Wikipedia naturally has a more detailed description. This particular one seems to have branched out from its original scientific roots: The Resource Owners includes entries for DCMI and FOAF.

Might be useful for FamilySearch to use it as a repository for the GedcomX controlled vocabs, but it doesn't seem to be particularly applicable here.