FamilySearch / GEDCOM

Apache License 2.0

Media types for GEDCOM and GEDZIP #5

Closed dthaler closed 2 years ago

dthaler commented 3 years ago

"application/zip" is a media type for ZIP files in general.

However, when a format adds constraints on top of ZIP, as GEDZIP does, it is common practice to define a new media type. For one of many examples, see application/epub+zip (and search for "zip" in the media types registry for many others).

GEDZIP already defines a new file extension, .gdz, instead of simply reusing .zip. For the same reason, it's just as important to register a more specific media type for GEDZIP and have it point to the Gedcom7 spec as its specification.

dthaler commented 3 years ago

Similarly a media type should be registered for GEDCOM itself, which would then point to the spec as its reference. The application form is https://www.iana.org/form/media-types

tychonievich commented 3 years ago

I agree we should register media types for GEDCOM and GEDZIP, and like using +zip for gedzip.

Here's a preliminary draft of data to submit for GEDCOM:

After reading RFC 6838, I'm not sure what goes under the Provisional Registration field. I also don't know if I understood the difference between "Application usage" and "Intended use" correctly.

I'm not sure if we should use text/gedcom and application/gedcom+zip or put them both in the application/ tree for uniformity.

dthaler commented 3 years ago

  • Interoperability considerations: none

RFC 6838 section 6.2 says about this field:

  Any issues regarding the interoperable use of types employing this
  structured syntax should be given here.  Examples would include
  the existence of incompatible versions of the syntax, issues
  combining certain charsets with the syntax, or incompatibilities
  with other types or protocols.

The GEDCOM 7 spec says "Version 7.0 introduces several breaking changes with version 5.5.1; 5.5.1 files are, in general, not valid 7.0 files and vice versa", so there should be some statement in Interoperability Considerations. For example, is the media type only usable with Version 7? Or would additional parameters be needed along with the media type to ensure interoperability? For example:

Content-Type: text/gedcom; version=7.0

If v8 someday did a breaking change, would we need a new media type or would text/gedcom support multiple versions? If the latter, then "none" isn't sufficient.
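To make the question concrete, here is a minimal sketch of how a receiver might read such a parameter using Python's standard library. The `version` parameter and the `text/gedcom` name are only the hypothetical ones floated in this comment, not anything registered, and `parse_content_type` is a helper name made up for illustration.

```python
from email.message import Message


def parse_content_type(header_value: str) -> tuple[str, dict[str, str]]:
    """Split a Content-Type value into (media type, parameters)."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()
    # get_params() returns (name, value) pairs; the first pair is the
    # media type itself, so skip it to keep only the real parameters.
    params = dict(msg.get_params()[1:])
    return media_type, params


# Hypothetical header from the comment above.
media_type, params = parse_content_type("text/gedcom; version=7.0")
print(media_type, params)
```

A receiver that only understands 7.x could then refuse or convert anything whose `version` parameter it does not recognise, which is the kind of behaviour the Interoperability Considerations text would have to spell out.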

  • Fragment Identifier: not used

Ok, but this does mean that one cannot reference a specific cross-reference identifier in a file, as opposed to allowing a fragment identifier to contain a cross-reference identifier in the GEDCOM. Not sure whether there are scenarios that would need that.
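As a rough illustration of what such a scenario could look like, the sketch below treats a URL fragment as a bare cross-reference identifier and looks up the matching record. This fragment syntax is purely an assumption for discussion (the draft says fragments are not used), and `find_record` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urldefrag


def find_record(gedcom_text: str, url: str) -> Optional[str]:
    """Return the level-0 line whose cross-reference identifier matches the URL fragment."""
    fragment = urldefrag(url).fragment
    if not fragment:
        return None
    xref = f"@{fragment}@"  # assumed mapping: fragment "I1" -> xref "@I1@"
    for line in gedcom_text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) >= 2 and parts[0] == "0" and parts[1] == xref:
            return line
    return None


sample = "0 HEAD\n1 GEDC\n2 VERS 7.0\n0 @I1@ INDI\n1 NAME John /Doe/\n0 TRLR"
print(find_record(sample, "https://example.com/tree.ged#I1"))
```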

  • Provisional Registration: ????? After reading RFC 6838, I'm not sure what goes under the Provisional Registration field.

Provisional registration is not appropriate here, since the spec is already published/stable. For comparison, the application/epub+zip registration filled in "This media type is intended to be permanent." I think you could also just say "N/A".

I also don't know if I understood the difference between "Application usage" and "Intended use" correctly.

Your answers look great to me, should be fine as is.

I'm not sure if we should use text/gedcom and application/gedcom+zip or put them both in the application/ tree for uniformity.

I think it's correct to use "text/gedcom" as you propose. RFC 6838 section 4.2.1 explains:

The "text" top-level type is intended for sending material that is principally textual in form.

compared to section 4.2.5 which says:

The "application" top-level type is to be used for discrete data that do not fit under any of the other type names, and particularly for data to be processed by some type of application program. This is information that must be processed by an application before it is viewable or usable by a user.

rsmith-fhiso commented 3 years ago

RFC 2046, section 4.1.2 defines a charset parameter for all text/* media types. I mentioned to @luther earlier that it says

The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

However I had overlooked RFC 6657 which updates this, describing it as obsolete. Section 3 says:

Regardless of what approach is chosen, all new "text/*" registrations MUST clearly specify how the charset is determined; relying on the default defined in Section 4.1.2 of [RFC2046] is no longer permitted.

I hope the text/gedcom media type will be usable with GEDCOM 5.5 and 5.5.1 as well as 7.x, and that means supporting GEDCOM files encoded in ANSEL, ASCII and UTF-16 in both endian forms, in addition to UTF-8. RFC 6657, section 3 says:

In order to improve interoperability with deployed agents, "text/*" media type registrations SHOULD either

a. specify that the "charset" parameter is not used for the defined subtype, because the charset information is transported inside the payload (such as in "text/xml"), or

b. require explicit unconditional inclusion of the "charset" parameter, eliminating the need for a default value.

Option a. seems the obvious choice for GEDCOM, because it can be determined easily enough by reading HEAD.CHAR if it exists, and failing that, reading HEAD.VERS and defaulting to UTF-8 in version 7 and above, and ANSEL in version 5 and below. It's easy enough to spot UTF-16 by looking at the first few bytes. Of course, an application not supporting legacy versions can just assume UTF-8 as this is the only supported option in GEDCOM 7.
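The detection order described above can be sketched roughly as follows. This is an illustrative reading of the paragraph, not the procedure from any spec: `detect_gedcom_charset` is a made-up name, the version is looked up at HEAD.GEDC.VERS, and falling back to ANSEL when no version is found at all is an assumption of this sketch.

```python
import re


def detect_gedcom_charset(data: bytes) -> str:
    """Guess the character encoding of a GEDCOM byte stream from its payload."""
    # UTF-16 is spotted from the first bytes: a byte-order mark, or the
    # leading "0" of "0 HEAD" next to a NUL byte.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    if data[:2] in (b"\x30\x00", b"\x00\x30"):
        return "UTF-16"
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # drop a UTF-8 byte-order mark before parsing

    # The header tags are ASCII in the remaining encodings, so a lenient
    # latin-1 decode is enough to locate HEAD.CHAR and HEAD.GEDC.VERS.
    path: list[str] = []
    char = None
    vers = None
    for raw in re.split(r"\r\n|\r|\n", data[:8192].decode("latin-1")):
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level, tag = int(parts[0]), parts[1]
        if level == 0 and tag != "HEAD":
            break  # the header record has ended
        del path[level:]
        path.append(tag)
        value = parts[2].strip() if len(parts) > 2 else ""
        if path == ["HEAD", "CHAR"]:
            char = value
        elif path == ["HEAD", "GEDC", "VERS"]:
            vers = value

    if char:
        return char  # the declared encoding wins when present
    major_text = (vers or "").split(".")[0]
    major = int(major_text) if major_text.isdigit() else 0
    # Assumption: a missing or unparsable version is treated like a legacy file.
    return "UTF-8" if major >= 7 else "ANSEL"


print(detect_gedcom_charset(b"0 HEAD\n1 GEDC\n2 VERS 7.0\n0 TRLR\n"))
```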

The same logic suggests not having a version media type parameter – it can be determined by inspection of the header, just as it can in HTML. Yes, there are breaking changes between 5.5 and 5.5.1, and 7.x, but none prevent you from parsing the header well enough to locate the HEAD.CHAR and HEAD.VERS fields. I don't imagine this process needs spelling out, as it doesn't seem to be in other documents; but if it's necessary to detail it, FHISO documented it in our ELF Serialisation draft, sections 3.1 and 3.2, which I'm sure could be pinched.

dthaler commented 3 years ago

The same logic suggests not having a version media type parameter – it can be determined by inspection of the header, just as it can in HTML.

@rsmith-fhiso Yes I completely agree with that logic. The "Encoding considerations" answer should mention the charset can be determined from the payload (i.e., option A), and the "Interoperability considerations" answer should mention the media type should be usable with older versions and the version can be determined from the payload.

rsmith-fhiso commented 3 years ago

@dthaler:

The "Encoding considerations" answer should mention the charset can be determined from the payload (i.e., option A), and the "Interoperability considerations" answer should mention the media type should be usable with older versions and the version can be determined from the payload.

I think you need to go slightly further. RFC 6657, section 3 says:

Regardless of what approach is chosen, all new "text/*" registrations MUST clearly specify how the charset is determined

I think there are two options.

  1. Under "Encoding considerations", put

    The charset parameter must not be used.

  2. Add under "Optional parameters",

    charset – this parameter may be used for compatibility with non-standard GEDCOM, for example to support GEDCOM with non-standard character encodings, or where the declared encoding in the GEDCOM header is incorrect. It should not be used with compliant GEDCOM 5.5 or later files. The encoding declared in this parameter overrides the embedded encoding declared in the GEDCOM header. The parameter has no default value.

There clearly is some value to 2 over 1, and if it were my decision, I think I'd choose that option, but I'd understand if this is not the way FamilySearch want to go.
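The practical difference between the two options can be sketched in a few lines. Everything here is illustrative: `effective_charset` and its `allow_param` flag are hypothetical names, and `header_charset` stands for whatever encoding was read from the GEDCOM header.

```python
from typing import Optional


def effective_charset(
    header_charset: str,
    param_charset: Optional[str] = None,
    allow_param: bool = True,
) -> str:
    """Pick the encoding to decode with under option 1 or option 2."""
    if param_charset is not None and not allow_param:
        # Option 1: the charset parameter must not be used at all.
        raise ValueError("charset parameter must not be used with this media type")
    # Option 2: a supplied parameter overrides the header declaration;
    # there is no default, so without it the header value stands.
    return param_charset or header_charset


print(effective_charset("ANSEL"))
print(effective_charset("ANSEL", "windows-1252"))
```

Under option 2 a server can correct a file whose header declares the wrong encoding without editing the file, which is the compatibility benefit the parameter description mentions.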

tychonievich commented 3 years ago

I appreciate both @dthaler's suggestion to register a media type for v7.x and @rsmith-fhiso's suggestion to register a media type compatible with multiple GEDCOM versions. Both seem good for different use-cases. I thus propose that we register three different media types:

  1. text/ged defining the level+tag system.

    This would be similar in spirit to text/xml.

    We don't have a stand-alone spec for this yet. Each GEDCOM spec has had a chapter devoted to it, but each has included some additional restrictions in that chapter that other versions have not had, such as ever-changing limits on cross reference length (no limit in 4.0 or 7.0, 15 characters in 5.0, 22 characters in 5.5, etc). Perhaps FHISO could create a general "obeyed by all GEDCOM, regardless of version and form" spec?

  2. application/gedcom+ged defining the family history file format defined in the current spec.

    This would be similar in spirit to application/xhtml+xml or image/svg+xml.

    We'd need to decide if we want this to be application/gedcom+ged for all HEAD.GEDC.VERS-identified official GEDCOM versions (i.e. 5.0 and beyond) or application/gedcom7+ged for just 7.x versions. There are pros and cons to each. I notice that image/svg+xml has some references to the 1.1 spec and some to a redirecting URL which currently points to the 2.0 candidate release, implying a mixed approach could also work.

  3. application/gedzip+zip defining the family history and associated media bundle file format defined in the GEDZip section of the current spec.

    This would be similar in spirit to application/epub+zip.

    I think this could be unversioned, particularly if we define the zip as containing "all local URI references" as opposed to "FILE payloads" so that if a future version allows something like "EXTERNAL_SCHEMA <url>" we're still covered.

Obviously, this proposal is more work than doing fewer media types, but I think it could still be worthwhile for the flexibility it brings.
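One way to see what the layered names in this proposal buy is to split each into its top-level type, subtype, and structured-syntax suffix: a generic tool could fall back on the suffix when it does not know the specific subtype. The sketch below only does the string handling; `split_media_type` is a hypothetical helper, and the names are the proposed ones from this comment, not registered types.

```python
from typing import Optional


def split_media_type(media_type: str) -> tuple[str, str, Optional[str]]:
    """Split 'type/subtype+suffix' into (type, subtype, suffix or None)."""
    top, _, sub = media_type.partition("/")
    base, plus, suffix = sub.rpartition("+")
    if not plus:
        return top, sub, None  # no structured-syntax suffix present
    return top, base, suffix


for name in ("text/ged", "application/gedcom+ged", "application/gedzip+zip"):
    print(split_media_type(name))
```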

dthaler commented 3 years ago

To be clear, I was not suggesting that we have separate media types for 7.x vs earlier versions (I just asked the question). I agree with @rsmith-fhiso that the same media type could be used independent of version. If @tychonievich wants to have separate media types, then that's possible and I have no objection, though it's not my preference.

The text/xml and application/xhtml+xml analogy seems like a good comparison to me, so I guess there are arguments both ways between text/ and application/ for these. I suspect we could get registration approved either way as long as the interoperability considerations question is answered well in the submission.

tychonievich commented 3 years ago

Discussed 2021-06-29 While it would be nice to have a media type registration, it is not on the short-term work queue for the FamilySearch GEDCOM 7 Steering Committee. If FHISO or others want to work on this, they may.

dthaler commented 3 years ago

The FHISO draft on this is at https://fhiso.org/TR/gedcom-mediatype

tychonievich commented 3 years ago

Discussed 2021-08-11 Various conversations have changed this issue's priority. FamilySearch now plans to jointly author version-agnostic specs for GEDCOM and GEDZIP with FHISO and use them in media-type submissions to the IANA.

dthaler commented 2 years ago

We currently say:

Fragment Identifier: not used

In light of https://gedcom.io/techfaqs/#how-do-i-link-to-individual-structures-within-a-familysearch-gedcom-file do we think that is still correct?

dthaler commented 2 years ago

The IANA expert raised this issue:

Use of any UTF-16 means that the encoding is "binary". (Also, UTF-16 can't be used by any text/ media type, so if use of UTF-16 is desirable, I think this needs to be application/.)

So we can either: a) Update the request to specify that it is not legal to use UTF-16 with the text/vnd.familysearch.gedcom media type, or b) Switch to application/vnd.familysearch.gedcom

dthaler commented 2 years ago

The UTF-16 restrictions are elaborated on in https://datatracker.ietf.org/doc/html/rfc3023

clarkegj commented 2 years ago

The IANA expert raised this issue:

Use of any UTF-16 means that the encoding is "binary". (Also, UTF-16 can't be used by any text/ media type, so if use of UTF-16 is desirable, I think this needs to be application/.)

So we can either: a) Update the request to specify that it is not legal to use UTF-16 with the text/vnd.familysearch.gedcom media type, or b) Switch to application/vnd.familysearch.gedcom

Version 7 and Change log say "UTF-8 is now the only permitted character encoding"

tychonievich commented 2 years ago

Version 7 and Change log say "UTF-8 is now the only permitted character encoding"

Yes, but this mediatype is for all GEDCOM versions, not just 7.

Am I reading the RFCs correctly that UTF-16 is not allowed because the byte string 0D 0A does not encode a line break? If that's the case, GEDCOM's exceptions are:

That said, I haven't found any 1 CHAR UNICODE files in the wild (only in hand-crafted test files). 5.5.1 added UTF-8 and I find many of those.

I think we thus have three options:

  1. Stick with text/vnd.familysearch.gedcom and remove reference to UTF-16
  2. Switch to application/vnd.familysearch.gedcom instead
  3. Do what XML did: register both text/vnd.familysearch.gedcom and application/vnd.familysearch.gedcom, where the second is a hoop-jump to deal with RFC 2046's rejection of UTF-16 and EBCDIC as acceptable character sets

tychonievich commented 2 years ago

Discussed 2021-10-19 Decided we will register text/vnd.familysearch.gedcom and add a note in it saying that UTF-16 should be converted to UTF-8 prior to transmission.
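A sender following that note might transcode along these lines. This is a minimal sketch, not text from the registration: `utf16_gedcom_to_utf8` is a made-up name, and rewriting a `1 CHAR UNICODE` header line to `1 CHAR UTF-8` is an assumption about how a 5.5.1-style file should be kept self-consistent after conversion.

```python
import codecs
import re


def utf16_gedcom_to_utf8(data: bytes) -> bytes:
    """Transcode a UTF-16 GEDCOM byte stream to UTF-8 before transmission."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")  # the BOM selects the byte order and is consumed
    elif data[:2] == b"\x30\x00":
        text = data.decode("utf-16-le")  # "0" then NUL: little-endian, no BOM
    elif data[:2] == b"\x00\x30":
        text = data.decode("utf-16-be")  # NUL then "0": big-endian, no BOM
    else:
        raise ValueError("input does not look like UTF-16 GEDCOM")
    # Assumption: the header declares UTF-16 as "1 CHAR UNICODE"; update it
    # so the declared encoding matches the bytes actually sent.
    text = re.sub(r"(?m)^1 CHAR UNICODE\b", "1 CHAR UTF-8", text, count=1)
    return text.encode("utf-8")


source = "0 HEAD\r\n1 CHAR UNICODE\r\n0 @I1@ INDI\r\n1 NAME Ærø /Jensen/\r\n0 TRLR\r\n"
print(utf16_gedcom_to_utf8(source.encode("utf-16")).decode("utf-8"))
```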