Closed dthaler closed 2 years ago
Similarly a media type should be registered for GEDCOM itself, which would then point to the spec as its reference. The application form is https://www.iana.org/form/media-types
I agree we should register media types for GEDCOM and GEDZIP, and like using +zip for gedzip.
Here's a preliminary draft of data to submit for GEDCOM:
text/gedcom
EF BB BF 30 20 48 45 41 44, or 30 20 48 45 41 44
.ged
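As a rough illustration of how the two magic numbers in the draft could be used to sniff a file, here is a minimal Python sketch. The function name `looks_like_gedcom` is hypothetical, and the check assumes the file starts directly with `0 HEAD`, optionally preceded by a UTF-8 byte order mark, exactly as the draft bytes suggest.

```python
# Magic numbers from the draft: "0 HEAD", with or without a UTF-8 byte order mark.
UTF8_BOM = bytes.fromhex("EF BB BF")
HEAD = bytes.fromhex("30 20 48 45 41 44")  # ASCII bytes for "0 HEAD"


def looks_like_gedcom(data: bytes) -> bool:
    """Return True if data begins with one of the two draft magic numbers.

    Hypothetical helper for illustration; it does not validate the file.
    """
    return data.startswith(UTF8_BOM + HEAD) or data.startswith(HEAD)


if __name__ == "__main__":
    print(looks_like_gedcom(b"0 HEAD\n1 GEDC\n2 VERS 7.0\n0 TRLR\n"))
    print(looks_like_gedcom(b"PK\x03\x04"))
```

Note that a UTF-16 file would not match either byte sequence, so this sketch only covers the UTF-8 and single-byte cases.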
After reading RFC 6838, I'm not sure what goes under the Provisional Registration field. I also don't know if I understood the difference between "Application usage" and "Intended use" correctly.
I'm not sure if we should use text/gedcom and application/gedcom+zip or put them both in the application/ tree for uniformity.
- Interoperability considerations: none
RFC 6838 section 6.2 says about this field:
Any issues regarding the interoperable use of types employing this structured syntax should be given here. Examples would include the existence of incompatible versions of the syntax, issues combining certain charsets with the syntax, or incompatibilities with other types or protocols.
Since the GEDCOM 7 spec says "Version 7.0 introduces several breaking changes with version 5.5.1; 5.5.1 files are, in general, not valid 7.0 files and vice versa" then there should be some statement in Interoperability Considerations. For example, is the media type only usable with Version 7? Or would additional parameters be needed along with the media type to ensure interoperability? For example:
Content-Type: text/gedcom; version=7.0
If v8 someday did a breaking change, would we need a new media type or would text/gedcom support multiple versions? If the latter, then "none" isn't sufficient.
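To make the question concrete, this is roughly how a receiver could read such a parameter, sketched with Python's standard `email.message` module. The `version` parameter is purely hypothetical here (it is the one floated in the example header above, not something registered), and `parse_content_type` is an illustrative name.

```python
from email.message import EmailMessage
from typing import Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (media type, version parameter).

    The "version" parameter is hypothetical; None is returned when absent.
    """
    msg = EmailMessage()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_param("version")


if __name__ == "__main__":
    print(parse_content_type("text/gedcom; version=7.0"))
    print(parse_content_type("text/gedcom"))
```

If the media type covered several incompatible versions without any such parameter, the receiver would have to fall back on inspecting the payload.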
- Fragment Identifier: not used
Ok, but this does mean that one cannot reference a specific cross-reference identifier in a file, as opposed to allowing a fragment identifier to contain a cross-reference identifier in the GEDCOM. Not sure whether there are scenarios that would need that.
- Provisional Registration: ????? After reading RFC 6838, I'm not sure what goes under the Provisional Registration field.
Provisional registration is not appropriate here, since the spec is already published/stable. For comparison application/epub+zip filled in "This media type is intended to be permanent." I think you could also just say "N/A"
I also don't know if I understood the difference between "Application usage" and "Intended use" correctly.
Your answers look great to me, should be fine as is.
I'm not sure if we should use text/gedcom and application/gedcom+zip or put them both in the application/ tree for uniformity.
I think it's correct to use "text/gedcom" as you propose. RFC 6838 section 4.2.1 explains:
The "text" top-level type is intended for sending material that is principally textual in form.
compared to section 4.2.5 which says:
The "application" top-level type is to be used for discrete data that do not fit under any of the other type names, and particularly for data to be processed by some type of application program. This is information that must be processed by an application before it is viewable or usable by a user.
RFC 2046, section 4.1.2 defines a charset parameter for all text/* media types. I mentioned to @luther earlier that it says
The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.
However I had overlooked RFC 6657 which updates this, describing it as obsolete. Section 3 says:
Regardless of what approach is chosen, all new "text/*" registrations MUST clearly specify how the charset is determined; relying on the default defined in Section 4.1.2 of [RFC2046] is no longer permitted.
I hope the text/gedcom media type will be usable with GEDCOM 5.5 and 5.5.1 as well as 7.x, and that means supporting GEDCOM files encoded in ANSEL, ASCII and UTF-16 in both endian forms, in addition to UTF-8. RFC 6657, section 3 says:
In order to improve interoperability with deployed agents, "text/*" media type registrations SHOULD either
a. specify that the "charset" parameter is not used for the defined subtype, because the charset information is transported inside the payload (such as in "text/xml"), or
b. require explicit unconditional inclusion of the "charset" parameter, eliminating the need for a default value.
Option a. seems the obvious choice for GEDCOM, because it can be determined easily enough by reading HEAD.CHAR if it exists, and failing that, reading HEAD.VERS and defaulting to UTF-8 in version 7 and above, and ANSEL in version 5 and below. It's easy enough to spot UTF-16 by looking at the first few bytes. Of course, an application not supporting legacy versions can just assume UTF-8 as this is the only supported option in GEDCOM 7.
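The detection order described above could be sketched in Python along these lines. This is only an illustration under stated assumptions: the function name is hypothetical, the version is assumed to live at HEAD.GEDC.VERS, only the first 4096 bytes are scanned, and a file with neither a CHAR nor a usable VERS line is assumed to fall back to ANSEL.

```python
import re


def detect_gedcom_charset(data: bytes) -> str:
    """Guess the character encoding of a GEDCOM payload from its header."""
    # UTF-16 shows up in the first two bytes: a BOM, or "0" padded with a NUL.
    if data[:2] in (b"\xff\xfe", b"0\x00"):
        return "UTF-16LE"
    if data[:2] in (b"\xfe\xff", b"\x000"):
        return "UTF-16BE"
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # drop a UTF-8 byte order mark

    # Levels and tags are ASCII, so Latin-1 is enough to locate them.
    text = data[:4096].decode("latin-1")
    char = vers = parent = None
    for line in re.split(r"[\r\n]+", text):
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2].strip() if len(parts) > 2 else ""
        if level == "0":
            if tag != "HEAD":
                break  # the header record has ended
            continue
        if level == "1":
            parent = tag
            if tag == "CHAR":
                char = value.upper()
        elif level == "2" and tag == "VERS" and parent == "GEDC":
            vers = value

    if char:
        return char  # HEAD.CHAR wins when present
    major = vers.split(".")[0] if vers else ""
    if major.isdigit() and int(major) >= 7:
        return "UTF-8"
    return "ANSEL"  # assumed fallback for version 5 and below


if __name__ == "__main__":
    print(detect_gedcom_charset(b"0 HEAD\n1 GEDC\n2 VERS 7.0\n0 TRLR\n"))
```

The returned string is whatever HEAD.CHAR declares (for example `UNICODE` in a 5.5 file), so a caller would still need to map it to a codec name.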
The same logic suggests not having a version media type parameter – it can be determined by inspection of the header, just as it can in HTML. Yes, there are breaking changes between 5.5 and 5.5.1, and 7.x, but none prevent you from parsing the header well enough to locate the HEAD.CHAR and HEAD.VERS fields. I don't imagine this process needs spelling out, as it doesn't seem to be in other documents; but if it's necessary to detail it, FHISO documented it in our ELF Serialisation draft, sections 3.1 and 3.2, which I'm sure could be pinched.
The same logic suggests not having a version media type parameter – it can be determined by inspection of the header, just as it can in HTML.
@rsmith-fhiso Yes I completely agree with that logic. The "Encoding considerations" answer should mention the charset can be determined from the payload (i.e., option A), and the "Interoperability considerations" answer should mention the media type should be usable with older versions and the version can be determined from the payload.
@dthaler:
The "Encoding considerations" answer should mention the charset can be determined from the payload (i.e., option A), and the "Interoperability considerations" answer should mention the media type should be usable with older versions and the version can be determined from the payload.
I think you need to go slightly further. RFC 6657, section 3 says:
Regardless of what approach is chosen, all new "text/*" registrations MUST clearly specify how the charset is determined
I think there are two options.
1. The charset parameter must not be used.
2. charset – this parameter may be used for compatibility with non-standard GEDCOM, for example to support GEDCOM with non-standard character encodings, or where the declared encoding in the GEDCOM header is incorrect. It should not be used with compliant GEDCOM 5.5 or later files. The encoding declared in this parameter overrides the embedded encoding declared in the GEDCOM header. The parameter has no default value.
There clearly is some value to 2 over 1, and if it were my decision, I think I'd choose that option, but I'd understand if this is not the way FamilySearch wants to go.
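For what it's worth, the override rule in the second option is small enough to sketch in Python. The function name `effective_charset` is hypothetical, and `header_charset` stands for whatever encoding was determined from the GEDCOM header itself.

```python
from typing import Optional


def effective_charset(charset_param: Optional[str], header_charset: str) -> str:
    """Apply the second option: an explicit charset parameter overrides the header.

    With no parameter (there is no default value), the encoding found in the
    GEDCOM header is used unchanged.
    """
    return charset_param if charset_param else header_charset


if __name__ == "__main__":
    print(effective_charset(None, "ANSEL"))
    print(effective_charset("windows-1252", "ANSEL"))
```

Under the first option there would be nothing to resolve: the parameter would be ignored or rejected and the header value always used.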
I appreciate both @dthaler's suggestion to register a media type for v7.x and @rsmith-fhiso's suggestion to register a media type compatible with multiple GEDCOM versions. Both seem good for different use-cases. I thus propose that we register three different media types:
1. text/ged defining the level+tag system.
   This would be similar in spirit to text/xml.
   We don't have a stand-alone spec for this yet. Each GEDCOM spec has had a chapter devoted to it, but each has included some additional restrictions in that chapter that other versions have not had, such as ever-changing limits on cross reference length (no limit in 4.0 or 7.0, 15 characters in 5.0, 22 characters in 5.5, etc). Perhaps FHISO could create a general "obeyed by all GEDCOM, regardless of version and form" spec?
2. application/gedcom+ged defining the family history file format defined in the current spec.
   This would be similar in spirit to application/xhtml+xml or image/svg+xml.
   We'd need to decide if we want this to be application/gedcom+ged for all HEAD.GEDC.VERS-identified official GEDCOM versions (i.e. 5.0 and beyond) or application/gedcom7+ged for just 7.x versions. There are pros and cons to each. I notice that image/svg+xml has some references to the 1.1 spec and some to a redirecting URL which currently points to the 2.0 candidate release, implying a mixed approach could also work.
3. application/gedzip+zip defining the family history and associated media bundle file format defined in the GEDZip section of the current spec.
   This would be similar in spirit to application/epub+zip.
   I think this could be unversioned, particularly if we define the zip as containing "all local URI references" as opposed to "FILE payloads" so that if a future version allows something like "EXTERNAL_SCHEMA <url>" we're still covered.
Obviously, this proposal is more work than doing fewer media types, but I think it could still be worthwhile for the flexibility it brings.
To be clear, I was not suggesting that we have separate media types for 7.x vs earlier versions (I just asked the question). I agree with @rsmith-fhiso that the same media type could be used independent of version. If @tychonievich wants to have separate media types, then that's possible and I have no objection, though it's not my preference.
The text/xml and application/xhtml+xml analogy seems like a good comparable to me, so I guess there are arguments both ways between text/ and application/ for these. I suspect we could get registration approved either way as long as the interoperability considerations question is answered well in the submission.
Discussed 2021-06-29 While it would be nice to have a media type registration, it is not on the short-term work queue for the FamilySearch GEDCOM 7 Steering Committee. If FHISO or others want to work on this, they may.
The FHISO draft on this is at https://fhiso.org/TR/gedcom-mediatype
Discussed 2021-08-11 Various conversations have changed this issue's priority. FamilySearch now plans to jointly-author version-agnostic specs for GEDCOM and GEDZIP with FHISO and use them in media-type submissions to the IANA.
We currently say:
Fragment Identifier: not used
In light of https://gedcom.io/techfaqs/#how-do-i-link-to-individual-structures-within-a-familysearch-gedcom-file do we think that is still correct?
The IANA expert raised this issue:
Use of any UTF-16 means that the encoding is "binary". (Also, UTF-16 can't be used by any text/ media type, so if use of UTF-16 is desirable, I think this needs to be application/.)
So we can either: a) Update the request to specify that it is not legal to use UTF-16 with the text/vnd.familysearch.gedcom media type, or b) Switch to application/vnd.familysearch.gedcom
The UTF-16 restrictions are elaborated on in https://datatracker.ietf.org/doc/html/rfc3023
The IANA expert raised this issue:
Use of any UTF-16 means that the encoding is "binary". (Also, UTF-16 can't be used by any text/ media type, so if use of UTF-16 is desirable, I think this needs to be application/.)
So we can either: a) Update the request to specify that it is not legal to use UTF-16 with the text/vnd.familysearch.gedcom media type, or b) Switch to application/vnd.familysearch.gedcom
Version 7 and Change log say "UTF-8 is now the only permitted character encoding"
Version 7 and Change log say "UTF-8 is now the only permitted character encoding"
Yes, but this mediatype is for all GEDCOM versions, not just 7.
Am I reading the RFCs correctly that UTF-16 is not allowed because the byte string 0A 0D does not encode a line break? If that's the case, GEDCOM's exceptions are:
That said, I haven't found any 1 CHAR UNICODE files in the wild (only in hand-crafted test files). 5.5.1 added UTF-8 and I find many of those.
I think we thus have three options:
1. text/vnd.familysearch.gedcom and remove reference to UTF-16
2. application/vnd.familysearch.gedcom instead
3. text/vnd.familysearch.gedcom and application/vnd.familysearch.gedcom, where the second is a hoop-jump to deal with RFC 2046's rejection of UTF-16 and EBCDIC as acceptable character sets

Discussed 2021-10-19 Decided we will register text/vnd.familysearch.gedcom and add a note in it saying that UTF-16 should be converted to UTF-8 prior to transmission.
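A sender following that note might transcode along these lines. This is a minimal Python sketch, not part of the registration: the function name is hypothetical, UTF-16 is recognised only by a byte order mark or by a NUL-padded leading `0`, and rewriting the `1 CHAR UNICODE` line to `1 CHAR UTF-8` is an added assumption so that the header still matches the bytes actually sent.

```python
import codecs


def to_utf8_for_transmission(data: bytes) -> bytes:
    """Convert a UTF-16 GEDCOM payload to UTF-8; return other payloads unchanged."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")  # the BOM selects byte order and is dropped
    elif data[:2] == b"0\x00":
        text = data.decode("utf-16-le")
    elif data[:2] == b"\x000":
        text = data.decode("utf-16-be")
    else:
        return data  # not UTF-16: leave as-is

    # Assumption: keep the header consistent with the new encoding.
    text = text.replace("1 CHAR UNICODE", "1 CHAR UTF-8", 1)
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = "0 HEAD\n1 CHAR UNICODE\n0 TRLR\n".encode("utf-16")
    print(to_utf8_for_transmission(sample))
```

ANSEL and other single-byte payloads pass through untouched here; converting those would need a separate mapping.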
"application/zip" is a media type for ZIP files in general.
However, when additional constraints are specified, as GEDZIP does, it is common practice to define a new media type (for one of many examples, see application/epub+zip, and search for zip in the media types registry for many others).
Since GEDZIP also defines a new file extension, .gdz, instead of simply reusing .zip, it is just as important to register a more specific media type for GEDZIP and have it point to the Gedcom7 spec as the specification for it.
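One practical effect of a dedicated type is that tools can map the extension to it. Here is a small Python sketch using the standard `mimetypes` module; the media type names are the ones floated in this thread (`application/gedcom+zip` in particular is only a proposal here and should be treated as a placeholder), and a private `MimeTypes` instance is used so the global table is untouched.

```python
import mimetypes

# Private registry so the process-wide mimetypes table is not modified.
registry = mimetypes.MimeTypes()

# Names taken from the discussion; the GEDZIP one is a placeholder.
registry.add_type("text/vnd.familysearch.gedcom", ".ged")
registry.add_type("application/gedcom+zip", ".gdz")


def media_type_for(filename: str) -> str:
    """Return the media type for a filename, or a generic fallback."""
    guessed, _encoding = registry.guess_type(filename)
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    print(media_type_for("family.ged"))
    print(media_type_for("family.gdz"))
```

Without the extra registration, a `.gdz` file would have no mapping of its own and would be served as a generic binary.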