FamilySearch / GEDCOM

Are extension media types usable with g7:FORM? #252

Closed by dthaler 1 year ago

dthaler commented 1 year ago

The GEDCOM 7 spec says:

FORM (Format) g7:FORM
The media type of the file referenced by the superstructure. This should be a valid media type as
defined by [BCP 13](https://www.rfc-editor.org/info/bcp13). A [registry of media types](https://www.iana.org/assignments/media-types/media-types.xhtml)
is maintained publicly by the IANA.

However, RFC 2045 defines media types that can be registered, and it also defines extension types and subtypes that start with "x-":

     content := "Content-Type" ":" type "/" subtype
                *(";" parameter)
                ; Matching of media type and subtype
                ; is ALWAYS case-insensitive.

     type := discrete-type / composite-type

     discrete-type := "text" / "image" / "audio" / "video" /
                      "application" / extension-token

     composite-type := "message" / "multipart" / extension-token

     extension-token := ietf-token / x-token

     ietf-token := <An extension token defined by a
                    standards-track RFC and registered
                    with IANA.>

     x-token := <The two characters "X-" or "x-" followed, with
                 no intervening white space, by any token>

     subtype := extension-token / iana-token

     iana-token := <A publicly-defined extension token. Tokens
                    of this form must be registered with IANA
                    as specified in [RFC 2048](https://www.rfc-editor.org/rfc/rfc2048).>
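
To make the grammar above concrete, here is a minimal Python sketch that splits a media type into type and subtype and reports whether either part is an x-token. The function names are hypothetical and not part of any GEDCOM tooling. Dropping everything after the first ";" is a simplification, since parameter syntax is not validated here.

```python
def split_media_type(value: str) -> tuple[str, str]:
    """Split 'type/subtype[; parameters]' into (type, subtype).

    Both parts are lowercased because RFC 2045 says matching of
    media type and subtype is always case-insensitive.
    Parameters are discarded, not validated (a simplification).
    """
    essence = value.split(";", 1)[0].strip()
    type_, sep, subtype = essence.partition("/")
    if not sep or not type_ or not subtype:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return type_.lower(), subtype.lower()


def uses_x_token(value: str) -> bool:
    """True if the type or the subtype is 'x-' followed by at least one character."""
    parts = split_media_type(value)
    return any(p.startswith("x-") and len(p) > 2 for p in parts)


print(uses_x_token("application/X-Custom"))  # extension subtype
print(uses_x_token("image/jpeg"))            # no extension token
```

Whether a g7:FORM payload for which `uses_x_token` is true should be accepted is exactly the question this issue raises; the sketch only detects the case.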

BCP 13 covers only registered types, not extension types, as the titles of the documents in BCP 13 indicate. That is why it always says "All registered media types MUST" etc., rather than "All media types MUST".

So section 3 of the GEDCOM spec currently implies, by saying "This should be a valid media type as defined by BCP 13", that only registered types are legal and "x-" types are illegal. However, lowercase "should" is ambiguous, so it is unclear whether that is the intent, especially since <MediaType> is defined in section 2 in a way that is not explicitly limited to registered types:

The media type data type represents the encoding of information in bytes
or characters, as defined in [RFC 2045](https://www.rfc-editor.org/info/rfc2045)
and [registered by the IANA](http://www.iana.org/assignments/media-types/).

The official grammar for media type is given in RFC 2045, section 5.1.

Also, notably, the legal syntax for registered types is more constrained (by RFC 6838) than the syntax for extension types. For example, ~ (tilde) appears to be allowed by the RFC 2045 ABNF but is disallowed in registered names. So I think x-a~b is legal as an unregistered name, but a~b is not legal as a registered name.
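
The tilde example can be checked with a short Python sketch. It assumes my reading of the two grammars: the RFC 2045 section 5.1 `token` rule (any US-ASCII character except SPACE, CTLs, and tspecials) and the RFC 6838 section 4.2 `restricted-name` rule. The helper names are hypothetical.

```python
import re

# RFC 2045 section 5.1 tspecials; a token is one or more printable
# US-ASCII characters (33..126) that are not tspecials.
TSPECIALS = '()<>@,;:\\"/[]?='
TOKEN_CHARS = frozenset(chr(c) for c in range(33, 127)) - frozenset(TSPECIALS)

# RFC 6838 section 4.2 restricted-name: ALPHA or DIGIT first, then up to
# 126 characters from ALPHA / DIGIT / ! # $ & - ^ _ . +
RESTRICTED_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}")


def is_rfc2045_token(s: str) -> bool:
    return len(s) > 0 and all(ch in TOKEN_CHARS for ch in s)


def is_x_token(s: str) -> bool:
    # "X-" or "x-" followed, with no intervening white space, by any token
    return s[:2].lower() == "x-" and is_rfc2045_token(s[2:])


def is_restricted_name(s: str) -> bool:
    return RESTRICTED_NAME_RE.fullmatch(s) is not None


print(is_x_token("x-a~b"))          # tilde is a valid token character
print(is_restricted_name("a~b"))    # tilde is not a restricted-name character
```

Under these assumptions the first check succeeds and the second fails, which matches the asymmetry described above.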

dthaler commented 1 year ago

An argument for allowing extension types is that multimedia records could then reference arbitrary files, rather than only files in a standard format that has a registered media type.