DILCISBoard / E-ARK-CSIP

E-ARK Common Specification for Information Packages
http://earkcsip.dilcis.eu
Creative Commons Attribution 4.0 International

Should allowed MIME / media types really be restricted to the IANA list? #704

Open prettybits opened 1 year ago

prettybits commented 1 year ago

The METS schema documentation says this about the MIMETYPE attribute:

The IANA MIME media type for the associated file or wrapped content. Some values for this attribute can be found on the IANA website.

Here it is specifically mentioned that some values can be found on the IANA website (because the media type has been formally registered), and the attribute is also optional. I don't understand why this has been made more restrictive in the CSIP specification.

In a concrete case I received a package that contained an .iso file with a MIME type of application/x-iso9660-image. Its subtype is part of the unregistered tree and as such is not listed on the IANA website. (Tika knows about the type, as does shared-mime-info, which lists it as an alias of application/x-cd-image, also not part of the IANA list.)

I don't see a reason to restrict the types of files that can be transported in an information package to only those that have a registered media type and thus propose to relax the requirement to be the same as in METS, i.e. optional and only with a reference to the IANA list.

carlwilson commented 1 year ago

So I'm torn on this one: I agree with the above but believe there is value in having an "authority" for allowed values. Otherwise it becomes a free-for-all where interoperability is desired. There are probably a few compromises/solutions. Two spring to mind:

While the first approach is simple and flexible, the second one is right, IMO. A single authority table doesn't make much sense when you look at different requirements, which are MIME for:

The restriction to IANA types for the different types of metadata files isn't that helpful. Discovering that your descriptive metadata is in an MDL file is unlikely to help much. CSIP26, CSIP40 and CSIP53 would benefit from a stricter approach limiting the choice of metadata types. More flexible requirement conditions might still be needed, e.g. MUST have a well-formed MIME value, SHOULD be in an approved list of rights metadata MIME types. CSIP68 is for content files; splitting it into two is probably the best way to start. That's got a good ramble out of my head.

prettybits commented 1 year ago

Your first suggestion to split the requirement into two sounds like a good way forward to me, ideally paired with support for a more extensive list of MIME types (like from shared-mime-info and/or Tika - not sure about the relation between the two right now - as mentioned previously), as getting validation warnings for unregistered MIME types that are otherwise known doesn't sound too actionable to me.

Making a well-formed MIME string mandatory when the attribute is present also makes sense to me as METS doesn't specify that aspect explicitly (or explicitly enough? One could probably take that as implicit.).
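To make "well-formed" concrete, here is a minimal sketch of such a check. It assumes the bare `type/subtype` form with the `restricted-name` grammar from RFC 6838, section 4.2, and deliberately rejects parameters such as `; charset=utf-8`. The function name is hypothetical and is not part of the CSIP or of any existing validator.

```python
import re

# RFC 6838, section 4.2: a restricted-name starts with a letter or digit,
# followed by up to 126 further characters from a limited set.
_RESTRICTED_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}"
_MEDIA_TYPE = re.compile(_RESTRICTED_NAME + "/" + _RESTRICTED_NAME)


def is_well_formed_media_type(value: str) -> bool:
    """Return True if value is a syntactically valid type/subtype pair.

    This checks syntax only; it does not consult any registry.
    """
    return _MEDIA_TYPE.fullmatch(value) is not None


print(is_well_formed_media_type("application/x-iso9660-image"))
print(is_well_formed_media_type("iso image"))
```

Under this check, an unregistered type such as application/x-iso9660-image passes because its syntax is valid, while a free-text value does not.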

On the point about interoperability I don't think this is really in scope in this particular case. Section 5.1 on general requirements for metadata says

Descriptive metadata and most detailed technical metadata is outside the scope of the CSIP. As such, the CSIP does not provide detailed semantic interoperability between different systems. However, as noted in Section 1.2, implementers are welcome to use the Content Information Type Specifications to achieve interoperability at a more detailed level.

Paired with Section 3.4.1, which states that metadata MUST conform to a standard, that strikes me as sufficient for the purposes of the CSIP. (A strong SHOULD would seem more fitting there, both to match the statement "the use of established and widely used metadata standards is highly recommended" later in that section and because the requirement isn't really validatable.)

I too see value in having an authority or registry of defined MIME types. Restricting allowed values to known types, or to specific subsets thereof, is then the purpose of Content Information Types, as referenced in the quote above. Section 1.2 opens with:

As an interoperability standard, it must be possible to use the CSIP regardless of the type and format of the content users need to handle. At the same time, each individual content type and file format can have specific characteristics which need to be taken into account for purposes of validation, preservation and curation.

which I find very reasonable and basically what I wanted to get at in my initial description.

I'm actually also starting to wonder now why the @sip:FILEFORMAT* family of attributes is only defined for the SIP (and there only for content files) and if it makes sense to seemingly prefer MIME type info over that? Sorry if I'm opening up too many additional questions here as they come to mind.

karinbredenberg commented 1 year ago

We need to go back to the METS standard here. In the Primer ( https://www.loc.gov/standards/mets/METSPrimer.pdf ) on page 34 the definition states: "MIMETYPE (string/O): The IANA MIME media type for the associated file. Some values for this attribute can be found on the IANA website."

prettybits commented 1 year ago

I'm not sure what you're saying here, @karinbredenberg? The part you quoted was also what I opened my initial post with.

karinbredenberg commented 1 year ago

Sorry, I see that I pressed comment too early, so the rest of the reply got lost.

We pointed to the IANA list because of decisions made early on in the creation of CSIP. The suggestion to change that requirement will be brought to the DILCIS Board. I'm leaning towards keeping the requirement as is but changing the wording so that it's the recommended list to use unless something else has been agreed upon.

prettybits commented 1 year ago

OK, thanks for putting this on the agenda!

Opening the allowed values up by making the IANA list only the default recommendation and allowing local overriding agreements would already help. I'd still be worried about validators potentially producing warnings for MIME types not on the comparatively limited IANA list despite the possibility of additional agreements and the existence of larger (well-maintained and known) lists that could at least be referred to officially as well in some way.

As further input to the discussion and to expand on why I think this should also be made optional: I do believe that there can always be situations where data needing to be archived doesn't have a registered/known MIME type on any list (or isn't readily identified by existing tooling!) and marking packages as invalid or holding ingest back while waiting for tool improvements or new MIME type strings to be thought of (and approved) doesn't need to happen in the general case that CSIP is concerned with when it means to support inclusion "regardless of the type and format of the content users need to handle". Of course that would still often be rightfully treated as a preservation risk, but that should be able to be managed over the further lifecycle.
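The relaxed behaviour argued for in this comment can be sketched as follows: the attribute is optional, the IANA list is only the default recommendation, and locally agreed types extend it. This is an illustration under stated assumptions, not how any existing CSIP validator works. `IANA_EXCERPT` is a hypothetical four-entry stand-in for the real registry, and `check_mimetype` is an invented name.

```python
from typing import Iterable, Optional

# Hypothetical excerpt only; a real tool would load the full IANA registry.
IANA_EXCERPT = frozenset(
    {"application/pdf", "application/xml", "image/tiff", "text/plain"}
)


def check_mimetype(value: Optional[str], agreed: Iterable[str] = ()) -> Optional[str]:
    """Return None if there is nothing to report, otherwise a warning message.

    The result is never an error: an unknown type does not invalidate a package.
    """
    if value is None:
        # The attribute is optional, so its absence is not reported.
        return None
    allowed = IANA_EXCERPT | {m.lower() for m in agreed}
    if value.lower() in allowed:
        return None
    return f"warning: MIME type '{value}' is not on the recommended or agreed list"


print(check_mimetype("application/x-iso9660-image"))
print(check_mimetype("application/x-iso9660-image",
                     agreed=["application/x-iso9660-image"]))
```

Returning a warning rather than raising keeps ingest from being blocked, while still surfacing the unknown type as a potential preservation risk.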

karinbredenberg commented 11 months ago

The issue was discussed at today's DILCIS Board meeting, and we agreed to make use of the IANA list a recommendation. The wording will be updated.

shsdev commented 9 months ago

Shouldn't the requirements CSIP26, CSIP40, CSIP53, CSIP68 then be changed to "SHOULD" instead of "MUST"? In the updated version it states that the values are "suggested" values.

karinbredenberg commented 7 months ago

The suggestion is:

Board members' acknowledgment of the issue: Tick the box in front of your name to indicate that you have looked at the suggestion.

Voting (Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair.)

Tick the box in front of your name to say yes to the suggestion.

karinbredenberg commented 6 months ago

7 DILCIS Board members have acknowledged the issue. 7 DILCIS Board members agree with the solution.

The PR will be part of the next release of the specifications.