clnsmth / soso

For creating Science On Schema.Org (SOSO) markup in dataset landing pages to improve data discovery through search engines.
https://soso.readthedocs.io/
MIT License

schema:encodingFormat Not Included for Unrecognized Data Entities #198

Open clnsmth opened 1 month ago

clnsmth commented 1 month ago

Issue

The deployment of the EML strategy (soso v0.1.0) to data packages in the EDI data repository has resulted in harvesting reports from Google with the following warning:

Missing field "encodingFormat" (in "distribution")
This is a non-critical issue. Items with these issues are valid, but could be presented with more features or be optimized for more relevant queries

Cause

This issue arises because the Python standard library's mimetypes module cannot map some file extensions to MIME types: in its default (strict) mode it recognizes only types registered with IANA. Examples of unrecognized file extensions from the report include .R (R programming language), .qmd (Quarto), .GTF (Gene Transfer Format), and files with no extension.
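The lookup failure can be reproduced with a minimal sketch like the one below. It assumes a fresh mimetypes.MimeTypes instance, so that only Python's built-in table is consulted and system mime.types files do not affect the result. The file names are made up for illustration, and how soso itself calls mimetypes may differ.

```python
import mimetypes

# A fresh instance with no extra files uses only Python's built-in table,
# so results do not depend on mime.types files installed on the system.
db = mimetypes.MimeTypes(filenames=())

# Hypothetical file names mirroring the extensions named in the report.
names = ["table.csv", "analysis.R", "report.qmd", "genes.GTF", "README"]

for name in names:
    # guess_type returns a (type, encoding) tuple; type is None when the
    # extension cannot be mapped.
    mime_type, _encoding = db.guess_type(name)
    print(f"{name}: {mime_type}")
```

Any entry printed as None here would leave schema:encodingFormat unset for that file.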

Potential Solutions

  1. Schema.org Recommendation: Schema.org suggests using a relevant URL (e.g., a defining webpage, Wikipedia, or Wikidata entry) to indicate unregistered or niche encoding and file formats. This can be achieved by extending the mimetypes database with mimetypes.MimeTypes.read, which loads extension mappings from a mime.types-format file. Passing strict=False registers them as non-standard types, so lookups must also pass strict=False to see them.

  2. Declare "Unknown" Format: We could declare unknown file formats as "unknown," though this is an arbitrary value and could lead to mixed interpretations.

  3. Do Nothing: We could choose not to address this issue immediately and wait for niche file formats to be registered with IANA. In the meantime, the schema:description property can provide human-readable descriptions.
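The first option can be sketched as follows. This is an illustration under stated assumptions, not the soso implementation: the names NONSTANDARD_FORMATS, build_format_db, and get_encoding_format are hypothetical, the URLs are example choices rather than agreed-upon identifiers, and MimeTypes.add_type is used in place of MimeTypes.read so that the example needs no external file.

```python
import mimetypes

# Hypothetical seed list mapping lowercase extensions to a URL that
# identifies the format. The URLs are illustrative, not defined by soso.
NONSTANDARD_FORMATS = {
    ".r": "https://en.wikipedia.org/wiki/R_(programming_language)",
    ".qmd": "https://quarto.org/",
    ".gtf": "https://en.wikipedia.org/wiki/Gene_transfer_format",
}


def build_format_db(extra_formats=NONSTANDARD_FORMATS):
    """Return a MimeTypes database extended with non-standard formats."""
    db = mimetypes.MimeTypes(filenames=())
    for ext, url in extra_formats.items():
        # strict=False stores the entry in the non-standard table,
        # leaving the IANA-registered table untouched.
        db.add_type(url, ext, strict=False)
    return db


def get_encoding_format(file_name, db):
    """Return a MIME type or format URL for file_name, or None."""
    # strict=False makes the lookup fall back to the non-standard table.
    fmt, _encoding = db.guess_type(file_name, strict=False)
    return fmt


db = build_format_db()
print(get_encoding_format("analysis.R", db))
print(get_encoding_format("table.csv", db))
```

Extensions are registered in lowercase because guess_type retries the lookup with the lowercased extension, so analysis.R and analysis.r resolve to the same entry. Files with no extension still return None under this approach.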

Preferred Solution

Of these, the Schema.org recommendation seems preferable. The list of unrecognized formats can be seeded from an analysis of the data entities in a data repository, with new formats added iteratively as they are discovered.
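One way to seed such a list is to tally the extensions that the lookup cannot resolve across a repository's file names. The sketch below is a rough outline: count_unrecognized_extensions is a hypothetical helper, and the input is assumed to be a plain list of file names obtained by whatever means the repository offers.

```python
import mimetypes
from collections import Counter
from pathlib import PurePosixPath


def count_unrecognized_extensions(file_names, db=None):
    """Count extensions of file names that have no known MIME type."""
    if db is None:
        # Built-in table only, independent of system mime.types files.
        db = mimetypes.MimeTypes(filenames=())
    counts = Counter()
    for name in file_names:
        mime_type, _encoding = db.guess_type(name, strict=False)
        if mime_type is None:
            # Normalize case so .R and .r are counted together.
            ext = PurePosixPath(name).suffix.lower() or "(no extension)"
            counts[ext] += 1
    return counts


# Hypothetical sample of data entity file names.
sample = ["a.R", "b.r", "c.qmd", "d.csv", "README"]
for ext, n in count_unrecognized_extensions(sample).most_common():
    print(ext, n)
```

Sorting by frequency gives a rough priority order for which formats to add first.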