ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

use of schema:encodingFormat for better data-application mapping #149

Open smrgeoinfo opened 3 years ago

smrgeoinfo commented 3 years ago

EarthCube is working on linking data with applications more effectively. Our approach is to register file formats with identifier strings that use standard MIME types as the base string, with MIME parameters to define the file content more explicitly. The problem is that most MIME types, such as application/octet-stream, application/xml, text/csv, text/plain, and many others, don’t get you very far in terms of interoperability. The proposed solution is to add ‘type=’ parameters to MIME types.

Here’s how the linking works:

ECRR (the EarthCube Resource Registry) has:

  1. Registrations for interchange formats; the JSON-LD for these has a /schema:identifier property that is an array of strings that identify files. For instance, LAS has identifier:["http://www.opengis.net/doc/CS/las/1.4","application/octet-stream;type=ASPRS-LAS"] (in the SuAVE viewer, select 'interchange format' in the resource types on the left, and look for 'External Identifier' in the details for the results).
  2. Registrations for applications have a supportingData property, e.g. for the LViz application:
    "supportingData": {
      "@type": "DataFeed",
      "name": "Input Data Type specification",
      "position": "input",
      "encodingFormat": ["http://www.opengis.net/doc/CS/las/1.4",
                         "application/octet-stream;type=ASPRS-LAS",
                         "text/plain; application=esri-asciigrid",
                         "application/vnd.esri-asciigrid",
                         "Point Cloud Data"]},

    The encodingFormat array here should contain a string that matches a format identifier string. Standard MIME types can be used if there is no registered format, and as in the example here, for the purposes of getting demos working, some format label strings (not URIs) from existing registered data (like ‘Point Cloud Data’) are used.

  3. Dataset distributions should have a /distribution/encodingFormat that uses a format identifier matching an input format for a registered application. Note that this means that if different representations of the dataset content are available, each one would be a separate schema:distribution. If there is a direct link that will GET the data in that format, the schema:distribution/schema:contentUrl should be that URL. schema:contentUrl should NOT be used to link to landing pages that require further user input to get the data.
  4. The GeoCODES search client checks to see if there are any registered applications that take 'input' supporting data in one of the formats available for a dataset and shows matches with the search results. Implementation is still alpha, but here's a currently working example search detail page.
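
The matching described in steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the GeoCODES implementation: the function names `normalize` and `matching_applications`, the application `name` key, and the normalization rule (ignore case and whitespace around `;` for MIME-style strings, compare URIs exactly) are all assumptions. Only the LViz `supportingData` values are taken from the example above.

```python
def normalize(fmt):
    """Return a comparable form of a format identifier (assumed rule)."""
    fmt = fmt.strip()
    if "://" in fmt:
        # Treat URIs as opaque identifiers and compare them exactly.
        return fmt
    # MIME-style strings and labels: drop whitespace around ';', ignore case.
    return ";".join(part.strip() for part in fmt.split(";")).lower()


def matching_applications(distribution_formats, applications):
    """Names of applications with an 'input' feed accepting any given format."""
    wanted = {normalize(f) for f in distribution_formats}
    matches = []
    for app in applications:
        feeds = app.get("supportingData", [])
        if isinstance(feeds, dict):  # a single feed may not be wrapped in a list
            feeds = [feeds]
        for feed in feeds:
            if feed.get("position") != "input":
                continue
            accepted = feed.get("encodingFormat", [])
            if isinstance(accepted, str):
                accepted = [accepted]
            if wanted & {normalize(f) for f in accepted}:
                matches.append(app["name"])
                break  # one matching feed is enough for this application
    return matches


# Application record based on the LViz example; the "name" key is hypothetical.
LVIZ = {
    "name": "LViz",
    "supportingData": {
        "@type": "DataFeed",
        "name": "Input Data Type specification",
        "position": "input",
        "encodingFormat": [
            "http://www.opengis.net/doc/CS/las/1.4",
            "application/octet-stream;type=ASPRS-LAS",
            "text/plain; application=esri-asciigrid",
            "application/vnd.esri-asciigrid",
            "Point Cloud Data",
        ],
    },
}

las_hits = matching_applications(["application/octet-stream; type=ASPRS-LAS"], [LVIZ])
csv_hits = matching_applications(["text/csv"], [LVIZ])
```

Here `las_hits` contains LViz because the dataset's format string matches one of the application's input formats, while `csv_hits` is empty: a bare `text/csv` distribution says too little to be linked to the application.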

You can see that the encoding format strings need to be more specific in general than simple MIME types, particularly for the manifold binary formats out there, and generic formats like XML, JSON, NetCDF, or text/csv. Here are some other example format identifiers using the MIME type parameter pattern:

    application/zip;type=corelyzerarchive
    text/plain;type=magic-tsv
    application/hdf5;type=pytables2.0
    application/octet-stream;type=ASPRS-LAS
    application/octet-stream;type=IRIS-SAC
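
Because these identifiers follow ordinary content-type syntax, a generic parser can split them into a base MIME type and parameters. The sketch below uses Python's standard `email.message.Message` class for this; the helper name `split_format` is hypothetical, and nothing here is part of the ECRR or GeoCODES code.

```python
from email.message import Message


def split_format(identifier):
    """Split a format identifier into (base MIME type, parameter dict)."""
    msg = Message()
    msg["Content-Type"] = identifier
    base = msg.get_content_type()          # lowercased type/subtype
    # get_params() returns (name, value) pairs; the first pair is the
    # base type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return base, params


las = split_format("application/octet-stream;type=ASPRS-LAS")
grid = split_format("text/plain; application=esri-asciigrid")
plain_csv = split_format("text/csv")
```

A client that does not recognize the `type` value can still fall back to the base type, which is one argument for building on MIME parameters instead of minting entirely new strings.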

As far as I can tell, this approach is consistent with RFC 2046, which defines the MIME type application/octet-stream, and with the syntax defined for content types by RFC 2045. Applying ‘type’ parameters to other application/, image/ and text/ types probably needs more research, but my investigation so far hasn’t revealed any show stoppers. This is of course not a new problem, nor is it unique to our community. If you have suggestions for a better way to do this, I’d love to hear them!

smrgeoinfo commented 2 years ago

I've compiled a list of more specific file formats (think base format plus a profile for a particular application). It includes the format strings we're using with the EarthCube GeoCODES platform for linking data and applications, and the formats from the DataONE object-format list, deduplicated, with suggested format strings following the pattern suggested above. It's on a new branch connected with issue 149: encodingformat.csv

mbjones commented 2 years ago

@smrgeoinfo this is a nice list that will be very helpful. When the proposed schema:encodingFormat in your list differs from the original from the source (e.g., ECRR, IANA, DataONE, ...) can you include that original string in the table as another column so that we can effectively crosswalk the vocabularies?
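
To make the request concrete, here is a rough sketch of how such a column could be used for a crosswalk. The column names (`encodingFormat`, `source`, `originalString`) and the sample rows are assumptions for illustration only; they are not the actual layout or content of encodingformat.csv. The LAS pairing reuses the two ECRR identifier strings quoted earlier in this issue.

```python
import csv
import io

# Illustrative rows only; the real table may use different columns and values.
SAMPLE = """encodingFormat,source,originalString
application/octet-stream;type=ASPRS-LAS,ECRR,http://www.opengis.net/doc/CS/las/1.4
text/csv,IANA,text/csv
"""


def build_crosswalk(csv_text):
    """Map (source vocabulary, original string) to the proposed encodingFormat."""
    crosswalk = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        crosswalk[(row["source"], row["originalString"])] = row["encodingFormat"]
    return crosswalk


crosswalk = build_crosswalk(SAMPLE)
proposed = crosswalk[("ECRR", "http://www.opengis.net/doc/CS/las/1.4")]
```

Keying on both the source and the original string keeps the lookup unambiguous if two vocabularies happen to use the same string for different formats.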