NCEAS / metadig-checks

MetaDIG suites and checks for data and metadata improvement and guidance.
Apache License 2.0

Non-proprietary format check #23

Closed gothub closed 5 years ago

gothub commented 5 years ago

For the check 'NonProprietaryEntittyFormat', what lists should be used for proprietary and non-proprietary formats?

amoeba commented 5 years ago

Sounds like a tough check to get right from the user's perspective. However, I think a good approach would be not to find or create a list of non-proprietary formats, since there are so many, but instead to look for common proprietary formats such as MS Excel and MS Access files and fail the check if those are found.

When the check passes, the message could indicate that the passing grade isn't the same thing as "you have not used any proprietary formats". What do you think?
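A minimal sketch of that denylist approach, in Python. The function name, the status strings, and the two media-type strings are assumptions made for illustration; they are not taken from the metadig-checks code or from a vetted list.

```python
# Illustrative denylist only. The keys are assumed media-type strings,
# not a vetted list of proprietary formats.
PROPRIETARY_FORMATS = {
    "application/vnd.ms-excel": "MS Excel",
    "application/x-msaccess": "MS Access",
}


def check_formats(format_ids):
    """Return (status, message) for a list of entity format identifiers."""
    # Collect the human-readable names of every denylisted format present.
    found = sorted(
        {PROPRIETARY_FORMATS[f] for f in format_ids if f in PROPRIETARY_FORMATS}
    )
    if found:
        return "FAILURE", "Known proprietary formats found: " + ", ".join(found)
    # A pass only means nothing on the denylist was seen, so say so explicitly.
    return "SUCCESS", (
        "No known proprietary formats were found. "
        "This does not guarantee that every format is non-proprietary."
    )
```

The pass message is deliberately hedged, since an unlisted proprietary format would still pass.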

gothub commented 5 years ago

Sounds reasonable. Finding the list of proprietary formats may just mean assembling it ourselves.

amoeba commented 5 years ago

Yeah, maybe the NCEAS Data Team would be able to scrape together a short list of what they usually deal with, and we developers could fill in any others they don't have much contact with.

mbjones commented 5 years ago

We could start by going through the DataONE formatId list and marking each format as proprietary, open, or unknown/mixed. That should cover all of the files we have in DataONE.
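One way such a marking could be stored is a three-state lookup in which any formatId nobody has reviewed yet defaults to unknown/mixed. This is only a sketch: the `Status` enum, the `MARKINGS` table, and the two example entries are hypothetical.

```python
from enum import Enum


class Status(Enum):
    PROPRIETARY = "proprietary"
    OPEN = "open"
    UNKNOWN = "unknown/mixed"


# Hypothetical hand-maintained markings; a real table would be built by
# reviewing every formatId in the DataONE list.
MARKINGS = {
    "text/csv": Status.OPEN,
    "application/vnd.ms-excel": Status.PROPRIETARY,
}


def mark(format_id):
    """Look up a formatId, treating anything unreviewed as unknown/mixed."""
    return MARKINGS.get(format_id, Status.UNKNOWN)
```

Defaulting to unknown/mixed keeps general types such as application/octet-stream from being counted as either open or proprietary by accident.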

gothub commented 5 years ago

Here is a list of formats from http://cn.dataone.org/cn/v2/query/solr/?q=formatType:DATA&facet=true&facet.field=formatId&facet.mincount=1&rows=0

Please mark formats that are proprietary:
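For anyone who wants to regenerate the list, the facet query above could be scripted roughly as follows. This is a sketch under assumptions: the `wt=json` parameter is added here so the response can be read as JSON, the parser assumes Solr's flat `[value, count, value, count, ...]` facet layout, and both function names are made up for illustration.

```python
from urllib.parse import urlencode

BASE = "http://cn.dataone.org/cn/v2/query/solr/"


def facet_url():
    """Build the formatId facet query used above, asking for JSON output."""
    params = [
        ("q", "formatType:DATA"),
        ("facet", "true"),
        ("facet.field", "formatId"),
        ("facet.mincount", "1"),
        ("rows", "0"),
        ("wt", "json"),  # assumption: not part of the original query
    ]
    return BASE + "?" + urlencode(params)


def facet_counts(response):
    """Turn the flat facet list into a {formatId: count} dictionary.

    `response` is the already-decoded JSON body of the query.
    """
    flat = response["facet_counts"]["facet_fields"]["formatId"]
    # Even positions hold formatIds, odd positions hold their counts.
    return dict(zip(flat[0::2], flat[1::2]))
```

Fetching the URL is left out on purpose, so the sketch makes no assumption about the HTTP client.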

mbjones commented 5 years ago

Note that some of those are general types (e.g., application/octet-stream, text/xml) that may or may not be used for defining proprietary application types. We really should be more specifically typing those when possible.

Also, Matlab's most recent format is an HDF5 file format, and so is technically not proprietary, but their earlier .mat formats were proprietary, so it varies by version for Matlab. HDF5 seems to be missing from the list.

mbjones commented 5 years ago

Also, the full list of registered formats is here:

https://cn.dataone.org/cn/v2/formats

amoeba commented 5 years ago

(@mbjones I added a ticket for adding HDF 4 and 5 to DataONE's Object formats list.)

mbjones commented 5 years ago

Thanks. And I note that the object format list has 5 formats for Matlab, only one of which would be non-proprietary (application/MATLAB-v7.3).

<objectFormat>
    <formatId>application/MATLAB</formatId>
    <formatName>MATLAB programming language script</formatName>
    <formatType>DATA</formatType>
    <mediaType name="text/x-matlab"/>
    <extension>m</extension>
</objectFormat>
<objectFormat>
    <formatId>application/MATLAB-v7.3</formatId>
    <formatName>Mathworks MATLAB version 7.3 (R2006b or later) binary file - HDF5 compatible</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
<objectFormat>
    <formatId>application/MATLAB-v7</formatId>
    <formatName>Mathworks MATLAB version 7 (R14 or later) binary file</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
<objectFormat>
    <formatId>application/MATLAB-v6</formatId>
    <formatName>Mathworks MATLAB version 6 (R8 or later) binary file</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
<objectFormat>
    <formatId>application/MATLAB-v4</formatId>
    <formatName>Mathworks MATLAB version 4 binary file</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
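
As a rough illustration of how entries like these could be marked programmatically, the Python sketch below parses two of the entries above and treats only application/MATLAB-v7.3 as non-proprietary, following the comment above. The `<formats>` wrapper element, the `NON_PROPRIETARY` set, and the `mark_matlab` function are assumptions introduced so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the listing above, wrapped in a hypothetical
# <formats> root element so the fragment parses as a single document.
SAMPLE = """<formats>
<objectFormat>
    <formatId>application/MATLAB-v7.3</formatId>
    <formatName>Mathworks MATLAB version 7.3 (R2006b or later) binary file - HDF5 compatible</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
<objectFormat>
    <formatId>application/MATLAB-v7</formatId>
    <formatName>Mathworks MATLAB version 7 (R14 or later) binary file</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/octet-stream"/>
    <extension>mat</extension>
</objectFormat>
</formats>"""

# Per the comment above, only the HDF5-compatible v7.3 format counts as open.
NON_PROPRIETARY = {"application/MATLAB-v7.3"}


def mark_matlab(xml_text):
    """Map each formatId in the XML to 'open' or 'proprietary'."""
    root = ET.fromstring(xml_text)
    marks = {}
    for fmt in root.iter("objectFormat"):
        format_id = fmt.findtext("formatId")
        marks[format_id] = "open" if format_id in NON_PROPRIETARY else "proprietary"
    return marks
```

Keying on formatId rather than on the `mat` extension matters here, because all four binary formats share that extension.
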

gothub commented 5 years ago

Here is the full list from DataONE formats service:

mbjones commented 5 years ago

I marked Raw image files as proprietary. See https://en.wikipedia.org/wiki/Raw_image_format. However, they often contain critical data and metadata that are lost when the images are converted to open formats. Until a universal open Raw format is available, it's probably still beneficial for us to capture raw images in addition to open versions, so I am not sure we should penalize this in the check. Or maybe we do penalize it and just recognize that there's no good alternative right now, as is similarly true of other binary sensor data formats.

gothub commented 5 years ago

Yeah, not sure, but if we flag (FAIL) formats that are not directly usable (that is, formats the end user would have to convert first), then there is an incentive for the data creator to provide a more usable format.

gothub commented 5 years ago

This check is discussed in issue https://github.com/NCEAS/metadig-checks/issues/64