DataONEorg / object-formats

DataONE Object Formats controlled vocabulary
Apache License 2.0

ESRI Shapefile (zipped) #3

Closed mbjones closed 3 years ago

mbjones commented 3 years ago

Format Metadata

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

Format description

Describe why a new format is needed, including items such as where the format type has been encountered, what software produces it, and what software can read it.

This is for a zipped shapefile directory following the specification for the ESRI Shapefile format (http://en.wikipedia.org/wiki/Shapefile), a common format for representing vector geospatial data that is defined in https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf. Shapefiles are unusual because the format specification requires three mandatory files (.shp, .shx, and .dbf) and allows several optional files, all of which share the same basename, must be in the same parent directory, and collectively constitute the "shapefile" dataset. So the individual file with the .shp extension is incomplete without the other files in the directory that together make up a shapefile dataset. Typically, this directory is zipped for exchange (so the zipped directory often has the .zip extension). In DataONE, many of these zipped shapefiles are present but typed as generic zip files, and so are not recognizable as the more specialized shapefile variant.
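
To make the "three mandatory files sharing a basename" rule concrete, here is a minimal Python sketch of how a client might check whether a zip archive holds a complete shapefile dataset. The function name `shapefile_datasets` is hypothetical and not part of any DataONE tooling. The check looks only at file names, so it assumes the extensions are trustworthy and does not validate file contents.

```python
import zipfile
from pathlib import PurePosixPath

# The three files the shapefile specification treats as mandatory.
MANDATORY = {".shp", ".shx", ".dbf"}


def shapefile_datasets(zip_source):
    """Return the sorted 'directory/basename' stems that have all three
    mandatory parts inside the given zip (a path or a file-like object).

    Hypothetical helper for illustration; it inspects names only.
    """
    found = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            ext = path.suffix.lower()
            if ext in MANDATORY:
                # Group parts by directory plus basename, since all parts of
                # one dataset must share both.
                stem = str(path.with_suffix(""))
                found.setdefault(stem, set()).add(ext)
    return sorted(stem for stem, exts in found.items() if exts == MANDATORY)
```

An archive for which this returns exactly one stem would be a candidate for the proposed format. An empty result suggests an ordinary zip file, and more than one stem means the archive bundles several shapefiles.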

In this proposal, I suggest that we create a format for zipped shapefiles that allows this specialized variant of zip files to be recognized and registered as such. This identifier would only be used for objects that represent a zipped directory containing the files that constitute a dataset in ESRI Shapefile format, and would not be used for the individual file components of such a dataset (which each would have different types, and could be the subject of another proposal). The individual subcomponents of a Shapefile have the following assigned Media types:

The Media type of a zipped shapefile is unclear from the specification. My conclusion is that it is best to give it the media type application/zip, and rely on the more specific formatId to differentiate these from other arbitrary zip files.

This format was first requested in Redmine Issue 6883 in 2015, and has been needed for a while.

Specification / Namespace documentation

Provide the location(s) of the documentation of the format specification or the namespace for the format or vocabulary.

Checklist

Considerations

mbjones commented 3 years ago

@jeanetteclark @laijasmine @datadavev @srearl @twhiteaker @taojing2002 @csjx and others.... please comment on how this proposed format identifier for zipped shapefiles looks to you. There is an associated PR #4 with the exact XML, which boils down to adding:

    <objectFormat>
        <formatId>application/x-shapefile-zipped</formatId>
        <formatName>ESRI Shapefile (zipped)</formatName>
        <formatType>DATA</formatType>
        <mediaType name="application/zip"/>
        <extension>zip</extension>
    </objectFormat>

datadavev commented 3 years ago

Format suggestion looks OK. Agree on the lack of mechanism for media type to indicate file types contained within a zip. https://tools.ietf.org/html/rfc6839#section-3.6 provides a suggestion, but not for multiple file types within a zip.

Refs:

twhiteaker commented 3 years ago

I agree with the format name indicating shapefile while the mediaType is a generic zip.

ESRI goes by Esri these days, so would it make sense to use that casing?

Before this PR, the main reference I had for this type was the EU ref that @datadavev cited, in which they call it x-shapefile. The shapefile addition in the current PR uses x-shapefile-zipped, and is the only example in the XML of "-zipped" appearing in the formatId. This is fine with me, if "-zipped" will be used by convention for future additions when a format is composed of more than one file type collected in a zip archive. If that's the case, perhaps that convention should be documented somewhere in this repo so that future contributors are aware of it.

twhiteaker commented 3 years ago

Should the format indicate that only one shapefile should be in the zip file?

srearl commented 3 years ago

I like the format. Agree also with @twhiteaker's suggestion that, if possible, the format should specify that the zip contains a single shapefile (or would that be a best practice?). This may be rare to the point of being a non-issue but thinking of possible scenarios, I guess a separate format would be needed for shapefiles coalesced into a different format (e.g., gz)?

mbjones commented 3 years ago

We could use application/x-shapefile to match the format described at https://inspire.ec.europa.eu/media-types/application/x-shapefile, which seems to be congruent. I had added -zipped to indicate that it is not the raw shapefile per se, but am happy to drop that part of the name if people like application/x-shapefile better. Their statement that it is superseded by application/vnd.shp is not correct, because that media type refers to only the .shp file, and not the others like .shx, and not in a zipped container.

I like the suggestion in the first link from @datadavev to use media type suffixes for compound types. If we did that, we could make the media type be application/x-shapefile+zip, or it could even be application/vnd.shp+zip which would be accurate (although it ignores the presence of other files in there like .shx files). IANA specifically recommends NOT to use the x- experimental types anywhere, and so using application/x-shapefile as the formatId and application/vnd.shp+zip as the media type could be a good compromise.

And yes, let's change the capitalization of Esri.

datadavev commented 3 years ago

Note that the INSPIRE registry indicates x-shapefile is superseded by vnd.shp, so application/vnd.shp+zip is perhaps appropriate to indicate that a zip file contains the components of a shapefile.

twhiteaker commented 3 years ago

I'd be comfortable with either approach, as long as the approach uses a pattern that we can reuse for similar cases. For example, once we have shapefile sorted out, I'll have a hankering for adding geodatabase (a zipped folder of GIS files comprising a file-based database) to the list.

mbjones commented 3 years ago

OK, sounds good. I updated the metadata in the issue description above to reflect the use of application/vnd.shp+zip for both the formatId field and the mediaType field. I also updated PR #4 to reflect this change, and merged it into the develop branch. So, last call for any comments or changes -- feel free to speak up if something doesn't seem quite right to you (we live with these decisions for quite some time....). Thanks.

twhiteaker commented 3 years ago

Looks good to me.

mbjones commented 3 years ago

For the record, the final decision on the format is:

    <objectFormat>
        <formatId>application/vnd.shp+zip</formatId>
        <formatName>Esri Shapefile (zipped)</formatName>
        <formatType>DATA</formatType>
        <mediaType name="application/vnd.shp+zip"/>
        <extension>zip</extension>
    </objectFormat>

This will go into the next merge of the formats vocabulary.
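
As a rough illustration of how a client could read an entry like the one above, the following Python sketch parses the fragment with the standard library. It assumes the fragment stands alone without XML namespaces, which may not match the layout of the full vocabulary file. The helper name `read_object_format` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The entry exactly as recorded in the final decision above.
ENTRY = """
<objectFormat>
    <formatId>application/vnd.shp+zip</formatId>
    <formatName>Esri Shapefile (zipped)</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/vnd.shp+zip"/>
    <extension>zip</extension>
</objectFormat>
"""


def read_object_format(xml_text):
    """Return the fields of a single <objectFormat> fragment as a dict.

    Hypothetical helper; assumes un-namespaced elements as shown in ENTRY.
    """
    node = ET.fromstring(xml_text.strip())
    media = node.find("mediaType")
    return {
        "formatId": node.findtext("formatId"),
        "formatName": node.findtext("formatName"),
        "formatType": node.findtext("formatType"),
        # The media type is carried in the "name" attribute, not element text.
        "mediaType": media.get("name") if media is not None else None,
        "extension": node.findtext("extension"),
    }
```

Calling `read_object_format(ENTRY)` yields a dict in which `formatId` and `mediaType` are both `application/vnd.shp+zip`, reflecting the decision to use the same string for both fields.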