managing geospatial complexity

slawler commented 3 years ago

https://github.com/TexasDIS/metadata/blob/main/controlled_terms/object_type_terms.csv

A few things that come to mind when thinking about geospatial complexity in terms of grouping/tagging:

SHP is listed as a Vector Layer, but it is really a single file extension that, when combined with other files of the same name in the same directory (e.g. .shx, .dbf), represents a Shapefile. When read into a GIS program, the contents of a Shapefile represent a single table with a geometry column, plus geospatial reference information (stored in a separate file with a .prj extension, which is unfortunately the same extension as a HEC-RAS project file).
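To make the "one extension, several files" point concrete, here is a minimal sketch that groups file names by shared stem and keeps only the sets that include the three mandatory Shapefile parts (.shp, .shx, .dbf). The function name `group_shapefiles` and the example paths are hypothetical, and the list of sidecar extensions is deliberately not exhaustive.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Sidecar extensions considered here (illustrative, not exhaustive).
SHAPEFILE_PARTS = {".shp", ".shx", ".dbf", ".prj"}
# The three parts a Shapefile needs in order to be readable.
REQUIRED_PARTS = {".shp", ".shx", ".dbf"}


def group_shapefiles(filenames):
    """Group file names by directory + stem; keep complete Shapefile sets."""
    groups = defaultdict(set)
    for name in filenames:
        path = PurePosixPath(name)
        ext = path.suffix.lower()
        if ext in SHAPEFILE_PARTS:
            groups[str(path.with_suffix(""))].add(ext)
    # A lone .prj (e.g. a HEC-RAS project file) never forms a complete set.
    return {
        stem: sorted(exts)
        for stem, exts in groups.items()
        if REQUIRED_PARTS <= exts
    }


# Hypothetical listing: one Shapefile plus an unrelated .prj file.
files = [
    "data/streets.shp",
    "data/streets.shx",
    "data/streets.dbf",
    "data/streets.prj",
    "model/project.prj",
]
shapefiles = group_shapefiles(files)
```

In this sketch the .prj ambiguity is resolved only by context: a .prj counts as part of a Shapefile when a matching .shp, .shx, and .dbf sit next to it.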

This type of data is broadly named vector data to differentiate it, in a geospatial context, from raster data. Again, unfortunate, because vector has a very different meaning mathematically. Not saying anything new here, just writing it down to provide context for the next items:

GDB is listed as a Database Layer, but it is itself (on a Unix system) a folder filled with other files (mostly binary). In other words, it is a collection of files, similar to a Shapefile. Regarding content, the GDB is like a collection of Vector, Tabular, and sometimes Raster datasets, all of which may be called Layers, both colloquially and by definition in a geospatial setting.

The NetCDF is actually pretty similar in spirit to the GDB in that it contains tables (or Groups) which may or may not have geometry associated. HDF, not listed, also falls into this category. In either case, these are both single files, identifiable (usually) by an extension. HEC-RAS uses the HDF data format to store all of the types of information listed in the preceding paragraph.

All three of the datasets referenced here are Binary file(s), though categorized differently by term.
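As a rough illustration of how differently these three are laid out on disk, the sketch below labels a path by its storage form. The labels, the function name `storage_form`, and the extension list are assumptions made for this example; they are not the controlled terms in the CSV.

```python
from pathlib import Path

# Extensions treated as single-file containers here (an assumed, partial list).
SINGLE_FILE_CONTAINERS = {".nc", ".hdf", ".h5", ".hdf5"}


def storage_form(path):
    """Return a hypothetical label describing how a dataset sits on disk."""
    p = Path(path)
    ext = p.suffix.lower()
    if ext == ".gdb" and p.is_dir():
        # A file geodatabase is a folder holding many (mostly binary) files.
        return "directory container"
    if ext in SINGLE_FILE_CONTAINERS and p.is_file():
        # NetCDF and HDF keep all groups/tables inside one file.
        return "single-file container"
    if ext == ".shp" and p.is_file():
        # A Shapefile is one logical table spread across sibling files.
        return "multi-file dataset"
    return "unknown"
```

The check relies on both the extension and whether the path is a file or a directory, since the extension alone does not say which of the three layouts applies.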

Not offering any answers here, but trying to put some issues out to consider as we move forward developing a context to nail this down.

ajdabrowski commented 3 years ago

Noted, and thank you for sharing your expertise. We'll review as we continue refining terms; it's in the nature of developing these to encounter overlap and frustrating mismatches in specificity (e.g. binary vs database). The initial aim here is to start setting some controlled terms (for you to respond to) that can be used when describing a file or set of files as either a "Layer" or an "Artifact", and to begin identifying potential preferred file extensions. As we find them to be lacking, we will need to improve them and reassess their value.

slawler commented 3 years ago

Perfect, this is no easy task for sure, and getting it right will require iteration, as you note. Looking forward to contributing where appropriate as this effort advances.

brentporter commented 3 years ago

Let's discuss geodatabases first. Typically in GIS circles, the data is separated very broadly into Vector data and Raster data. Those in the field know what that means, so what we lose in precision we gain in 'industry knowledge'. That said, what do we do here? Well, from a metadata specification standpoint, I suggest keeping geodatabase as layer. Implementation-wise it is a folder, but the primary means of access will be within GIS software using geodatabase-provided APIs/hooks. This isn't to say other means don't exist to access the data, as it is indeed in a folder structure, but I don't want perfect to be the enemy of functional here. We want to provide the metadata for TDIS as a means of organizing, discussing, describing and accessing data. Geodatabases as layers, I think, satisfy most of those needs.

Shapefiles have lots of historical baggage, but the format is, again, well known to GIS folks. It is also a collection of files, or at least it is that collection of files that gives it the most value. So again, I err on the side of GIS users understanding how to use/access the data. I don't believe GIS professionals would be confused by shapefiles as layers (as defined by our specification). Intuitive use of and access to the system is the main goal I think TDIS is striving for here.

netCDF provides a good counterpoint to geodatabases. It is a format that can store gridded spatial data but is often used outside or at least in parallel to GIS circles. The data that can be stored within is a staggering superset of spatial data, along with many other types. Binary and Artifact make sense for metadata because it has so many uses outside of geospatial.

Something that could help for other users or anyone unfamiliar with file formats would be an easy to use, accessible help system. Leaning on that to help provide additional background will be even more important when we start exposing models and the ability to change parameters, run ensembles, etc.

Deciding the Use Cases for the metadata will surely offer up more insight into this. And as those use cases get documented we can continue to revisit these issues to ensure we are staying on target.

slawler commented 3 years ago

As I understand the framework, a Data Layer is a unit level item:

Data layers have intrinsic spatial characteristics and include (but are not limited to) streets, city boundaries, river gages, and imagery with geospatial information contained within files or made available via service endpoints. Features, as a part of a data layer, are a record-by-record description of each independent “row” in a geospatial attribute table or database.

I think what is causing the friction is the last line in this definition: fundamentally, a database is a collection of tables, and as such cannot be compared to a table. A database may include the same table, but cannot be defined in the same way.

Similarly, a Shapefile is essentially a single table meeting the Data Layer definition (a table with n features). A GDB, on the other hand, would be composed of multiple Data Layers, where each Data Layer (streets, river gages) has n features.

As an example, I'm not sure how we can compare a Shapefile with a streets dataset (let's say a LineString type) with a GDB that has the same streets dataset, but also a river gages dataset (MultiPoint).
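The same comparison as a minimal sketch, assuming a Data Layer is a single table of n features and a GDB is a collection of such tables. The class names `DataLayer` and `LayerCollection`, the feature records, and the name "example.gdb" are all hypothetical and not part of the TDIS specification.

```python
from dataclasses import dataclass, field


@dataclass
class DataLayer:
    """One table with a single geometry type and n features (rows)."""
    name: str
    geometry_type: str
    features: list = field(default_factory=list)


@dataclass
class LayerCollection:
    """A container (e.g. a GDB) holding several Data Layers."""
    name: str
    layers: list = field(default_factory=list)


streets = DataLayer("streets", "LineString", [{"id": 1}, {"id": 2}])
gages = DataLayer("river_gages", "MultiPoint", [{"id": 10}])

# A Shapefile maps onto exactly one Data Layer.
shapefile = streets
# A GDB holds the same streets layer plus a second, unrelated layer.
gdb = LayerCollection("example.gdb", [streets, gages])


def layer_names(item):
    """List the Data Layer names reachable from a layer or a collection."""
    if isinstance(item, DataLayer):
        return [item.name]
    return [layer.name for layer in item.layers]
```

Under these assumptions the Shapefile and the GDB are different kinds of object: one is a layer, the other contains layers, which is why a single definition struggles to cover both.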

Hopefully that makes sense, throwing in an image to help clarify:

brentporter commented 3 years ago

I see where the pain point is now - thanks for the clarification. We will definitely have to rework that part of the description. Thanks Seth!

TexasDIS / metadata

managing geospatial complexity #3