Sane defaults for OSM field widths

Esri / arcgis-osm-editor

ArcGIS Editor for OpenStreetMap is a toolset for GIS users to access and contribute to OpenStreetMap through their Desktop or Server environment.

Apache License 2.0

400 stars 131 forks source link

Sane defaults for OSM field widths #81

Open mboeringa opened 9 years ago

mboeringa commented 9 years ago

This is something I think ESRI may have a closer look at. Currently, the OSM Attribute Selector seems to apply a default width of 100 characters to each and every key extracted from the key/value storage.

I am not actually sure how file geodatabases, or different enterprise geodatabases, handle the storage of the actual data technically and what strains this put on the storage, but purely from a user perspective, it would be desirable to have some sane default widths for most keys. Many keys only have valid values that are much shorter than 100 characters. By using the 100 character default for all keys, attribute tables in ArcMap become unwieldy wide and difficult to browse or get an overview from.

I think there must be a better way to handle this, based on statistics for different keys and the values therein. OSM seems to have quite a landscape of tools and APIs accessing and extracting information from the main OSM key/value database. E.g. TagInfo (http://wiki.openstreetmap.org/wiki/Taginfo/API) seems to offer an API that may offer oppertunities, or maybe Overpass Turbo could help giving answers (http://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide).

What I envision:

Somehow extract sane default widths from some OSM API for all current keys in the database, e.g. by getting the 95%-98% smallest values for each key.
Store these widths as XML or so, to be automatically used and applied by the OSM Attribute Selector each time a new field mapping needs to be made for the storage of a key
If the key is not present in the XML file (e.g. newly introduced key), only then, default to 100 characters (updating of the file could help setting a better default later).

mhogeweg commented 9 years ago

what is the width of those fields in OSM itself? that's what should drive the design of the data structure in the geodatabase. after the import you could shrink field sizes based on the data actually imported.

mboeringa commented 9 years ago

what is the width of those fields in OSM itself? that's what should drive the design of the data structure in the geodatabase. after the import you could shrink field sizes based on the data actually imported.

That is maybe the point, there is no real width in OSM... At the true backend of OSM, it is a key/value storage in PostgreSQL. I don't know the current restrictions, but values were once restricted at 255 characters, at least I saw some mailing list posts from 2009 relating to that figure. This means the only real way to "device" useful defaults for field widths is to base these on real statistics about value length per key.

If you want to dive a little bit deeper into the OSM database stuff, here are links to Wikipedia pages about the Rails application driving the website, and the database. The "Database"

http://wiki.openstreetmap.org/wiki/The_Rails_Port http://wiki.openstreetmap.org/wiki/Rails_port/Database_schema http://wiki.openstreetmap.org/wiki/Database

However, I don't think this will give you more clues as to a solution, as it only illustrates the fact of the key/value storage backend, and thus no available "defaults" that could be copied for this project. That is why I think getting statistics from one of the OSM APIs may be the best way to go, and use those for more realistic defaults.

ajturner commented 9 years ago

Looking at the DB Schema the node_tag:k (key) and value are using Character Varying. However the node_tag in the Rails Model has a validation of length: 255.

So seems like the geodatabase should default to 255 instead of 100.

mboeringa commented 9 years ago

Hi Andrew,

Yes, the current implementation of the OSM Attribute Selector actually truncates any values bigger than 100 characters. The code is in OSMGPAttributeSelector.cs. I think this may have been implemented due to storage concerns?...

E.g. see these two excerpts:

// generate a new attribute field
                                        IFieldEdit fieldEdit = new FieldClass();
                                        fieldEdit.Name_2 = tagKey;
                                        fieldEdit.AliasName_2 = nameOfTag + resourceManager.GetString("GPTools_OSMGPAttributeSelector_aliasaddition");
                                        fieldEdit.Type_2 = esriFieldType.esriFieldTypeString;
                                        fieldEdit.Length_2 = 100;

//...then stored the value in the attribute field
                                    // ensure that the content of the tag actually does fit into the field length...otherwise do truncate it
                                    string tagValue = tagItem.v;

                                    int fieldLength = attributeFieldLength[fieldIndex];

                                    if (tagValue.Length > fieldLength)
                                        tagValue = tagValue.Substring(0, fieldLength);

                                    osmRow.set_Value(fieldIndex, tagValue);
                                    rowChanged = true;

However, my main point was that, since many important keys have only a limited range of valid (I can better write "most used") values, it might be a better option to derive defaults for common keys based on statistics.

OpenStreetMap has an open data model, anyone can enter anything they like as tag values. That said, many important keys do have documented "domains" of what most users consider valid values. These are often far less than 100 characters, so it makes sense considering them.

However, since values are not fixed, using real statistics of tag values, and deriving the maximum length of let's say the 95-98% most common values for each key, would achieve what I envision: a more realistic default width for the OSM Attribute Selector to apply to a database field for representing an OSM key.

mboeringa commented 9 years ago

Currently, according to OpenStreetMap TagInfo (https://taginfo.openstreetmap.org/keys), there are 53382 unique keys in the OSM database. I am not saying we need to derive defaults for all of these, but even having defaults for the one thousand most used keys, would be of value.

And storing default field width values for even 50k keys in XML or so to be used by the OSM Attribute Selector, should technically be 0 challenge (of course, deriving the actual default values themselves is not... I am aware of that)

To allow for possible longer new values for a key, the OSM Attribute Selector could possibly add maybe 5-10 characters extra field width above derived values as "headroom"...

ajturner commented 9 years ago

Deriving analyzed defaults is intriguing - but is it necessary? Is there a problem with changing it to 255 so it works all the time?

mboeringa commented 9 years ago

Deriving analyzed defaults is intriguing - but is it necessary? Is there a problem with changing it to 255 so it works all the time?

That is actually a question the ESRI geodatabase experts need to answer. Is it a problem to have an Oracle or SQL Server enterprise geodatabase with let's say 100 text fields all 255 characters, with many of them only partially filled with text, because the actual values in many keys are shorter?? What is the performance penalty if I need to export data from such tables if these tables contain data of entire continents, or possibly the globe, as I need to for my renderer?

Of course, working on my ArcGIS Renderer for OpenStreetMap (ask Christine), I am all in favour of not truncating values. But the above is a realistic scenario, as my renderer needs to extract OSM keys first to fields, before it can generate the thematic feature classes using SQL definition queries.

I currently have over 300 thematic feature layers to generate something like the images below (all straight rendered from OpenStreetMap data, and requiring the ArcGIS Editor for OpenStreetMap's OSM Attribute Selector to support my ModelBuilder / Python based renderer), but it requires a quite large set of attribute fields. I have not seen major issues with several up to 10-20 GB extracts of OSM using File Geodatabases, but what about bigger?

canberra val_di_mello vienna wales

mboeringa commented 9 years ago

By the way, I realized my suggestion is partially flawed, especially the 95-98% cut off of most used values per key. Let's assume we have a key with a rather well defined Wiki documented domain, like

surface = x.

E.g.: Valid values and present in the OSM database (example): paved asphalt cobblestone concrete paving_stones

Of course, if we just count the 95-98% most used values to set the default width of the field, any of the above values may actually not make it, and not be regarded when determining the default width. This might lead to the situation where a value like "paving_stones", can not be stored, since it was not considered one of the most used values in statistics.

Therefore, maybe another approach is needed:

Test if there at least 1000 occurences of key in the OSM database (as otherwise default field widths will probably not be reliably determined, 1000 occurences is also the arbitrary cut-off the OpenStreetMap carto developers use for considering to render a particular tag, e.g. surface = paved).
Test if a key has less than let's than say 50-100 unique values, as that probably indicates some documented domain for the values
If less than 50 unique values, get the longest value of the key, use this as the default
If more than 50 unique values, get the 95-98% most used keys, and determine the longest value, set this as default.

mboeringa commented 9 years ago

On the other hand, maybe I am making this all way to complex..., and we should try to use existing sources, like this JOSM Preset XML file, that has many of the most used domains defined:

https://josm.openstreetmap.de/browser/trunk/data/defaultpresets.xml

Anything not in the preset, could default to 100 or 255 char field width...

Also see this page, that references an XSD for the XML as well: https://josm.openstreetmap.de/wiki/TaggingPresets

mboeringa commented 9 years ago

I now found an article that seems interesting in terms of the possible caveats when declaring a size for variable width columns: http://aboutsqlserver.com/2010/08/18/what-is-the-optimal-size-for-variable-width-columns/