Dealing with long OSM key names

Hi Thomas,

Up to now, I haven't really had a problem with this issue since most of the development I have been doing was against a File Geodatabase, which has a pretty generous max fieldname length of 64 characters, but it is clear that there needs to be some proper way to deal with long OpenStreetMap key names in the auto-translation process to database field names of the OSM Attribute Selector tool, and the import tools like the OSM File Loader (Load Only), as keys are always directly translated to database field names.

E.g. take a key like:

generator:output:compressed_air

This key from the pretty decently defined Power scheme of OSM has 31 characters, caused by the concatenation of several "namespaces".

Now, that 31 characters already surpasses the "minimum" allowed maximum field name limit for enterprise geodatabases supported by ArcGIS, which is 30 characters according to this page.

This is even more so if we take into account the true conversion taking place, where the field name will be prefixed with "osm" and the the colon replaced with "_58\":

osm_generator_58_output_58_compressed_air = 41 characters

So this key, while representable in a File Geodatabase, will cause a major issue when used in an Enterprise Geodatabase.

In fact, I noticed that if I supply a key name that is to long to translate to a valid File Geodatabase field name, e.g.:

this_is_a_crazy_long_tag_name_that_can_not_be_extracted_can_it = 62 characters

then in that case the tool gives the following - slightly inaccurate as it doesn't mention length - warning about an invalid field name, and skips creating the field and key extraction:

The name of the Field is invalid: valid names may contain letters, numbers or underscores

While understandable, this is somewhat undesirable behaviour, as it limits what keys can be extracted and used in ArcGIS.

My suggestion: I have been thinking about this, and my current suggestion would be to change this behaviour:

If the tool detects it can translate the key name to a valid field name of <= 30 characters, then maintain the current behaviour for auto-translation of field names as described above.
If the result of the initial auto-translation is unacceptable in terms of field name length, so > 30 characters, then resort to an alternative auto-generated name: 1) Keep the "osm_" prefix although it does cost 4 chars. It is needed for consistency and auto-detection of fields referencing OSM keys in SQL statements. 2) Check if there is a ":" colon namespace character in the OSM key name. 3) If NOT, and the key name is longer than 30 characters, then skip the field name. Really, the chances of a key name with >30 characters without namespace, should be remote, so I think it fair to fail in this instance. People should really not define key names this long. 4) If YES, there is a namespace, then take the first character of the namespace as a substitute for the entire namespace name and use this to generate the output field name. Since single character key names are essentially non-existent (at least not in the documented OSM Wiki I think), using a single character should not result in key name conflicts. Of course, there can be two namespaces starting with the same character, but in real key names they are unlikely to both: 1) be over the 30 character limit at the same time, and 2) use the same ending key name. So I think this translation should be relatively safe. 5) In the case of 4), use just "" underscore instead of "_58\" to replace the colon. This saves 3 precious characters for each namespace.

So, using this routine, the key:

generator:output:compressed_air

would result in a database field name of:

osm_g_o_compressed_air

which is just 22 characters, which is still 8 characters below the max, leaving some head room for even more complicated keys. Of course, a database field name alias must be set as well, as is already the case with the current implementation.

Preferably, for consistency between File and Enterprise Geodatabases and SQL statement development, this behaviour should be implemented for both types of geodatabases with the stated limit of 30 characters.

@ThomasEmge , what are your thoughts on all of this?

Marco, I agree long name for tags are rare but they do happen. Here is the approach the code currently takes:

Translate the name into a safe string acceptable to the database for the field name.

As you mentioned something generator:output:compressed_air becomes osm_generator_58_output_58_compressed_air

In case the string is too long, it will be chopped to the maximum the database allows (as defined in the ArcObjects API). And this will become the field name. I haven't done extensive testing on this, but there still is the chance for conflict for the first x characters. So far the first x characters proved to be unique enough to avoid conflicts. However this is also one those cases, where the tool will 'interfere' and modify the data. The tool is comparing strings and treating upper and lower case the same (database constraint). As a result, multiple variations of: FIXME, fixme, FixMe, etc. will be collapsed into one field.
The original name of the tag is stored in the field alias as there are no database constraints on this property. There is also no restriction on the length and the full string is required if you ever wanted to export the data again back into an OSM file.

Hi @ThomasEmge ,

Thanks for the insights.

However, my concern with the current approach is that it is non-universal and not "cross-database" compatible. As you currently describe it, the length at which the field name gets chopped, will be variable, depending on the particular database.

I think it would be highly desirable to have a "cross database" compatible approach, like the one I described, which will result in uniform field naming whatever the underlying enterprise geodatabase (or file geodatabase). This will also facilitate ESRI geodatabase replication, which obviously would fail with the current approach if multiple different database types are involved (even with just a File and Enterprise Geodatabase). Taking the "lowest denominator" approach as in my suggestion, would. And it doesn't have to be advanced geodatabase replication, saving FCs to a File Geodatabase, and sharing that for import in another Enterprise Geodatabase, would also be helped with a uniform, predictable, approach.

Having aliases, although nice, isn't a complete substitute for this, as SQL statements in ArcGIS, like a layer file's Definition Query, require true database field names. This would again lead to the inability to share layer files created for OSM derived data between users running different database platforms, at least in those rare cases with long, but valid, key names used in the layer file's properties (and that doesn't even have to be SQL as in a Definition Query or labelling SQL Query, it could be other properties as well like symbology or label field).

However this is also one those cases, where the tool will 'interfere' and modify the data. The tool is comparing strings and treating upper and lower case the same (database constraint). As a result, multiple variations of: FIXME, fixme, FixMe, etc. will be collapsed into one field.

Although this data should not be uploaded back to the OSM servers, collapsing the data into one field seems acceptable considering these are rare corner cases and are actually disputable and undesirable tagging practices.

I realize my suggestion requires writing custom code for deriving field names, instead of having the convenience of ArcObjects or ArcPy field name validation, but I would be interested to know if you think this is feasible, to make such an adjustment to the toolbox.

@ThomasEmge ,

The more I am thinking about this, the more I am convinced we need to tackle this. The problem with the current implementation via ArcObjects field name validation is that it is extremely wasteful. Each unsupported char (and especially the OSM colon namespace separator) leads to 4 characters being introduced to accommodate an ASCII number representing the character, while as I wrote entirely above also, 30 characters is the least common denominator for maximum field name length according to ESRI docs.

While I can somewhat understand this implementation for common case scenario's where you don't know anything about the string being provided as the source for the field name, in OSM we do know something about that string: generally speaking, OSM keys should only contain any of the characters in the following collection:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_

And secondly the colon is used as a namespace separator. This is useful information, especially the namespace separator, as, as I suggested in the first post, it can be used to 'intelligently' and predictably abbreviate key names to form field or column names without chopping them off kind of randomly when the limit is reached.

I have now taken a shot at this and developed a Python function to implement what I suggested, that I included below and you could use as a basis for a similar implementation in the OSM Attribute Selector.

A couple of notes:

1) As you can see, I implemented 2 "modes": DBIdentifierLimitEnforcing=STRICT and DBIdentifierLimitEnforcing=LOOSE. With STRICT, the limit for maximum allowable field / column name length defaults to a fixed 30 chars. With LOOSE, there is a configurable setting that you could prompt the user for an entry.
2) In the current implementation, only the characters defined in the example above are allowed. This is a very stringent interpretation (but in my opinion valid). In reality, OSM does not impose this, and any valid unicode character can be used (e.g. see this TagInfo page). However, for the purpose of geodatabase creation, I think a more stringent set of requirements is wise and reasonable. Most non conforming keys will be of little to no use anyway, since the most used keys are very generally conforming to the stringent interpretation.
3) There is another good reason for the restrictions of 2): only in this manner, the 30 chars maximum field length can be guaranteed. Databases generally define max field lengths in terms of bytes, not chars, meaning any unexpected Asian chars for instance, could lead to an unpredictably lower max available field length. By restricting to the ASCII chars defined in 2) only, the 30 chars should be guaranteed.
4) The 'fieldNamePrefix' allows you to supply the used 'osm_' prefix to take into account when determining the maximum limit on OSM key length and column names
5) 'osmTagKey' is of course the original OSM key, so e.g. addr:housenumber
6) As per my first suggestions, tags and namespaces are only abbreviated if they exceed the field name length limit, all other namespaced keys will not be abbreviated, so a key like 'addr:housenumber' should never be abbreviated.

7) Instead of using up 4 chars, the colon namespace separator is simply converted to a single underscore, this saves valuable space.

def getFieldName(osmTagKey,fieldNamePrefix,DBIdentifierLimitEnforcing,DBIdentifierLimit):

validChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_"

if not set(osmTagKey).issubset(validChars):
    return [False,"Cannot convert the OpenStreetMap key '" + osmTagKey + "' due to invalid characters. Only Latin alphabetic " +
            "characters, numbers, colon (:) and underscore (_) are valid characters for OSM keys."]

# Set database identifier limits based either on least common denominator value of 30 chars based on ArcGIS limit ('STRICT'),
# or use a user supplied or database configuration specific limit ('LOOSE').
if DBIdentifierLimitEnforcing == 'STRICT':
    inputLimit = 30 - len(fieldNamePrefix) # Default is four characters reserved for the 'tgc_' prefix of a field name
    outputLimit = 30
elif DBIdentifierLimitEnforcing == 'LOOSE':
    inputLimit = DBIdentifierLimit - len(fieldNamePrefix) # Default is four characters reserved for the 'tgc_' prefix of a field name
    outputLimit = DBIdentifierLimit

# Test for length of the key
if len(osmTagKey) > inputLimit and not (":" in osmTagKey):
    return [False,"An OpenStreetMap key name with >" + str(inputLimit) + " characters has been detected! This tool only supports keys up to " +
            str(inputLimit) + " characters long due to database identifier constraints.\nErrorneous key: " + osmTagKey +
            " (" + str(len(osmTagKey)) + " characters)"]

# Create new field name
fieldName = fieldNamePrefix + osmTagKey.replace(":","_")

# Test if the field name exceeds the output limit, if so, attempt to
# abbreviate the field name by examing the OSM namespaces (if present!) and using the first character
# of each found namespace as a replacement for the full namespace.
if len(fieldName) > outputLimit:

    # Split key using the namespace separator ":"
    if ":" in osmTagKey:
        keyParts = osmTagKey.split(":")

    keyPartCount = len(keyParts)

    fieldName = fieldNamePrefix

    for keyPartIdx in range(0,keyPartCount - 2):
        keyPart = keyParts[keyPartIdx]
        keyPartShort = keyPart[0:1] # Slice string to get first character
        fieldName += keyPartShort + "_"

    # Add last subkey element
    fieldName += keyParts[keyPartCount - 1]

# Check a second time the new field name doesn't exceed the output limit.
# If it does, fail and return an error message explaining the situation.
if len(fieldName) > outputLimit:
    return [False,"The following OpenStreetMap key can not be used with this tool. It exceeeds database identifier constraints of " + str(outputLimit) +
            " characters after conversion to a valid database field name.\nErrorneous key: " + osmTagKey +
            "\nAuto-generated field name: " + fieldName + " (" + str(len(fieldName)) + " characters)"]

return [True,fieldName]

@ThomasEmge

This is an updated and enhanced version of my Python function. The difference with the previous version is that this version will attempt to maintain as much of the original key name in the generated output field name, thereby reducing the loss of information / descriptiveness.

Whereas in the previous version of the code a key like:

generator:output:compressed_air (31 characters)

would result in a database field name of:

osm_g_o_compressed_air (22 characters)

as all namespaces would be collapsed / abbreviated to their first character only, the new version of the code will generate a field name like:

osm_g_output_compressed_air (27 characters)

So the code only abbreviates as many namespaces as necessary to reduce the field name to less than the DBIdentifierLimit (which is 30 characters default with mode "STRICT"), while maintaing full namespaces for those not abvreviated. The code will never abbreviate a namespace partially (e.g. no "outp" for "output"), either its the first character, or the whole namespace ("o" or "output" for "output").

Collapsing / abbreviating begins with the left most namespace, continuing with subsequent namespaces as necessary. So a fictional OSM key like:

generator:output:compressed_air:volume_cubic_meter (31 characters)

yields:

osm_g_o_c_volume_cubic_meter (28 characters)

so with three namespaces having been collapsed / abbreviated to comply with the 30 character limit of mode "STRICT".

Additionally, I introduced a new variable nameSpaceSeparatorReplacement, which is the string used to replace the OSM ":" colon namespace character. Although I have made a variable of this, I would recommend hardcoding it to a single "_" underscore when calling the function, as I suggested before and assumed in the examples above, so as to reduce the usage of precious character space in the output field name.

# Function for converting an OSM key to a field name as created by the 'OSM Attribute Selector' tool of the
# 'ArcGIS Editor for OpenStreetMap'
#
def getFieldName(osmTagKey,fieldNamePrefix,nameSpaceSeparatorReplacement,DBIdentifierLimitEnforcing,DBIdentifierLimit):

    validChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_"

    if not set(osmTagKey).issubset(validChars):
        return [False,"Cannot convert the OpenStreetMap key '" + osmTagKey + "' due to invalid characters. Only Latin alphabetic " +
                "characters, numbers, colon (:) and underscore (_) are valid characters for OSM keys."]

    # Set database identifier limits based either on least common denominator value of 30 chars based on ArcGIS limit ('STRICT'),
    # or use a user supplied or database configuration specific limit ('LOOSE').
    if DBIdentifierLimitEnforcing == 'STRICT':
        inputLimit = 30 - len(fieldNamePrefix) # Default is four characters reserved for the 'tgc_' prefix of a field name
        outputLimit = 30
    elif DBIdentifierLimitEnforcing == 'LOOSE':
        inputLimit = DBIdentifierLimit - len(fieldNamePrefix) # Default is four characters reserved for the 'tgc_' prefix of a field name
        outputLimit = DBIdentifierLimit

    # Test for length of the key
    if len(osmTagKey) > inputLimit and not (":" in osmTagKey):
        return [False,"An OpenStreetMap key name with >" + str(inputLimit) + " characters has been detected! This tool only supports keys up to " +
                str(inputLimit) + " characters long due to database identifier constraints.\nErrorneous key: " + osmTagKey +
                " (" + str(len(osmTagKey)) + " characters)"]

    # Create new field name
    fieldName = fieldNamePrefix + osmTagKey.replace(":",nameSpaceSeparatorReplacement)

    # Test if the field name exceeds the output limit, if so, attempt to
    # abbreviate the field name by examening the OSM namespaces (if present!) and using the first character
    # of each found namespace as a replacement for the full namespace.
    if len(fieldName) > outputLimit:

        # Split key using the namespace separator ":"
        if ":" in osmTagKey:
            keyParts = osmTagKey.split(":")

        keyPartCount = len(keyParts)

        fieldName = fieldNamePrefix

        isBrokeOut = False

        for keyPartIdx in range(0,keyPartCount - 1):

            keyPart = keyParts[keyPartIdx]
            keyPartShort = keyPart[0:1] # Slice string to get first character

            fieldName += keyPartShort + nameSpaceSeparatorReplacement

            fieldNameTmp = fieldName

            # Rebuild field name to see if with the change to the current namespace, the field name is already acceptably abbreviated.
            # If so, exit loop, if not, continue abbreviation of namespaces (if any left...)
            for keyPartIdx in range(keyPartIdx + 1,keyPartCount - 1):
                keyPart = keyParts[keyPartIdx]
                fieldNameTmp += keyPart + nameSpaceSeparatorReplacement

            # Add last subkey element
            fieldNameTmp += keyParts[keyPartCount - 1]

            if len(fieldNameTmp) > outputLimit:
                continue
            else:
                fieldName = fieldNameTmp
                isBrokeOut = True
                break

        if isBrokeOut == False:
            # Add last subkey element
            fieldName += keyParts[keyPartCount - 1]

    # Check a second time the new field name doesn't exceed the output limit.
    # If it does, fail and return an error message explaining the situation.
    if len(fieldName) > outputLimit:
        return [False,"The following OpenStreetMap key can not be used with this tool. It exceeeds database identifier constraints of " + str(outputLimit) +
                " characters after conversion to a valid database field name.\nErrorneous key: " + osmTagKey +
                "\nAuto-generated field name: " + fieldName + " (" + str(len(fieldName)) + " characters)"]

    return [True,fieldName]

Esri / arcgis-osm-editor

Dealing with long OSM key names #165