Esri / arcgis-osm-editor

ArcGIS Editor for OpenStreetMap is a toolset for GIS users to access and contribute to OpenStreetMap through their Desktop or Server environment.
Apache License 2.0

Performance of OSM Attribute Selector #148

Open mboeringa opened 8 years ago

mboeringa commented 8 years ago

Hi @ThomasEmge

Now that you've tackled the performance of the loading process so effectively, something else in a potential rendering workflow attracts more attention: the performance of the OSM Attribute Selector. This tool is vital for extracting the tags needed to render OSM data in ArcGIS, yet it is quite slow. This has become even more apparent with the new OSM File Loader: loading a dataset can now be faster than the actual extraction of attributes. Of course, with many tags selected there is a potentially massive increase in the number of columns and in table size once you run the OSM Attribute Selector, but I doubt that is the whole story.

What is most concerning, though, is that I have noticed that exporting a table with already extracted tags to an entirely new table is considerably faster than the actual extraction process of the OSM Attribute Selector.

This is not really surprising, as I am well aware the OSM Attribute Selector must do additional work: it needs to parse the tags from XML, JSON, or whatever key/value format is used in the binary osmTags field. But considering the difference above, and that a database like PostgreSQL is capable of serving direct read queries on HStore and JSONB key/value storage, and of indexing them, it begs the question whether there isn't a more performant solution on the horizon that could close this gap: a 25x slower parse-and-write process versus a plain table copy on a table with a similar number of records and fields.
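To illustrate the extra per-row work being discussed, here is a minimal plain-Python sketch (not the tool's actual code; the XML layout of the osmTags blob shown here is an assumption) of parsing a key/value blob into columns, which a plain table copy gets to skip entirely:

```python
import xml.etree.ElementTree as ET

def extract_tags(blob, wanted):
    """Parse an XML-style tag blob and pull out the wanted keys.

    This per-row parse is the additional work the OSM Attribute
    Selector must do; a plain copy just moves already-split columns.
    """
    root = ET.fromstring(blob)
    tags = {t.get("k"): t.get("v") for t in root.findall("tag")}
    # Keys absent from this feature's tags come back as None.
    return {k: tags.get(k) for k in wanted}

# Hypothetical blob for one feature:
row_blob = '<tags><tag k="building" v="yes"/><tag k="name" v="Mairie"/></tags>'
print(extract_tags(row_blob, ["building", "name", "height"]))
# → {'building': 'yes', 'name': 'Mairie', 'height': None}
```

Repeated per row over millions of features, this parse step (plus widening the output schema) is a plausible source of the gap, even before any write-path differences.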

This is just a generic question: are there any options for optimizing this process of tag extraction using the OSM Attribute Selector?

ThomasEmge commented 8 years ago

Creating additional fields after loading the data is slow. The time is proportional to the number of existing rows in the feature class. You might want to change your workflow to create your required fields when you load the data. The old OSM loader will let you create the fields (schema) when you load the file and then as a second step you need to extract the attributes with the selector tool. You can use the EXISTING parameter for extracting the values of the created fields. The new OSM loader will also let you specify the fields at loading time and it will populate the values at the same time.

mboeringa commented 8 years ago

You might want to change your workflow to create your required fields when you load the data.

That would be a huge technical change to my renderer, essentially breaking most of the functionality I have built over the past three years. It is really not an option.

My current setup is style based, and each style can have multiple render rules each defining their own tag set. Each render rule can be written out to its own table, so with a unique set of fields. This option can be set for each render rule specifically, so you can render from just three tables: the Point, Polyline and Polygon feature classes originally created by the editor's load tools, or create multiple tables for different thematic layers.

If I toggle a few switches in the tools, I can choose to render and create up to 350(!) different tables / ArcGIS Feature Classes (for the 350 render rules in my main style) in a single render session, each having its own set of columns defined by the user as keys in a render rule. This is actually a highly optimized process: the Modelbuilder models, and especially the arcpy Python script, continuously examine and track the existing schema, and only extract an OSM key with the OSM Attribute Selector if it doesn't already exist. It never extracts the same key twice, wasting resources and processing time, even if a key has been specified, or is used in the SQL Definition Queries of render rules, potentially dozens of times.
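The schema-tracking idea described above can be sketched in plain Python (hypothetical names; the real models would list fields via arcpy rather than receive them as a list):

```python
def keys_to_extract(existing_fields, rule_keys):
    """Return only the OSM keys not yet present as columns.

    Comparing case-insensitively against the current schema ensures
    no key is ever extracted twice across render rules.
    """
    existing = {f.lower() for f in existing_fields}
    return sorted(k for k in rule_keys if k.lower() not in existing)

# Fields already in the (hypothetical) Polygon feature class:
fields = ["OBJECTID", "osmTags", "building", "name"]
# Union of keys demanded by all render rules:
needed = {"building", "highway", "name", "amenity"}
print(keys_to_extract(fields, needed))  # → ['amenity', 'highway']
```

The attribute selector then only ever runs on the returned subset, which is what keeps the 350-rule session from re-extracting shared keys.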

Up to now, my renderer has always assumed that the feature classes were already created by the user; it did not incorporate the step of actually loading the data using the Load OSM File tool, or now the new OSM File Loader.

Actually, one geodatabase can - with some limitations - technically support multiple styles. It will just dynamically load the additional keys missing in the existing geodatabase using the OSM Attribute Selector, after it has examined the existing schema of the base Point, Polyline and Polygon tables.

It would still be nice if the OSM Attribute Selector could do its job quicker...

ThomasEmge commented 8 years ago

I'll do a profile on the tool and see where most of the time is spent.

ThomasEmge commented 8 years ago

I compared the two cursor types (update vs. search) for the file geodatabase, and the code is already using the faster one. Unless we rewrite the tool at a lower COM level, I think the current speed is it.

mboeringa commented 8 years ago

Unless we would rewrite the tool at a lower COM level I think the current speed is it.

I guess the answer is that this is a no-go area, but still: how difficult would such a rewrite be, if you had to classify the effort?

I am asking because of the observation in my original post: there is a 25x speed difference between running the Feature Class to Feature Class tool against a table that already has all key fields, and extracting those same key fields with the OSM Attribute Selector against the same table not yet having them. Since the number of records is identical in both cases, it begs the question what causes this huge difference, and whether it could be overcome, at least partially.

It really would still be of great use if the OSM Attribute Selector had improved performance.

Anyway, regarding your lower-COM-level suggestion: is that future-proof in light of a possible transition to Pro? And if not, what is the equivalent solution for Pro?

ThomasEmge commented 8 years ago

My guess would be moderate. It is the tool itself and a fair amount of helper classes as well.

mboeringa commented 7 years ago

My guess would be moderate. It is the tool itself and a fair amount of helper classes as well.

@ThomasEmge

Just to highlight the necessity of a faster OSM Attribute Selector: I am currently rendering France. For that, the OSM Attribute Selector must extract the OSM keys for the Polygon, Polyline and Point tables. The Polygon table especially, due to the diverse render rules and the large number of records (France has all cadastral buildings imported), is very slow.

It is currently at 10% after one day, which means the extraction process using the OSM Attribute Selector will take 10 days for the Polygon table alone...

If I choose to render each render rule to its own dedicated Feature Class (meaning 340 FCs), it would take a multiple of that (although not many render rules have as many records as the buildings table, so processing time would definitely not be 340 x 10 days, but probably more in the range of a month or so).

Scaling this up to Europe, which is currently represented by a Geofabrik extract about 5x bigger than France, would mean 50 days of processing for the Polygon table alone, and then the Polyline and Point tables still need to be done (although with my current render rules, they are much faster than the Polygon table processing).

Again, considering the 25x gap between a simple Feature Class to Feature Class processing step and the performance of the attribute selector, there is much to gain. It should potentially be possible to render Europe (after an estimated 10 days import using the OSM File Loader tool), in another 10 days total or so.
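The back-of-the-envelope estimates in this thread can be checked with a few lines (inputs are the rounded figures quoted above, so treat the results as order-of-magnitude only):

```python
# Figures quoted in the thread (rounded assumptions):
days_elapsed = 1      # France Polygon extraction ran one day...
percent_done = 10     # ...and reached 10%
europe_factor = 5     # Geofabrik Europe extract ~5x the size of France

# Linear extrapolation: full France Polygon table extraction time.
france_polygon_days = days_elapsed * 100 / percent_done

# Same per-record rate applied to the ~5x larger Europe extract.
europe_polygon_days = france_polygon_days * europe_factor

print(france_polygon_days)  # → 10.0
print(europe_polygon_days)  # → 50.0
```

Both extrapolations assume the extraction rate stays constant as the table grows, which is optimistic if per-row cost rises with table size.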