OpenHistoricalMap / issues


Tiler Status and Future Improvements #909

Open Rub21 opened 3 days ago

Rub21 commented 3 days ago

Performance issues have been reported in the tiler: in some areas of the map, fetching tiles takes between 1 and 2 minutes, which adds latency across the tilers. This could be due to several factors:

Tile Cache Configuration:

• The current tiler cache configuration clears cached tiles across all zoom levels, from 0 to 20. This means that when a user makes an edit, the cleanup can affect tiles globally, depending on the list of tiles marked for cleaning. This approach was implemented because of a past requirement that boundary updates be reflected immediately. For example, when a file generated by imposm (like this reference list) is used, the cleaning system processes the file and clears not only the listed tiles but also their parent and child tiles at all zoom levels, from 0 to 20.

This strategy significantly impacts performance, especially in areas with high edit density or frequent updates, which can cause delays in visualizing changes.
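To get a feel for how far a single expired tile fans out under this policy, the sketch below counts the tiles cleared when one tile is expanded to its ancestors and descendants. It only illustrates the standard XYZ tile pyramid, where each tile has one parent and four children. The function name `purge_set` and the example tile are hypothetical and are not taken from the cleaning script.

```python
def purge_set(z, x, y, min_zoom, max_zoom):
    """Return every tile in [min_zoom, max_zoom] that is the tile itself,
    one of its ancestors, or one of its descendants (XYZ scheme)."""
    tiles = []
    for zz in range(min_zoom, max_zoom + 1):
        if zz <= z:
            # Ancestor (or the tile itself): drop the low-order bits.
            shift = z - zz
            tiles.append((zz, x >> shift, y >> shift))
        else:
            # Descendants: each level down splits a tile into 2 x 2.
            shift = zz - z
            x0, y0 = x << shift, y << shift
            side = 1 << shift
            tiles.extend((zz, x0 + dx, y0 + dy)
                         for dx in range(side) for dy in range(side))
    return tiles


# Hypothetical expired tile at zoom 14.
full = purge_set(14, 4823, 6160, min_zoom=0, max_zoom=20)
narrow = purge_set(14, 4823, 6160, min_zoom=8, max_zoom=16)
print(len(full), len(narrow))
```

Under these assumptions, one zoom-14 tile expands to 5,475 tiles when the range is 0 to 20, compared with 27 tiles for a narrower range such as 8 to 16. Almost all of the difference comes from the descendants at zooms 17 to 20.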

Lack of Frequent Reindexing:

• The tiler’s database has been handling a high volume of inserts, updates, and deletes performed by imposm. The database has been running for almost 4 to 5 months without reindexing. I read that periodic reindexing of the tables is recommended, as it can significantly improve data selection during queries.

Recently, I reindexed the tables in the public schema, which has resulted in some improvement in response times. However, I believe the results are still not optimal, and it may be necessary to explore other strategies to further optimize performance.

    -- Rebuild the primary-key index of every table in the public schema.
    DO $$
    DECLARE
        tbl RECORD;
        idx_name TEXT;
    BEGIN
        FOR tbl IN
            SELECT tablename
            FROM pg_tables
            WHERE schemaname = 'public'
        LOOP
            -- Look up the table's primary-key index by its default name.
            SELECT indexname INTO idx_name
            FROM pg_indexes
            WHERE schemaname = 'public'
              AND tablename = tbl.tablename
              AND indexname = tbl.tablename || '_pkey';

            -- idx_name is NULL when the table has no matching index.
            IF idx_name IS NOT NULL THEN
                RAISE NOTICE 'Reindexing index: %', idx_name;
                EXECUTE format('REINDEX INDEX %I.%I;', 'public', idx_name);
            END IF;
        END LOOP;
    END $$;

Infrastructure

The tile database was running on an x.large machine (4 CPUs and 16 GB RAM), constantly operating at maximum capacity. This could be one of the factors behind the tiler's long response times, as an overloaded system cannot handle ongoing requests and processes efficiently.

I have since migrated the tiler DB to a 2x.large machine (8 CPUs and 32 GB RAM), which still occasionally experiences high resource demand.

Improvements that we need to do

Why is it important to cache tiles?

A typical user accessing the application and exploring the data can make requests ranging from zoom level 1 to zoom level 20 within seconds in a specific area. This behavior creates a high demand on the database, as it needs to handle hundreds of tile requests simultaneously.

Moreover, each tile can contain approximately 27 different layers, which multiplies the amount of data the database must process and deliver. This volume of queries inevitably causes delays, as the database “struggles” to respond to all these requests in real time.
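As a rough back-of-the-envelope illustration of that multiplication, the sketch below estimates the number of layer queries generated by one user zooming through a single area. The viewport size, the 256-pixel tile size, and the assumption that every layer in every tile costs one database query on a cache miss are illustrative guesses, not measured values.

```python
import math


def tiles_in_viewport(width_px, height_px, tile_px=256):
    # One extra row and column covers tiles that are only partly visible.
    cols = math.ceil(width_px / tile_px) + 1
    rows = math.ceil(height_px / tile_px) + 1
    return cols * rows


LAYERS_PER_TILE = 27   # approximate figure quoted above
ZOOM_LEVELS = 20       # zoom 1 through 20

per_zoom = tiles_in_viewport(1920, 1080)
tile_requests = per_zoom * ZOOM_LEVELS
layer_queries = tile_requests * LAYERS_PER_TILE
print(per_zoom, tile_requests, layer_queries)
```

With these numbers, the estimate is 54 tiles per zoom level, 1,080 tile requests, and 29,160 layer queries with a cold cache. The lowest zooms have fewer tiles than the viewport can show, so the true figure is somewhat lower, but the order of magnitude shows why uncached requests are expensive.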

Using a caching system would be extremely helpful, as it would not only reduce the load on the database but also ensure that tiles remain fast and efficient for users.

1. Initial Actions, Improving the Tiler Cache Flow

The first step I took was to reduce the tile cleaning range, limiting it to zoom levels 8 to 16 https://github.com/OpenHistoricalMap/ohm-deploy/blob/main/images/tiler-server/seed-by-diffs.sh#L33-L38 . Currently, this configuration may work effectively, as it allows tiles at lower zoom levels (1, 2, 3, 4, 5, 6, 7) to remain cached, avoiding significant latencies at these levels.

Advantages:
• Enhances performance at lower zoom levels, ensuring a smoother experience for users viewing general map overviews.

Disadvantages:
• If edits are made in large areas, such as country or national-level boundaries, these changes will not be reflected in low zoom tiles (1-7) until we update the tiler cache.

To mitigate this drawback, I am testing a strategy where low zoom tiles (1-7) are generated or updated every 24 or 48 hours. This methodology is similar to that used by Mapbox. For example:
• If a country-level relation, like Peru, is modified, the changes will not appear in the tiles for at least 24 hours.
• On the other hand, if a street is edited, the changes will be reflected within the next minute or hour.
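A minimal sketch of how the deferred part of that strategy could be collected. It assumes the expired-tile list contains one `z/x/y` entry per line and that zooms 1 to 7 wait for the scheduled refresh. The function names and the sample entries are hypothetical.

```python
def parse_tile(line):
    z, x, y = (int(part) for part in line.strip().split("/"))
    return z, x, y


def low_zoom_backlog(expired_lines, low_min=1, low_max=7):
    """Collect the low-zoom ancestors of expired tiles, deduplicated,
    so they can be refreshed together in the scheduled batch."""
    backlog = set()
    for line in expired_lines:
        if not line.strip():
            continue  # skip blank lines
        z, x, y = parse_tile(line)
        for zz in range(low_min, min(low_max, z) + 1):
            shift = z - zz
            backlog.add((zz, x >> shift, y >> shift))
    return sorted(backlog)


# Three neighbouring zoom-14 tiles (made-up values).
expired = ["14/4823/6160", "14/4824/6160", "14/4823/6161"]
backlog = low_zoom_backlog(expired)
print(len(backlog))
```

Nearby edits collapse onto the same few low-zoom tiles (here, three expired tiles share one ancestor per zoom level, 7 in total), which is why refreshing them in a daily batch is far cheaper than clearing them on every changeset.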

I am currently testing this strategy manually, and it seems to be working well. When viewing the map, tiles for zoom levels 7 and below exhibit significantly reduced latency.

Additionally, it is important to keep key cities cached. I tested caching for some main cities in the U.S. and Europe using the application: https://github.com/OpenHistoricalMap/tiler_seed_cache/tree/tegola . Take a look to assess if this could help mitigate latency further.
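For reference, the list of tiles to seed for a city can be derived from a bounding box with the standard Web Mercator (XYZ) tile formula. The sketch below is not taken from the seeding application linked above; the function names and the bounding box are hypothetical and only show the calculation.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Standard Web Mercator (XYZ) tile containing a lon/lat point."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the east/south edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west, south, east, north, zoom):
    x_min, y_min = lonlat_to_tile(west, north, zoom)   # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)   # bottom-right corner
    return [(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


# Hypothetical bounding box roughly around a city; not a real config value.
bbox = (-74.05, 40.68, -73.90, 40.82)
for zoom in (8, 12, 14):
    print(zoom, len(tiles_for_bbox(*bbox, zoom)))
```

The tile count grows roughly fourfold per zoom level, so the upper zoom limit chosen for each city largely determines how long a seeding run takes.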

This process is still manual, but if we agree to continue with this cache flow, I can automate the cache generation process.

2. Improvements in tiler queries

While previous improvements to the tiler have included geometry simplification and converting polygons to lines to reduce tile size https://github.com/OpenHistoricalMap/issues/issues/702 , there is a critical point we haven’t addressed: these PostgreSQL functions are themselves expensive, especially when they are executed again on every tile request.

To optimize further, the following approach is recommended:

Currently, JSONB is used in PostgreSQL to extract 411 languages - https://github.com/OpenHistoricalMap/issues/issues/679 , but eventually, this data should be stored in dedicated columns in the database. This change could also help reduce query times and improve overall efficiency.

3. Periodic Reindexing

Regular reindexing would keep our tables' indexes optimized, improving query performance and enabling more efficient data access. This should be added as a cron job that runs the reindexing at regular intervals.
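One possible shape for such a job is a small script that emits the statements and lets a scheduled task pipe them to `psql`. The sketch below assumes PostgreSQL 12 or later, where `REINDEX ... CONCURRENTLY` is available and avoids blocking writes for the duration of the rebuild. The table names are hypothetical.

```python
def quote_ident(name):
    # Double-quote an identifier, doubling any embedded quotes.
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(tables, schema="public", concurrently=True):
    keyword = "REINDEX TABLE CONCURRENTLY" if concurrently else "REINDEX TABLE"
    return [f"{keyword} {quote_ident(schema)}.{quote_ident(t)};" for t in tables]


# Hypothetical table names, for illustration only.
for stmt in reindex_statements(["osm_buildings", "osm_admin_areas"]):
    print(stmt)
```

`REINDEX ... CONCURRENTLY` cannot run inside a transaction block, so each statement would need to be sent on its own rather than wrapped in a `DO` block.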

4. Infrastructure

Currently, I have migrated the tiler database to a 2x.large machine with the appropriate configuration. However, even with this change, the database still experiences response delays during high traffic or a large volume of queries per second.

We need to find a balance where the resource is not too expensive, yet our tiles remain fast. One potential solution could be hosting our tiler database with a provider outside of AWS https://github.com/OpenHistoricalMap/issues/issues/894 , as there are less costly infrastructure options with high capacity. Additionally, since the data is not mission-critical and can be regenerated as needed, this option might be more cost-effective and scalable.

5. Cache Cleaning and Cache Seeding

Cache cleaning is a critical process triggered after each changeset. A more efficient strategy would be to combine cleaning with automatic cache generation for the cleared tiles. This approach could reduce latency and provide a faster experience for users needing immediate visualization of their changes.

To implement this workflow, it would be ideal to use scalable jobs that can quickly increase the number of cleaning containers based on demand. Currently, we have a single cleaner, which tends to collapse under heavy cache cleaning requests. This issue was previously addressed internally with scripts that, upon detecting the container reaching its limit, would terminate all cache processes and automatically clear the entire tiler cache in S3.
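The ordering described above (clear a tile, then regenerate it right away, with several workers sharing the load) can be sketched as follows. This is only an in-process illustration: `purge` and `seed` are hypothetical callables standing in for whatever commands the real cleaner runs, and a production setup would more likely scale containers or jobs than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def refresh_tiles(tiles, purge, seed, workers=4):
    """Purge each tile and immediately regenerate it, in parallel.

    `purge` and `seed` are caller-supplied callables (hypothetical here);
    a tile is only seeded after its own purge has returned.
    """
    def refresh(tile):
        purge(tile)
        seed(tile)
        return tile

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input tiles.
        return list(pool.map(refresh, tiles))


# Stand-ins that only record what they were asked to do.
log = []
done = refresh_tiles(
    [(14, 4823, 6160), (14, 4824, 6160)],
    purge=lambda t: log.append(("purge", t)),
    seed=lambda t: log.append(("seed", t)),
    workers=2,
)
print(done)
```

Capping `workers` bounds the pressure on the database, which matters because seeding runs the same expensive layer queries that the cache is meant to avoid.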

This is something I have been thinking about and testing to create a faster tiler experience for OSM users. I would like your input here.

cc. @1ec5 @danrademacher @jeffreyameyer @batpad