OpenHistoricalMap / issues


Errors with Proxy_fcgi and CGImap When Uploading Large Data Sets #783

Closed Rub21 closed 4 months ago

Rub21 commented 6 months ago

From: https://developmentseed.slack.com/archives/CACTQ7MU4/p1715276861543709

image

We're experiencing timeout errors (proxy_fcgi:error) with our Apache and cgimap setup when uploading large datasets to our OHM API website. During large data uploads, the server frequently logs proxy_fcgi:error entries indicating that the specified timeout has expired. This error seems to occur mainly when the load on the server is high.

App 8444 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[ N 2024-05-10 15:41:05.8218 349/T3 age/Cor/CoreMain.cpp:1146 ]: Checking whether to disconnect long-running connections for process 8536, application /var/www (production)
App 9778 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[Fri May 10 15:46:07.882219 2024] [proxy_fcgi:error] [pid 360:tid 140526506424000] (70007)The timeout specified has expired: [client 10.10.47.154:38824] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:47:19.762287 2024] [proxy_fcgi:error] [pid 359:tid 140526422505152] (70007)The timeout specified has expired: [client 10.10.47.154:43766] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:48:31.410450 2024] [proxy_fcgi:error] [pid 360:tid 140526600832704] (70007)The timeout specified has expired: [client 10.10.47.154:56726] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:49:43.226386 2024] [proxy_fcgi:error] [pid 359:tid 140526453974720] (70007)The timeout specified has expired: [client 10.10.47.154:49158] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:50:54.714215 2024] [proxy_fcgi:error] [pid 360:tid 140526516913856] (70007)The timeout specified has expired: [client 10.10.47.154:56044] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:52:06.750295 2024] [proxy_fcgi:error] [pid 359:tid 140526537893568] (70007)The timeout specified has expired: [client 10.10.47.154:45536] AH01075: Error dispatching request to : (polling)
App 11208 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[ W 2024-05-10 16:36:17.4959 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29834] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:37:41.8413 349/T7 age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 1-29835] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:02.0024 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29835] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:06.3191 349/T7 age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 1-29836] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:42.0237 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29836] Returning HTTP 503 due to: Request queue full (configured max. size: 100)

Full log in /var/log/apache2/error.log: https://gist.github.com/Rub21/87fd55f82a5c83756cc81fd6385c8daf

cc. @jeffreyameyer @1ec5 @batpad @danrademacher

jeffreyameyer commented 6 months ago

Is fixing this as simple as adjusting the timeout duration, or is there another underlying problem?

Rub21 commented 6 months ago

Is fixing this as simple as adjusting the timeout duration, or is there another underlying problem?

Not really; I have already tried increasing the timeout, but it did not work. I'm still trying to find out where exactly the error is.

@batpad, do you have any hunch about this issue?

Rub21 commented 6 months ago

I've been running many tests, uploading large quantities of objects to the API. When there's a large amount of data to process, a 504 error appears from the API, specifically from CGIMAP. Here are JOSM's logs: https://gist.github.com/Rub21/1bb46bb004f89fdda787fc96b896f2a3

2024-05-13 10:56:50.809 INFO: Gateway Time-out
2024-05-13 10:56:50.809 INFO: Waiting 10 seconds ...
2024-05-13 10:57:00.852 INFO: OK - trying again.
2024-05-13 10:57:00.852 INFO: Starting retry 2 of 5.
2024-05-13 10:57:00.871 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118027/upload (335 kB) ...
2024-05-13 10:58:01.985 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118027/upload -> HTTP/1.1 504 (1 min 0 s; 160 B)
2024-05-13 10:58:02.003 INFO: Gateway Time-out
2024-05-13 10:58:02.003 INFO: Waiting 10 seconds ...

When the web API and the database are freshly started, everything works well and the API handles a large amount of data. However, as more data is gradually added, it starts to throw 504 errors. I checked the database, and it seems the bottleneck is the number of connections to the database: as there is more and more data to upload, cgimap starts throwing the error.

I have increased the number of connections once more:

Rub21 commented 6 months ago

The issue is still persisting. We are currently using a cgimap version from about 10 months ago (https://github.com/zerebubuth/openstreetmap-cgimap/tree/5cd3d21bebe9d205828608be4c65bbda8b464308); there have been a lot of changes since then. I am going to update the version of cgimap and see how it goes.

mmd-osm commented 6 months ago

So what’s inside CGImap log files?

Rub21 commented 6 months ago

Hey @mmd-osm, thanks for taking a look at this issue. Here are the cgimap and JOSM logs: https://gist.github.com/Rub21/6b66bac8bc6912d8398677f3bad8f7b2. During the last upload, a timeout error appeared, which led to some incomplete ways or relations and heavy load, e.g. 👇. However, cgimap does not show any unusual errors.



image

The same issue occurs in both versions:
- 10 months ago: https://github.com/zerebubuth/openstreetmap-cgimap/tree/5cd3d21bebe9d205828608be4c65bbda8b464308,
- 1 month ago: https://github.com/zerebubuth/openstreetmap-cgimap/tree/24709f19da0ece205b8c9e8f2e9a556822236b67

I have already added a ProxyTimeout=600, but it does not help; the issue is still there: https://github.com/OpenHistoricalMap/ohm-deploy/blob/cgimap/images/web/config/production.conf#L29

mmd-osm commented 6 months ago

Thanks for sharing the CGImap log files. From what I can see there, we have some fairly long query times for two queries involving the current_way_nodes table:

[2024-05-14T19:56:28 #311] Started request for map(-77.2450447,-12.1438929,-77.1885681,-12.0937100) from 10.10.47.154
[2024-05-14T19:57:34 #311] Executed prepared statement nodes_from_ways in 65350 ms, returning 1601 rows, 1601 affected rows

Statement:

   SELECT DISTINCT wn.node_id AS id FROM current_way_nodes wn WHERE wn.way_id = ANY($1)
[2024-05-14T20:06:11 #307] Executed prepared statement current_way_nodes_to_history in 55163 ms, returning 0 rows, 18652 affected rows

[2024-05-14T20:07:34 #305] Executed prepared statement current_way_nodes_to_history in 45074 ms, returning 0 rows, 18652 affected rows

Statement:

   INSERT INTO way_nodes (way_id, node_id, version, sequence_id)
       SELECT  way_id, node_id, version, sequence_id 
       FROM current_way_nodes wn
       INNER JOIN current_ways w
       ON wn.way_id = w.id
       WHERE id = ANY($1)

As a result of these long run times, it seems that JOSM is trying to upload to the same changeset another time, while the first upload is still running. This could happen due to the Gateway timeout, in which case JOSM doesn't know if the data has already been successfully uploaded or not.

By the way, if you happen to have less than 10'000 changes to be uploaded, you could try uploading them all at once, instead of using the chunked upload with a chunk size of 2000. This way, the upload would be either complete or fail altogether, rather than some mixed state, where only parts of the changes have been uploaded.

Here's an example involving process ids 305 and 307:

[2024-05-14T20:05:15 #307] Started request for changeset/upload 118101 from 10.10.47.154
[2024-05-14T20:06:27 #305] Started request for changeset/upload 118101 from 10.10.47.154

--> starts another upload for the same changeset in another process

You can see a bit further down that the second process tries to lock changeset 118101 again, but needs to wait for process 307 to finish first. Process 307 spends most of the time in the statement "current_way_nodes_to_history".

[2024-05-14T20:06:48 #305] Executed prepared statement changeset_current_lock in 21433 ms, returning 1 rows, 1 affected rows

Maybe you could investigate in more detail why both statements I mentioned above take so much time. I find this rather unusual.
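
One way to start (a sketch with placeholder way ids rather than the real ones from the upload) would be to re-run the two statements under EXPLAIN directly against the affected database:

     -- Re-run the slow SELECT with timing and buffer statistics.
     EXPLAIN (ANALYZE, BUFFERS)
     SELECT DISTINCT wn.node_id AS id
     FROM current_way_nodes wn
     WHERE wn.way_id = ANY(ARRAY[1234, 1235, 1236]::bigint[]);

     -- Plain EXPLAIN only plans the INSERT without executing it, so it
     -- won't write duplicate history rows while you inspect the plan.
     EXPLAIN
     INSERT INTO way_nodes (way_id, node_id, version, sequence_id)
     SELECT way_id, node_id, version, sequence_id
     FROM current_way_nodes wn
     INNER JOIN current_ways w ON wn.way_id = w.id
     WHERE id = ANY(ARRAY[1234, 1235, 1236]::bigint[]);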

Rub21 commented 6 months ago

Thank you, @mmd-osm. I will check those queries. Also, sometimes a 409 Conflict error is triggered in the cgimap log when trying to upload the changeset again and again. This occurs when attempting to upload 27k objects: https://staging.openhistoricalmap.org/changeset/118113

[2024-05-15T02:04:38 #307] Executed prepared statement get_user_id_pass in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement roles_for_user in 0 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement check_user_blocked in 0 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement is_user_active in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:04:38 #307] Started request for changeset/upload 118113 from 10.10.47.154
[2024-05-15T02:04:38 #307] Executed prepared statement changeset_exists in 6 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement changeset_current_lock in 124835 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement insert_tmp_create_nodes in 3 ms, returning 229 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement lock_current_nodes in 0 ms, returning 229 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement current_nodes_to_history in 6 ms, returning 0 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement current_node_tags_to_history in 1 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement calc_node_bbox in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement insert_tmp_create_ways in 26 ms, returning 7651 rows, 7651 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement lock_current_ways in 16 ms, returning 7651 rows, 7651 affected rows
[2024-05-15T02:06:44 #307] Executed prepared statement lock_future_nodes_in_ways in 228 ms, returning 30226 rows, 30226 affected rows
[2024-05-15T02:06:44 #307] Executed prepared statement insert_new_current_way_tags in 172 ms, returning 0 rows, 18061 affected rows
[2024-05-15T02:06:45 #307] Executed prepared statement insert_new_current_way_nodes in 1071 ms, returning 0 rows, 50813 affected rows
[2024-05-15T02:06:45 #307] Executed prepared statement current_ways_to_history in 112 ms, returning 0 rows, 7651 affected rows
[2024-05-15T02:06:49 #307] Executed prepared statement current_way_tags_to_history in 4102 ms, returning 0 rows, 18061 affected rows
[2024-05-15T02:07:25 #307] Executed prepared statement current_way_nodes_to_history in 35572 ms, returning 0 rows, 50813 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement calc_way_bbox in 26231 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_tmp_create_relations in 3 ms, returning 34 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement lock_current_relations in 0 ms, returning 34 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement lock_future_ways_in_relations in 0 ms, returning 71 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_new_current_relation_tags in 4 ms, returning 0 rows, 163 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_new_current_relation_members in 5 ms, returning 0 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relations_to_history in 6 ms, returning 0 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relation_tags_to_history in 4 ms, returning 0 rows, 163 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relation_members_to_history in 7 ms, returning 0 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement calc_relation_bbox_nodes in 1 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:10:28 #307] Executed prepared statement calc_relation_bbox_ways in 157574 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:10:28 #307] Returning with http error 409 with reason The changeset 118113 was closed at 2024-05-15 02:04:38 UTC

mmd-osm commented 6 months ago

You can’t upload more than 10k changes in a single changeset, unless you have increased the default value. JOSM should automatically split 27k changes into 3 changesets.

Rub21 commented 6 months ago

Yes, JOSM split the changeset into three chunks. The first two went okay, but the last one threw a 409 error.

Rub21 commented 6 months ago

For now, I have updated the configuration, increasing the allowed number of objects (https://github.com/OpenHistoricalMap/ohm-website/blob/staging/config/settings.yml#L36-L40), along with adding the ProxyTimeout.

mmd-osm commented 6 months ago

I commented on the ProxyTimeout here as well: https://github.com/OpenHistoricalMap/ohm-website/pull/245#issuecomment-2126722651

From the log above it seems that something is terminating the connection after exactly 60s. I think that's caused by the FastCGI proxy. There are some settings to control the wait time, like FcgidIOTimeout ("The FastCGI application must begin generating the response within this period of time. Increase this directive as necessary to handle applications which take a relatively long period of time to respond."). In this scenario, CGImap would only start sending data after all database statements have been processed. If this is taking so much time in your environment, the FastCGI proxy might terminate the connection too early.

Then, like I've mentioned before, it makes sense to look into PostgreSQL response times a bit more closely and do some more tracing / analysis.

2024-05-14 15:09:13.172 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118102/upload -> HTTP/1.1 504 (1 min 0 s; 160 B)
2024-05-14 15:09:13.177 INFO: Gateway Time-out
batpad commented 6 months ago

cc @bitner to potentially get help with your 👀 on some of the postgres stuff here.

bitner commented 5 months ago

At first look, these settings should be updated in the system conf (this is looking at staging, which appears to have 8 GB of RAM):

    shared_buffers = 2GB              # 1/4 system memory
    effective_cache_size = 6GB        # 3/4 system memory
    maintenance_work_mem = 512MB
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 20MB
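
The same values can also be applied with ALTER SYSTEM instead of editing postgresql.conf directly (a sketch, assuming superuser access; shared_buffers only takes effect after a full restart, the others after a reload):

    ALTER SYSTEM SET shared_buffers = '2GB';            -- requires a restart
    ALTER SYSTEM SET effective_cache_size = '6GB';
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    ALTER SYSTEM SET random_page_cost = 1.1;
    ALTER SYSTEM SET effective_io_concurrency = 200;
    ALTER SYSTEM SET work_mem = '20MB';
    SELECT pg_reload_conf();   -- picks up everything except shared_buffers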

bitner commented 5 months ago

For this query:

     SELECT  way_id, node_id, version, sequence_id 
     FROM current_way_nodes wn
     INNER JOIN current_ways w
     ON wn.way_id = w.id
     WHERE id = ANY($1);

Changing to

     SELECT  way_id, node_id, version, sequence_id 
     FROM current_way_nodes wn
     INNER JOIN current_ways w
     ON wn.way_id = w.id
     WHERE way_id = ANY($1);

seems to make a pretty huge difference on the list of way ids that @Rub21 gave me for slow queries

Rub21 commented 5 months ago

@bitner, here are some expensive queries, in the comments of the gist: https://gist.github.com/Rub21/18e89c26f3148132d1e57c6f438fedb2

bitner commented 5 months ago

For this case:

WITH new_way_nodes(way_id, node_id, sequence_id) AS (
             SELECT * FROM
             UNNEST( CAST($1 AS bigint[]),
                     CAST($2 AS bigint[]),
                     CAST($3 AS bigint[])
                  )
          )
          INSERT INTO current_way_nodes (way_id, node_id, sequence_id)
          SELECT * FROM new_way_nodes

Where are the three arrays coming from? If you could zip them together on the client side to create the three columns and then use COPY to load that into current_way_nodes, rather than exploding and reassembling the arrays into a table, it would be much more efficient.
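
A rough sketch of that idea at the SQL level (assuming the client can stream the zipped rows as CSV; this is not how cgimap currently does it):

     -- Stream pre-zipped (way_id, node_id, sequence_id) rows in one COPY,
     -- instead of unnesting three parallel arrays into a row set first.
     COPY current_way_nodes (way_id, node_id, sequence_id)
     FROM STDIN WITH (FORMAT csv);
     -- The client then sends rows such as:
     --   123,456,1
     --   123,789,2
     -- and ends the stream with \.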

bitner commented 5 months ago

It does look like changing random_page_cost from 4 to 1.1 helps this query quite a bit, allowing the planner to choose the index rather than a sequential scan.

select distinct wn.node_id as id from current_way_nodes wn where wn.way_id = ANY($1);
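
A sketch for checking the plan change in a single session before making the setting permanent (the way ids are placeholders):

     SET random_page_cost = 4;     -- current default: planner tends to pick a seq scan
     EXPLAIN SELECT DISTINCT wn.node_id AS id
     FROM current_way_nodes wn
     WHERE wn.way_id = ANY(ARRAY[1234, 1235, 1236]::bigint[]);

     SET random_page_cost = 1.1;   -- SSD-friendly value: the index becomes cheaper
     EXPLAIN SELECT DISTINCT wn.node_id AS id
     FROM current_way_nodes wn
     WHERE wn.way_id = ANY(ARRAY[1234, 1235, 1236]::bigint[]);
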
jeffreyameyer commented 5 months ago

@bitner - many thanks for these suggestions. Really appreciated! Next time I'm in MSP, wild rice and walleye on me.

mmd-osm commented 5 months ago

If you could zip them together on the client side to create the three columns and then use COPY to load that into current_way_nodes, rather than exploding and reassembling the arrays into a table, it would be much more efficient.

I tested this approach in the real code but couldn't see any significant performance improvement. For the time being, I'd recommend tuning the PostgreSQL settings. In any case, we can't reproduce the issues OHM is experiencing on OpenStreetMap production or test systems, even though the data volume there is several orders of magnitude larger.

Rub21 commented 5 months ago

I have updated the values in the production DB for 10 GB of RAM: https://github.com/OpenHistoricalMap/ohm-deploy/pull/332. Let's see how it goes.

Rub21 commented 5 months ago

I found the issue related to the 504 (1 min 0 s; 160 B) errors.

2024-05-14 15:10:24.669 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118102/upload -> HTTP/1.1 504 (1 min 0 s; 160 B)
2024-05-14 15:10:24.681 INFO: Gateway Time-out

It is related to the ingress in Kubernetes. By default, the ingress proxy timeout is 1 minute, which is why it was cutting off uploads that took more than 1 minute. I have increased the proxy timeout to 10 minutes; that should be enough.

jeffreyameyer commented 5 months ago

I'll test with some large datasets and see what happens.

jeffreyameyer commented 5 months ago

Hmmm... I haven't really been uploading any "large" changesets, but it does appear we have a problem with them not being closed.

Monosnap Open changesets 2024-06-26 18-12-42
mmd-osm commented 5 months ago

JOSM has a setting in the upload dialog to close changesets right away after uploading, or keep them open for longer.

jeffreyameyer commented 4 months ago

Ok... I just uploaded a lot of data using JOSM and it all went very well, until the very end, when JOSM hung on the last upload set.

Monosnap Uploading data for layer 'Data Layer 1' 2024-07-06 09-56-21

There's a fair amount of discussion on upload-related issues on Discord right now.

jeffreyameyer commented 4 months ago

Now, JOSM is unable to close changesets... I've already asked it to close these and the API timed out:

Monosnap Open changesets 2024-07-06 10-25-46 Monosnap Communication with OSM server failed 2024-07-06 10-31-00

This is showing even though the API seems to be working for iD and other edits.

jeffreyameyer commented 4 months ago

whoops... the site says it is under heavy load...

jeffreyameyer commented 4 months ago

Ok, website is back up, and I have a single changeset to close, but JOSM is unable to close that changeset.

@mmd-osm - I have JOSM set to close open changesets after uploading, but when the attempted uploads are interrupted, it never gets to the end of the upload and leaves the changeset open.

jeffreyameyer commented 4 months ago

Ok, even with non-large (e.g., a few hundred objects) uploads, I'm running into the following behavior:

This sounds bizarre, but I'm wondering if the requests to close changesets are bringing the site down. As soon as JOSM stopped trying to close a changeset just now, the site came back up. Maybe coincidence?

DavidJDBA commented 4 months ago

"Heavy load" messages yesterday, but not this morning. However...

In a JOSM session, I completed two or three rapid and successful uploads this morning. Then this:

image

Despite the error, changeset 127444 appears to be complete and in the database. All changes involved preexisting highways and included:

    1. creating relations for highway east and west directions
    2. combining those into a single parent relation for that highway segment
    3. in one of the successful uploads, adding those parent relations to a statewide highway relation for a specific time period

The error message appeared after downloading a separate area, about 100 miles east of the first, into the same layer, completing steps 1 and 2 above, and then uploading.

jeffreyameyer commented 4 months ago

Ok... this may be related or unrelated, but I'm having problems when deleting 4 relations in 1 changeset.

Here's what happens (in JOSM):

Monosnap Uploading data for layer 'Data Layer 1' 2024-07-07 11-43-50 Monosnap Uploading data for layer 'Data Layer 1' 2024-07-07 11-54-27
DavidJDBA commented 4 months ago

Uploaded this I-70 relation and encountered the usual slowness and error message. Only did this once, but new copies keep appearing...

image

P.S. This may have resulted from running the upload in the Background (unintentional button push). Multiple instances stopped appearing after killing JOSM and rebooting. Cleaning this up now.

Rub21 commented 4 months ago

Doing an evaluation with the same large changeset as Jeff did.

Here is my configuration:

Screenshot 2024-07-07 at 9 58 19 PM Screenshot 2024-07-07 at 10 03 00 PM Screenshot 2024-07-07 at 10 03 10 PM

My understanding of how it works is as follows:

1.  First, the points are uploaded; this generates the point IDs.
2.  Second, the ways are uploaded, with the point IDs added into each way; each way gets its own way ID.
3.  Third and last, the relations are uploaded, which are built from the ways; each relation gets its own relation ID.

With that said, here is what happens when uploading a large number of objects in my case:

The first changeset that CGIMAP creates is: https://www.openhistoricalmap.org/changeset/127806. Here are the logs in CGIMAP:

[2024-07-08T03:03:23 #337] Started request for changeset/upload 127806 from 10.10.2.240
[2024-07-08T03:03:23 #337] Executed prepared statement changeset_exists in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement changeset_current_lock in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement insert_tmp_create_nodes in 44 ms, returning 10000 rows, 10000 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement lock_current_nodes in 21 ms, returning 10000 rows, 10000 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement current_nodes_to_history in 142 ms, returning 0 rows, 10000 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement current_node_tags_to_history in 25 ms, returning 0 rows, 0 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement calc_node_bbox in 16 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:23 #337] Executed prepared statement changeset_update_w_bbox in 1 ms, returning 0 rows, 1 affected rows
[2024-07-08T03:03:23 #337] Completed request for changeset/upload 127806 from 10.10.2.240 in 422 ms returning 620304 bytes
[2024-07-08T03:03:25 #342] Executed prepared statement get_user_id_pass in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:25 #342] Executed prepared statement roles_for_user in 0 ms, returning 0 rows, 0 affected rows
[2024-07-08T03:03:25 #342] Executed prepared statement check_user_blocked in 0 ms, returning 0 rows, 0 affected rows
[2024-07-08T03:03:25 #342] Executed prepared statement is_user_active in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:25 #342] Started request for changeset/upload 127806 from 10.10.2.240
[2024-07-08T03:03:25 #342] Executed prepared statement changeset_exists in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:25 #342] Executed prepared statement changeset_current_lock in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:03:25 #342] Returning with http error 409 with reason The changeset 127806 was closed at 2024-07-08 03:03:25 UTC

The last line shows: Returning with http error 409 with reason The changeset 127806 was closed at 2024-07-08 03:03:25 UTC, which I think is not an error, since all 10,000 points had already been uploaded.

The logs look the same up to https://www.openhistoricalmap.org/changeset/127828, which is still uploading nodes.

In the changeset https://www.openhistoricalmap.org/changeset/127829, the ways (2,201 way objects) and relations (262 relation objects) start uploading to the database.

Here are the CGIMAP logs: https://gist.github.com/Rub21/d6eb1122423b347fe1f421b25f97649e. From the logs, these statements take a lot of time to complete:

[2024-07-08T03:12:45 #22673] Executed prepared statement calc_relation_bbox_ways in 438161 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:20:42 #16749] Executed prepared statement calc_relation_bbox_ways in 416562 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:34:11 #337] Executed prepared statement calc_relation_bbox_ways in 265212 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:38:43 #338] Executed prepared statement calc_relation_bbox_ways in 262971 ms, returning 1 rows, 1 affected rows

and

[2024-07-08T03:05:18 #22673] Executed prepared statement changeset_current_lock in 0 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:12:45 #16749] Executed prepared statement changeset_current_lock in 378328 ms, returning 1 rows, 1 affected rows
[2024-07-08T03:34:11 #338] Executed prepared statement changeset_current_lock in 1400022 ms, returning 1 rows, 1 affected rows

JOSM logs: https://gist.github.com/Rub21/619ca9ce925f851c62b59222dd06edbf

Also, something important to consider: in the changeset, this message stays stuck for around 1 hour.

image

And according to the CGIMAP logs, it takes around 35 minutes to finish uploading the objects: the first changeset starts at 03:03:23 and the last one completes at 03:38:43.

This is the first evaluation, and we know that the bottleneck is in some function in CGIMAP. Also, our ingress and proxy timeouts are currently configured for 10 minutes, but it seems they need to be increased. We should also check whether CGIMAP needs more resources.

mmd-osm commented 4 months ago

Thank you for the detailed analysis. As a reminder, whatever CGImap writes to the log files in terms of „Executing prepared statement..“ (and others) is actually almost exclusively happening on the PostgreSQL side of the house.

So my main question would be: where did the database spend all the time? It could be I/O-heavy activity like vacuum going on in parallel, I/O that is in general too slow for the data volume, not enough resources for the DB, issues due to config settings, etc. This is super difficult to diagnose from outside.
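
A few standard PostgreSQL statistics views can at least narrow down where the time goes (a sketch; nothing here is specific to the OHM deployment):

     -- What is running right now, and what is each backend waiting on?
     SELECT pid, state, wait_event_type, wait_event,
            now() - query_start AS runtime, left(query, 80) AS query
     FROM pg_stat_activity
     WHERE state <> 'idle'
     ORDER BY runtime DESC;

     -- Is autovacuum hitting the upload-heavy tables while changesets are open?
     SELECT * FROM pg_stat_progress_vacuum;

     -- Scan counts, dead tuples and last autovacuum for the tables used by uploads.
     SELECT relname, seq_scan, idx_scan, n_dead_tup, last_autovacuum
     FROM pg_stat_user_tables
     WHERE relname IN ('current_way_nodes', 'current_ways', 'way_nodes');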

For additional context, here is how the upload performs on osm.org production: https://prometheus.osm.org/d/5rTT87FMk/web-site?orgId=1&refresh=1m&viewPanel=16 -> „upload“.

I don’t know if the test data is available for download in osm xml format. We could give it a try on the osm dev instance to have another reference point.

jeffreyameyer commented 4 months ago

I can echo @DavidJDBA 's observation that duplicate relations are getting created. I've seen this on numerous occasions.

You can see it visually here... the lighter areas are where there is a single relation. The darker areas are where there are multiple.

Monosnap OpenHistoricalMap 2024-07-08 07-20-48
Rub21 commented 4 months ago

Hey @mmd-osm, thanks for the clarification about what is executing in the DB. I will look into adding more resources for the database and adjusting the configuration.

I don’t know if the test data is available for download in osm xml format. We could give it a try on the osm dev instance to have another reference point.

Here is the data to upload for testing; it just requires opening it in JOSM and starting the upload: natural_osm_test.osm.zip

Rub21 commented 4 months ago

Current status of resources for the production DB: the production API DB is using an m5.2xlarge machine (32.0 GiB, 8 vCPUs) together with the tiler DB. Our Kubernetes configuration allows both databases to use as much as required, but it seems that when there are large changesets, both databases start requiring more resources: first when the changeset starts updating, and then when the tiler starts running the cache cleaner.

This configuration was set up two years ago, and it seems it now needs to be updated.

What I could propose:

@jeffreyameyer @batpad @danrademacher what do you think?

mmd-osm commented 4 months ago

@Rub21 : thanks a lot for the test data. I was able to upload the sample data in 42 seconds on the OSM dev instance. For reference, I have attached the JOSM console output below. You can also find the changesets here: https://master.apis.dev.openstreetmap.org/user/mmd2mod/history or check them through the "Map Data" view: https://master.apis.dev.openstreetmap.org/#map=14/34.7014/-79.3098&layers=D

log_sample_upload.txt

Rub21 commented 4 months ago

@mmd-osm that is fast. Can I ask what the CPU and RAM sizes are for OSM production and dev?

mmd-osm commented 4 months ago

You can find all hardware specs here: https://hardware.openstreetmap.org/ The dev server ("faffy") has the following specs: https://hardware.openstreetmap.org/servers/faffy.openstreetmap.org/

OSM production uses spike-0[6..8] as frontend servers, and snap-01 as primary db server.

I think it would be prohibitively expensive to run a site with these specs in a cloud environment. Here we're talking about hosting osmf-owned hardware in different data centers.

jeffreyameyer commented 4 months ago

@mmd-osm - ha! It looks like the primary database server has 504GB of memory. Given that we're having issues running on 16GB, is that the likely source of our issues or are there still postgis bottlenecks independent of memory...?

jeffreyameyer commented 4 months ago

and... as always... thanks for your help!

Rub21 commented 4 months ago

Jeff, as you mention, the OSM database has 48 CPUs and 504 GB of RAM, and each API server is using 32 CPUs and 63 GB of RAM.

mmd-osm commented 4 months ago

PostgreSQL could be slow due to limited memory. Given that we have seen severe issues with massive uploads, I'm also suspecting that there might be further issues due to slow I/O. We might be hitting the limits of what the storage system can handle.

OSM production is now at almost 13TB in total database size. So it's very unlikely you would need anything close to 504GB of memory for the db.

mmd-osm commented 4 months ago

By the way, you can also check out CPU and RAM usage for all servers. The three frontend servers are running a Rails port instance next to CGImap. CGImap itself has super low CPU + RAM requirements, you could even run it on a Raspberry Pi. Where you need resources is really on the database server.

https://prometheus.openstreetmap.org/d/Ea3IUVtMz/host-overview?orgId=1&refresh=1m&var-instance=spike-08

Rub21 commented 4 months ago

In terms of DB size, we are at 92 GB out of 600 GB. The EBS family we are using is gp2, which is for general purpose. It offers a balance between cost and performance, and according to the documentation it is good for small and medium databases.

What would be required is to use the EBS family io2 or gp3, which is recommended for the amount of data we are handling.

jeffreyameyer commented 4 months ago

Let's keep this open until after we've changed resources and done additional testing. There are some oddities in this system (e.g., loading the first few groups of XXX in an upload quickly, then always hanging on the last group, and changesets not closing) that don't seem resource-related, but I could be wrong.

mmd-osm commented 4 months ago

It is kind of expected that the first n uploads are very quick, and the very last one(s) take most of the time. It's simply due to the way JOSM sorts the data for upload: it will start with nodes, later followed by ways and relations.

Nodes are fairly cheap to process, in particular if they don't have any tags (like some >90% of all nodes). Ways and relations both have more tags, and more importantly quite a number of way nodes or relation members in the worst case.

In the log files I've posted earlier on, you can see that each upload took around 500ms, except for the last one, which took around 10s. I would assume that the number of rows to be inserted in the db is about 10x larger in that case. Now with the slower processing in your environment, the last changeset is likely taking much more time than those 10s.

jeffreyameyer commented 4 months ago

Gotcha. So, the client loads a queue first, then the queue is processed all at once on the server.

But... would that fully explain why we have changesets closing on the server and the client not receiving notification? Or, the "Failed to open a connection" error when attempting to close a changeset?
