OpenHistoricalMap / issues

Errors with Proxy_fcgi and CGImap When Uploading Large Data Sets #783

Open Rub21 opened 1 month ago

Rub21 commented 1 month ago

From: https://developmentseed.slack.com/archives/CACTQ7MU4/p1715276861543709 (screenshot attached)

We're experiencing timeout errors (proxy_fcgi:error) with our Apache and cgimap setup when uploading large datasets to our OHM API. During large data uploads, the server frequently logs proxy_fcgi:error, indicating that the specified timeout has expired. This error seems to occur mainly when the load on the server is high.

App 8444 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[ N 2024-05-10 15:41:05.8218 349/T3 age/Cor/CoreMain.cpp:1146 ]: Checking whether to disconnect long-running connections for process 8536, application /var/www (production)
App 9778 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[Fri May 10 15:46:07.882219 2024] [proxy_fcgi:error] [pid 360:tid 140526506424000] (70007)The timeout specified has expired: [client 10.10.47.154:38824] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:47:19.762287 2024] [proxy_fcgi:error] [pid 359:tid 140526422505152] (70007)The timeout specified has expired: [client 10.10.47.154:43766] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:48:31.410450 2024] [proxy_fcgi:error] [pid 360:tid 140526600832704] (70007)The timeout specified has expired: [client 10.10.47.154:56726] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:49:43.226386 2024] [proxy_fcgi:error] [pid 359:tid 140526453974720] (70007)The timeout specified has expired: [client 10.10.47.154:49158] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:50:54.714215 2024] [proxy_fcgi:error] [pid 360:tid 140526516913856] (70007)The timeout specified has expired: [client 10.10.47.154:56044] AH01075: Error dispatching request to : (polling)
[Fri May 10 15:52:06.750295 2024] [proxy_fcgi:error] [pid 359:tid 140526537893568] (70007)The timeout specified has expired: [client 10.10.47.154:45536] AH01075: Error dispatching request to : (polling)
App 11208 output: WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
[ W 2024-05-10 16:36:17.4959 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29834] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:37:41.8413 349/T7 age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 1-29835] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:02.0024 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29835] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:06.3191 349/T7 age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 1-29836] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
[ W 2024-05-10 16:38:42.0237 349/Tb age/Cor/Con/CheckoutSession.cpp:265 ]: [Client 2-29836] Returning HTTP 503 due to: Request queue full (configured max. size: 100)

Full log in /var/log/apache2/error.log: https://gist.github.com/Rub21/87fd55f82a5c83756cc81fd6385c8daf

cc. @jeffreyameyer @1ec5 @batpad @danrademacher

jeffreyameyer commented 1 month ago

Is fixing this as simple as adjusting the timeout duration, or is there another underlying problem?

Rub21 commented 1 month ago

> Is fixing this as simple as adjusting the timeout duration, or is there another underlying problem?

Not really; I have already tried increasing the timeout, but it did not help. I'm still trying to find out where exactly the error is.

@batpad, do you have any hunch about this issue?

Rub21 commented 1 month ago

I've been running many tests, uploading large quantities of objects to the API. When there's a large amount of data to process, a 504 error appears from the API, specifically from CGImap. Here are JOSM's logs: https://gist.github.com/Rub21/1bb46bb004f89fdda787fc96b896f2a3

2024-05-13 10:56:50.809 INFO: Gateway Time-out
2024-05-13 10:56:50.809 INFO: Waiting 10 seconds ...
2024-05-13 10:57:00.852 INFO: OK - trying again.
2024-05-13 10:57:00.852 INFO: Starting retry 2 of 5.
2024-05-13 10:57:00.871 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118027/upload (335 kB) ...
2024-05-13 10:58:01.985 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118027/upload -> HTTP/1.1 504 (1 min 0 s; 160 B)
2024-05-13 10:58:02.003 INFO: Gateway Time-out
2024-05-13 10:58:02.003 INFO: Waiting 10 seconds ...

When the web API and the database are freshly started, everything works well and the API handles a large amount of data. However, as more data is gradually added, it starts to throw 504 errors. I checked the database, and it seems the bottleneck is the number of connections to the database: as more and more data is uploaded, cgimap starts throwing the error.

I have increased the number of connections once more.
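
A quick way to confirm whether the connection pool is really the bottleneck is to watch pg_stat_activity while an upload is running. A minimal sketch, assuming direct psql access to the API database (the queries below are illustrative, not taken from the thread):

    -- How many backends are open, and in which state?
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;

    -- Compare against the configured ceiling.
    SHOW max_connections;

    -- Longest-running non-idle statements during an upload.
    SELECT pid, state, wait_event_type, wait_event,
           now() - query_start AS runtime,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;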

Rub21 commented 1 month ago

The issue is still persisting. We are currently using a cgimap version from about 10 months ago (https://github.com/zerebubuth/openstreetmap-cgimap/tree/5cd3d21bebe9d205828608be4c65bbda8b464308), and there have been a lot of changes since then. I am going to update cgimap and see how it goes.

mmd-osm commented 1 month ago

So what’s inside CGImap log files?

Rub21 commented 1 month ago

Hey @mmd-osm, thanks for taking a look at this issue. Here are the cgimap and JOSM logs: https://gist.github.com/Rub21/6b66bac8bc6912d8398677f3bad8f7b2. During the last upload, a timeout error appeared, which led to some incomplete ways or relations and heavy load, e.g. 👇. However, cgimap does not show any unusual errors.



<img width="778" alt="image" src="https://github.com/OpenHistoricalMap/issues/assets/1152236/9bc21164-333b-4fa7-a2d1-78b0704d0a9f">

The same issue occurs with both versions:
- 10 months ago: https://github.com/zerebubuth/openstreetmap-cgimap/tree/5cd3d21bebe9d205828608be4c65bbda8b464308
- 1 month ago: https://github.com/zerebubuth/openstreetmap-cgimap/tree/24709f19da0ece205b8c9e8f2e9a556822236b67

I have already added ProxyTimeout=600, but it has not helped; the issue is still there: https://github.com/OpenHistoricalMap/ohm-deploy/blob/cgimap/images/web/config/production.conf#L29

mmd-osm commented 1 month ago

Thanks for sharing the CGImap log files. From what I can see there, we have some fairly long query times for two queries involving the current_way_nodes table:

[2024-05-14T19:56:28 #311] Started request for map(-77.2450447,-12.1438929,-77.1885681,-12.0937100) from 10.10.47.154
[2024-05-14T19:57:34 #311] Executed prepared statement nodes_from_ways in 65350 ms, returning 1601 rows, 1601 affected rows

Statement:

   SELECT DISTINCT wn.node_id AS id FROM current_way_nodes wn WHERE wn.way_id = ANY($1)
[2024-05-14T20:06:11 #307] Executed prepared statement current_way_nodes_to_history in 55163 ms, returning 0 rows, 18652 affected rows

[2024-05-14T20:07:34 #305] Executed prepared statement current_way_nodes_to_history in 45074 ms, returning 0 rows, 18652 affected rows

Statement:

   INSERT INTO way_nodes (way_id, node_id, version, sequence_id)
       SELECT  way_id, node_id, version, sequence_id 
       FROM current_way_nodes wn
       INNER JOIN current_ways w
       ON wn.way_id = w.id
       WHERE id = ANY($1)

As a result of these long run times, it seems that JOSM is trying to upload to the same changeset another time, while the first upload is still running. This could happen due to the Gateway timeout, in which case JOSM doesn't know if the data has already been successfully uploaded or not.

By the way, if you happen to have less than 10'000 changes to be uploaded, you could try uploading them all at once, instead of using the chunked upload with a chunk size of 2000. This way, the upload would be either complete or fail altogether, rather than some mixed state, where only parts of the changes have been uploaded.

Here's an example involving process ids 305 and 307:

[2024-05-14T20:05:15 #307] Started request for changeset/upload 118101 from 10.10.47.154
[2024-05-14T20:06:27 #305] Started request for changeset/upload 118101 from 10.10.47.154

--> starts another upload for the same changeset in another process

You can see a bit further down that the second process tries to lock changeset 118101 again, but needs to wait for process 307 to finish first. Process 307 spends most of the time in the statement "current_way_nodes_to_history".

[2024-05-14T20:06:48 #305] Executed prepared statement changeset_current_lock in 21433 ms, returning 1 rows, 1 affected rows

Maybe you could investigate in more detail why both statements I mentioned above take so much time. I find this rather unusual.
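
One way to dig into that is to replay the two statements by hand under EXPLAIN. A rough sketch, assuming psql access and substituting placeholder way ids for the $1 parameter:

    -- nodes_from_ways, with a sample of the slow way ids:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT wn.node_id AS id
    FROM current_way_nodes wn
    WHERE wn.way_id = ANY(ARRAY[1001, 1002, 1003]::bigint[]);

    -- current_way_nodes_to_history, wrapped in a transaction that is rolled
    -- back so the EXPLAIN ANALYZE run does not actually write history rows:
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    INSERT INTO way_nodes (way_id, node_id, version, sequence_id)
        SELECT way_id, node_id, version, sequence_id
        FROM current_way_nodes wn
        INNER JOIN current_ways w
        ON wn.way_id = w.id
        WHERE id = ANY(ARRAY[1001, 1002, 1003]::bigint[]);
    ROLLBACK;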

Rub21 commented 1 month ago

Thank you, @mmd-osm. I will check those queries. Also, sometimes a 409 Conflict error is triggered in the cgimap log when the changeset is uploaded again and again. This occurs when attempting to upload 27k objects: https://staging.openhistoricalmap.org/changeset/118113

[2024-05-15T02:04:38 #307] Executed prepared statement get_user_id_pass in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement roles_for_user in 0 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement check_user_blocked in 0 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:04:38 #307] Executed prepared statement is_user_active in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:04:38 #307] Started request for changeset/upload 118113 from 10.10.47.154
[2024-05-15T02:04:38 #307] Executed prepared statement changeset_exists in 6 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement changeset_current_lock in 124835 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement insert_tmp_create_nodes in 3 ms, returning 229 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement lock_current_nodes in 0 ms, returning 229 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement current_nodes_to_history in 6 ms, returning 0 rows, 229 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement current_node_tags_to_history in 1 ms, returning 0 rows, 0 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement calc_node_bbox in 0 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement insert_tmp_create_ways in 26 ms, returning 7651 rows, 7651 affected rows
[2024-05-15T02:06:43 #307] Executed prepared statement lock_current_ways in 16 ms, returning 7651 rows, 7651 affected rows
[2024-05-15T02:06:44 #307] Executed prepared statement lock_future_nodes_in_ways in 228 ms, returning 30226 rows, 30226 affected rows
[2024-05-15T02:06:44 #307] Executed prepared statement insert_new_current_way_tags in 172 ms, returning 0 rows, 18061 affected rows
[2024-05-15T02:06:45 #307] Executed prepared statement insert_new_current_way_nodes in 1071 ms, returning 0 rows, 50813 affected rows
[2024-05-15T02:06:45 #307] Executed prepared statement current_ways_to_history in 112 ms, returning 0 rows, 7651 affected rows
[2024-05-15T02:06:49 #307] Executed prepared statement current_way_tags_to_history in 4102 ms, returning 0 rows, 18061 affected rows
[2024-05-15T02:07:25 #307] Executed prepared statement current_way_nodes_to_history in 35572 ms, returning 0 rows, 50813 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement calc_way_bbox in 26231 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_tmp_create_relations in 3 ms, returning 34 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement lock_current_relations in 0 ms, returning 34 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement lock_future_ways_in_relations in 0 ms, returning 71 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_new_current_relation_tags in 4 ms, returning 0 rows, 163 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement insert_new_current_relation_members in 5 ms, returning 0 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relations_to_history in 6 ms, returning 0 rows, 34 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relation_tags_to_history in 4 ms, returning 0 rows, 163 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement current_relation_members_to_history in 7 ms, returning 0 rows, 71 affected rows
[2024-05-15T02:07:51 #307] Executed prepared statement calc_relation_bbox_nodes in 1 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:10:28 #307] Executed prepared statement calc_relation_bbox_ways in 157574 ms, returning 1 rows, 1 affected rows
[2024-05-15T02:10:28 #307] Returning with http error 409 with reason The changeset 118113 was closed at 2024-05-15 02:04:38 UTC
mmd-osm commented 1 month ago

You can’t upload more than 10k changes in a single changeset, unless you have increased the default value. JOSM should automatically split 27k changes into 3 changesets.

Rub21 commented 1 month ago

Yes, JOSM split the changeset into three chunks. The first two went okay, but the last one threw a 409 error.

Rub21 commented 3 weeks ago

For now, I have updated the configuration to increase the number of objects allowed (https://github.com/OpenHistoricalMap/ohm-website/blob/staging/config/settings.yml#L36-L40), in addition to adding the ProxyTimeout.

mmd-osm commented 3 weeks ago

I commented on the ProxyTimeout here as well: https://github.com/OpenHistoricalMap/ohm-website/pull/245#issuecomment-2126722651

From the log above it seems that something is terminating the connection after exactly 60s. I think that's caused by the FastCGI proxy. There are some settings to control the wait time, like FcgidIOTimeout ("The FastCGI application must begin generating the response within this period of time. Increase this directive as necessary to handle applications which take a relatively long period of time to respond."). In this scenario, CGImap would only start sending data after all database statements have been processed. If this is taking so much time in your environment, the FastCGI proxy might terminate the connection too early.

Then, like I've mentioned before, it makes sense to look into PostgreSQL response times a bit closer and do some more tracing / analysis.

2024-05-14 15:09:13.172 INFO: POST https://staging.openhistoricalmap.org/api/0.6/changeset/118102/upload -> HTTP/1.1 504 (1 min 0 s; 160 B)
2024-05-14 15:09:13.177 INFO: Gateway Time-out
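
To follow up on tracing PostgreSQL response times: if pg_stat_statements can be enabled on staging (an assumption — it needs shared_preload_libraries = 'pg_stat_statements' and a server restart), it gives a ranking of the slowest statements without any manual replay. A sketch, with column names as in PostgreSQL 13+:

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Slowest statements by mean execution time.
    SELECT calls,
           round(mean_exec_time)  AS mean_ms,
           round(total_exec_time) AS total_ms,
           left(query, 100)       AS query
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;
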
batpad commented 2 weeks ago

cc @bitner to potentially get your 👀 on some of the postgres stuff here.

bitner commented 1 week ago

At first look, these settings should be updated in the system conf (this is looking at staging, which appears to have 8 GB of RAM):

    shared_buffers = 2GB              # 1/4 system memory
    effective_cache_size = 6GB        # 3/4 system memory
    maintenance_work_mem = 512MB
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 20MB
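
One way to apply these without hand-editing postgresql.conf is ALTER SYSTEM. A sketch, assuming superuser access to the staging instance (shared_buffers only takes effect after a restart; the rest can be picked up with a reload):

    ALTER SYSTEM SET shared_buffers = '2GB';             -- requires a server restart
    ALTER SYSTEM SET effective_cache_size = '6GB';
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    ALTER SYSTEM SET random_page_cost = 1.1;
    ALTER SYSTEM SET effective_io_concurrency = 200;
    ALTER SYSTEM SET work_mem = '20MB';
    SELECT pg_reload_conf();                             -- applies everything except shared_buffers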

bitner commented 1 week ago

For this query:

     SELECT  way_id, node_id, version, sequence_id 
     FROM current_way_nodes wn
     INNER JOIN current_ways w
     ON wn.way_id = w.id
     WHERE id = ANY($1);

Changing to

     SELECT  way_id, node_id, version, sequence_id 
     FROM current_way_nodes wn
     INNER JOIN current_ways w
     ON wn.way_id = w.id
     WHERE way_id = ANY($1);

seems to make a pretty huge difference on the list of way ids that @Rub21 gave me for the slow queries.
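
The rewritten predicate can be served directly by an index on current_way_nodes.way_id (in the standard rails/cgimap schema the table's primary key starts with way_id — an assumption here, worth verifying). A quick check of which indexes actually exist:

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'current_way_nodes';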

Rub21 commented 1 week ago

@bitner, here are some expensive queries, in the comments of the gist: https://gist.github.com/Rub21/18e89c26f3148132d1e57c6f438fedb2

bitner commented 1 week ago

For this case:

WITH new_way_nodes(way_id, node_id, sequence_id) AS (
             SELECT * FROM
             UNNEST( CAST($1 AS bigint[]),
                     CAST($2 AS bigint[]),
                     CAST($3 AS bigint[])
                  )
          )
          INSERT INTO current_way_nodes (way_id, node_id, sequence_id)
          SELECT * FROM new_way_nodes

Where are the three arrays coming from? If you could zip them together on the client side to create the three columns, and then use COPY to load that into current_way_nodes rather than exploding and reassembling the arrays into a table, you could be much more efficient.
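
A rough sketch of that suggestion, assuming the zipped rows are written to a hypothetical way_nodes.csv on the client and loaded with psql's client-side \copy (application code would use the equivalent COPY ... FROM STDIN over libpq):

    -- way_nodes.csv already holds the three columns zipped together:
    --   way_id,node_id,sequence_id
    \copy current_way_nodes (way_id, node_id, sequence_id) FROM 'way_nodes.csv' WITH (FORMAT csv)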

bitner commented 1 week ago

It does look like changing random_page_cost from 4 to 1.1 helps this query quite a bit, allowing the planner to choose the index rather than a sequential scan.

    select distinct wn.node_id as id from current_way_nodes wn where wn.way_id = ANY($1);
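
A session-level test makes it easy to compare plans before changing the setting globally. A sketch, with placeholder way ids in place of $1:

    SET random_page_cost = 1.1;            -- affects only this session
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT wn.node_id AS id
    FROM current_way_nodes wn
    WHERE wn.way_id = ANY(ARRAY[1001, 1002, 1003]::bigint[]);
    RESET random_page_cost;
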
jeffreyameyer commented 4 days ago

@bitner - many thanks for these suggestions. Really appreciated! Next time I'm in MSP, wild rice and walleye on me.

mmd-osm commented 3 hours ago

> If you could zip them together on the client side to create the three columns, and then use COPY to load that into current_way_nodes rather than exploding and reassembling the arrays into a table, you could be much more efficient.

I tested this approach in the real code but couldn't see any significant performance improvement. For the time being, I'd recommend tuning the PostgreSQL settings. In any case, we can't reproduce the issues OHM is experiencing on OpenStreetMap production or test systems, even though the data volume there is several orders of magnitude larger.