OpenHistoricalMap / issues


Unable to save changesets – 503 error #633

Closed: 1ec5 closed this issue 3 months ago

1ec5 commented 10 months ago

Saving any changeset within iD triggers the following HTTP 503 error, and no changeset is saved:

503 Service Unavailable

Service Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Apache/2.4.52 (Ubuntu) Server at www.openhistoricalmap.org Port 80

Errors occurred while trying to save

In Slack and Discord, others have reported 503 errors when saving in JOSM since yesterday, but the changes appear to be uploaded OK for them.

1ec5 commented 10 months ago

The upload finally goes through if I attempt to save a few times.

Rub21 commented 10 months ago

I have not faced this issue before. One of the production containers restarted 24 hours ago and triggered the following logs:

❯ k logs production-osm-seed-web-5b9f979fc5-dddp4 --follow
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
autoprefixer: /var/www/app/assets/stylesheets/screen-rtl.r2.scss:1:28656: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
autoprefixer: /var/www/app/assets/stylesheets/screen-ltr.scss:2729:3: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
autoprefixer: /var/www/app/assets/stylesheets/common.scss:2729:3: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
I, [2023-11-10T23:30:16.434564 #28]  INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js
I, [2023-11-10T23:30:16.434889 #28]  INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js.gz
I, [2023-11-10T23:30:16.435121 #28]  INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js.br
I, [2023-11-10T23:30:16.611432 #28]  INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js
I, [2023-11-10T23:30:16.611734 #28]  INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js.gz
I, [2023-11-10T23:30:16.612027 #28]  INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js.br

real    0m32.818s
user    0m41.383s
sys 0m3.827s
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
Ignoring unknown environment variable: CGIMAP_GITSHA
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
Error: ERROR:  could not resize shared memory segment "/PostgreSQL.1105742876" to 16777216 bytes: No space left on device

Caused by: calc_relation_bbox_ways

I have been editing too, and I got the HTTP 503 error once. I tried twice, and the second attempt saved. It looks like one of the containers has the error, so only the other one has been saving successfully. I have just restarted the damaged container, and it looks like changes are being saved again.

Also, our DB still has plenty of disk space (Used/Avail = 97G/396G), so it is odd to get a "no space" error when we have enough space (see the note on shared memory below).
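
Note: the "No space left on device" here refers to PostgreSQL's dynamic shared memory segments, such as the "/PostgreSQL.1105742876" segment named in the error, which with the default posix implementation live in the container's /dev/shm rather than on the data volume, so hitting this error while the database disk has plenty of free space is consistent. A minimal sketch of the settings involved, for inspection only:

```sql
-- Settings that govern PostgreSQL's dynamic shared memory segments
-- (the "/PostgreSQL.NNNN" segments named in the error above).
SHOW dynamic_shared_memory_type;        -- 'posix' means segments are created under /dev/shm
SHOW max_parallel_workers_per_gather;   -- parallel query workers allocate such segments
SHOW work_mem;                          -- roughly bounds how large those allocations can grow
```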

@batpad, do you have any idea about this issue?

1ec5 commented 10 months ago

Still seeing this intermittently. Changeset 98,489 wound up with duplicate edits because the API kept coming back with no response or a 503 even though the edits eventually went through.

danrademacher commented 10 months ago

I was able to upload this simple edit just now with no errors or issues in iD: https://www.openhistoricalmap.org/changeset/98560

I wonder if this is an issue with editing in areas with high data density or something? Or is it intermittent without respect to the nature of the edit?

1ec5 commented 10 months ago

> I wonder if this is an issue with editing in areas with high data density or something? Or is it intermittent without respect to the nature of the edit?

It’s been intermittent in general. I tried saving very small changesets (a couple ways apiece) in places with both dense and sparse existing data.

Rub21 commented 10 months ago

It seems that someone has been working extensively with object relationships, either uploading or downloading them, though it's not yet clear which. This activity is likely the cause of the shared memory segment issue related to the calc_relation_bbox_ways function. This function is executed at: https://github.com/zerebubuth/openstreetmap-cgimap/blob/204abe4fbf0c6f95a2d26a70b0ddbf1a2584642d/src/backend/apidb/changeset_upload/relation_updater.cpp#L738-L756. Once the issue occurs, the app becomes unable to save changes.

I have already increased the shared memory size; we will see if that avoids the issue.

1ec5 commented 10 months ago

I was still running into this issue regularly last night with very small changesets that contained zero or one relation.

danrademacher commented 10 months ago

@Rub21 can you look into this more deeply? Based on the range of folks running into this with tiny changesets, I worry that throwing more resources at it won't solve the issue since it didn't the first time.

Latest report on slack from @1ec5:

> Sometimes the only thing that’s failing or timing out is the last API request to close the changeset. I’ve gotten into situations where it looks like nothing uploaded so I retry, only to upload duplicate copies of features as part of the same changeset. People are still reporting running into this in Slack and Discord, so it isn’t just me.

You mentioned that a specific action that prompts a specific SQL query might be the root cause:

> someone has been extensively working with object relationships, either uploading or downloading,

And I think you or Sanjay suggested getting @bitner to look at that query: https://github.com/zerebubuth/openstreetmap-cgimap/blob/204abe4fbf0c6f95a2d26a70b0ddbf1a2584642d/src/backend/apidb/changeset_upload/relation_updater.cpp#L738-L756

It seems odd, however, that we'd see something like Minh's experience, where only "the last API request to close the changeset" fails, unless an object-relations query from another user is tanking the API at just that moment.

bitner commented 10 months ago

Can you tell me what instance type you are using and what the following settings in the database are?

SHOW effective_cache_size;
SHOW shared_buffers;
SHOW random_page_cost;
SHOW temp_buffers;
SHOW work_mem;
SHOW max_connections;
SHOW maintenance_work_mem;

Can you also show me the result of this, but replacing the "$1" with an array of valid ids that this query might typically use:

SELECT MIN(latitude)  AS minlat,
       MIN(longitude) AS minlon,
       MAX(latitude)  AS maxlat,
       MAX(longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn
  ON cn.id = wn.node_id
INNER JOIN current_ways w
  ON wn.way_id = w.id
INNER JOIN current_relation_members
  ON current_relation_members.member_id = w.id
INNER JOIN current_relations
  ON current_relations.id = current_relation_members.relation_id
WHERE current_relations.visible = true
  AND current_relation_members.member_type = 'Way'
  AND current_relations.id = ANY($1)

1ec5 commented 10 months ago

> It seems odd, however, that we'd see something like Minh's experience, where only "the last API request to close the changeset" fails, unless an object-relations query from another user is tanking the API at just that moment.

https://www.openhistoricalmap.org/changeset/98901 is an example of such a changeset. iD got a 503 error when attempting to close the changeset, so it’s still open and iD would pile onto it if I were to hit Upload again. (Changesets automatically close an hour after the last upload.)

Rub21 commented 10 months ago

@bitner, currently the config is this one 👇

#effective_cache_size = 4GB
shared_buffers = 128MB      
#random_page_cost = 4.0         # same scale as above
#temp_buffers = 8MB         # min 800kB
#work_mem = 4MB             # min 64kB
max_connections = 100           # (change requires restart)
#maintenance_work_mem = 64MB        # min 1MB

Only shared_buffers and max_connections are enabled; the rest are commented out.

About the query, I have been trying to run it, but the error does not show up. Here $1 = Array[relation_ids], where relation_ids are the ids of the relations being downloaded or uploaded (see the sampling sketch after the query below).

-- Create a temporary table to store the array
CREATE TEMP TABLE temp_array (id integer);

-- Insert the generated series into the temporary table
INSERT INTO temp_array
SELECT generate_series(1, 1000);

-- Your SELECT query, modified to use the temporary table
SELECT MIN(cn.latitude) AS minlat,
       MIN(cn.longitude) AS minlon,
       MAX(cn.latitude) AS maxlat,
       MAX(cn.longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn ON cn.id = wn.node_id
INNER JOIN current_ways w ON wn.way_id = w.id
INNER JOIN current_relation_members ON current_relation_members.member_id = w.id
WHERE current_relation_members.member_type = 'Way'
  AND current_relation_members.relation_id = ANY(SELECT id FROM temp_array);

-- Optionally, drop the temporary table if it's no longer needed
DROP TABLE temp_array;
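
A hedged aside: instead of the generate_series placeholder above, the $1 array could be sampled from relations that were actually edited recently, so the test resembles a real upload. A sketch assuming the standard openstreetmap-website schema; the LIMIT of 50 is an arbitrary sample size:

```sql
-- Build a test array of relation ids from the most recently edited relations
-- (assumes the standard current_relations table; adjust LIMIT as needed).
SELECT array_agg(id) AS relation_ids
FROM (
    SELECT id
    FROM current_relations
    WHERE visible = true
    ORDER BY timestamp DESC
    LIMIT 50
) AS recent;
```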
mmd-osm commented 10 months ago

I'm keeping an eye on this issue, but I suppose there isn't much we can do from the cgimap side. It would be good to see an EXPLAIN for the above statement. Maybe PostgreSQL is trying to run queries in parallel, which in the end consumes too many resources in your environment (https://www.postgresql.org/docs/current/how-parallel-query-works.html); a quick way to test that is sketched below.
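
A session-local way to test that hypothesis (a sketch; it does not change the server configuration):

```sql
-- How much parallelism the planner may currently use
SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;

-- Disable parallel plans for this session only, then re-run the
-- EXPLAIN (VERBOSE, ANALYZE, BUFFERS) query given further below and
-- compare the plan and memory use against a run with parallelism enabled.
SET max_parallel_workers_per_gather = 0;
```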

bitner commented 10 months ago

@Rub21 This database instance is just using the stock defaults for Postgres settings, which are horrible. What size database instance is this (particularly, how much RAM)? How many concurrent connections do you expect to the database?

In the query I put above, I forgot to prepend it with EXPLAIN (VERBOSE, ANALYZE, BUFFERS)

You will want to find a set of ids that is typical of the query that is taking so long, not just any ids.

If your ids are [987, 543, 678], then your EXPLAIN query would be:


EXPLAIN (VERBOSE, ANALYZE, BUFFERS)
SELECT MIN(cn.latitude) AS minlat,
       MIN(cn.longitude) AS minlon,
       MAX(cn.latitude) AS maxlat,
       MAX(cn.longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn ON cn.id = wn.node_id
INNER JOIN current_ways w ON wn.way_id = w.id
INNER JOIN current_relation_members ON current_relation_members.member_id = w.id
WHERE current_relation_members.member_type = 'Way'
  AND current_relation_members.relation_id = ANY('{987, 543, 678}'::int[]);

bitner commented 10 months ago

As reference, see https://github.com/developmentseed/how/tree/main/dev/postgresql

Try starting with these values for a system of this size. I've cranked down the effective_cache_size and shared_buffers quite a bit from their normal recommendation, since this is running on a machine with a number of other processes. This assumes a maximum of about 150 connections, with some breathing room added for things like transactions left idle.

(These could all also be set in the postgresql.conf file)

ALTER SYSTEM SET effective_cache_size = '16GB';
ALTER SYSTEM SET shared_buffers = '6GB';
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET temp_buffers = '512MB';
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET max_connections = 200;
ALTER SYSTEM SET maintenance_work_mem = '256MB';

After setting those values, either in postgresql.conf or using the above statements, you will need to restart the database for them to take effect.
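
One way to confirm the new values and check which of them still require a restart (a sketch using the pg_settings view):

```sql
-- Current value of each changed setting and whether a restart is still pending.
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('effective_cache_size', 'shared_buffers', 'random_page_cost',
               'temp_buffers', 'work_mem', 'max_connections', 'maintenance_work_mem');

-- Settings that are not restart-only can be picked up with a reload instead.
SELECT pg_reload_conf();
```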

Rub21 commented 10 months ago

@1ec5 @danrademacher @batpad, we updated the database configuration with the values that Bitner suggested. I also added shared memory for the web and database containers (512 MB each). I am monitoring how this is going; so far, I have not noticed any issues related to the database or space.

1ec5 commented 9 months ago

It’s been going pretty well in general. I think it’s safe to close out this ticket if others are also unblocked. I am still seeing the API choke with 503s and timeouts on changesets that modify more than a couple boundary relations at a time – even blocking other, smaller changesets that don’t touch relations. But I think we’re tracking that in #412.

Edit: The changeset below finally did save. It just took a while to show up in history.

boundaries.osc

1ec5 commented 9 months ago

I’m starting to see these errors again.

1ec5 commented 9 months ago

30 out of 30 attempts to save this changeset result in a 503 error.

park_center.osc

1ec5 commented 9 months ago

I belatedly realized that this issue is compounded by a bug in iD: the initial changeset/create call succeeded in creating changeset 101,346, but subsequent calls to changeset/upload failed with 503 errors. Eventually, the empty changeset closed itself automatically, but iD stubbornly kept trying to changeset/upload with the closed changeset ID. Meanwhile, restarting iD caused it to lose my changes, apparently thinking they had already been uploaded. I ended up loading the changes in JOSM and saving changeset 101,364 there. I could see JOSM also struggling, but it automatically retries uploads as many times as needed.

batpad commented 9 months ago

Discussion on this (same) issue also happening over at https://github.com/OpenHistoricalMap/issues/issues/642

Rub21 commented 9 months ago

I just pushed changes to solve this issue. The main problem was that in some containers the CGIMAP process had died, which caused the API to respond with 503 errors some of the time and 200 responses at other times, because we have more than one pod serving the site. So we added a new configuration that checks whether Apache and CGIMAP are alive; if not, it restarts the pods, avoiding the 503 issues.

Rub21 commented 3 months ago

This issue has been solved. Closing here!