The upload finally goes through if I attempt to save a few times.
I have not faced this issue before. One of the production containers restarted 24 hours ago and triggered the following logs:
❯ k logs production-osm-seed-web-5b9f979fc5-dddp4 --follow
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
autoprefixer: /var/www/app/assets/stylesheets/screen-rtl.r2.scss:1:28656: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
autoprefixer: /var/www/app/assets/stylesheets/screen-ltr.scss:2729:3: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
autoprefixer: /var/www/app/assets/stylesheets/common.scss:2729:3: Replace color-adjust to print-color-adjust. The color-adjust shorthand is currently deprecated.
I, [2023-11-10T23:30:16.434564 #28] INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js
I, [2023-11-10T23:30:16.434889 #28] INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js.gz
I, [2023-11-10T23:30:16.435121 #28] INFO -- : Writing /var/www/public/assets/application-a8085170a383ebbad7b782225b0806a7e3307dbfe314b8dc3e108d40db56d85a.js.br
I, [2023-11-10T23:30:16.611432 #28] INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js
I, [2023-11-10T23:30:16.611734 #28] INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js.gz
I, [2023-11-10T23:30:16.612027 #28] INFO -- : Writing /var/www/public/assets/osm-fa82c3efbb4e1d41e9ef273a25eff94013c1d1d78b78a6935317a5b235e537fd.js.br
real 0m32.818s
user 0m41.383s
sys 0m3.827s
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
Ignoring unknown environment variable: CGIMAP_GITSHA
WARN: advpng none at /usr/bin/advpng (== none) is of unknown version
Error: ERROR: could not resize shared memory segment "/PostgreSQL.1105742876" to 16777216 bytes: No space left on device
Caused by: calc_relation_bbox_ways
I have edited too, and I got the HTTP 503 error once. I tried twice, and the second attempt saved. It looks like one of the containers is hitting the error, so only one of them has been saving successfully. I have just restarted the affected container, and changes appear to be saving again.
Also, our DB still has plenty of disk space (Used/Avail = 97G/396G), so it is strange to get an error about space when we have plenty available.
@batpad, do you have any idea about this issue?
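(A note on the error itself: the "No space left on device" message is almost certainly about PostgreSQL's dynamic shared memory segments, the /PostgreSQL.NNNN files that are typically backed by /dev/shm inside the container, not about the data volume, so the 97G/396G disk figure is not the limit being hit. A minimal diagnostic sketch, assuming psql access to the production database; these are the stock settings that govern how those segments get allocated:)
-- How dynamic shared memory is allocated; 'posix' means the segments live in /dev/shm
SHOW dynamic_shared_memory_type;
-- Parallel query workers are the main consumers of dynamic shared memory
SHOW max_parallel_workers_per_gather;
-- Per-operation memory budget; parallel hash joins allocate shared hash tables sized by this
SHOW work_mem;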
Still seeing this intermittently. Changeset 98,489 wound up with duplicate edits because the API kept coming back with no response or a 503 even though the edits eventually went through.
I was able to upload this simple edit just now with no errors or issues in iD, https://www.openhistoricalmap.org/changeset/98560
I wonder if this is an issue with editing in areas with high data density or something? Or is it intermittent without respect to the nature of the edit?
I wonder if this is an issue with editing in areas with high data density or something? Or is it intermittent without respect to the nature of the edit?
It’s been intermittent in general. I tried saving very small changesets (a couple ways apiece) in places with both dense and sparse existing data.
It seems that someone has been working extensively with relations, either uploading or downloading them; it is not yet clear which. That activity is likely what triggers the shared memory segment error in the calc_relation_bbox_ways function, which is executed here: https://github.com/zerebubuth/openstreetmap-cgimap/blob/204abe4fbf0c6f95a2d26a70b0ddbf1a2584642d/src/backend/apidb/changeset_upload/relation_updater.cpp#L738-L756. Once the error occurs, the app becomes unable to save changes.
I have already increased the shared memory size; we will see if that avoids the issue.
I was still running into this issue regularly last night with very small changesets that contained zero or one relation.
@Rub21 can you look into this more deeply? Based on the range of folks running into this with tiny changesets, I worry that throwing more resources at it won't solve the issue since it didn't the first time.
Latest report on Slack from @1ec5:
Sometimes the only thing that’s failing or timing out is the last API request to close the changeset. I’ve gotten into situations where it looks like nothing uploaded so I retry, only to upload duplicate copies of features as part of the same changeset. People are still reporting running into this in Slack and Discord, so it isn’t just me.
You mentioned that a specific action that prompts a specific SQL query might be the root cause:
someone has been extensively working with object relationships, either uploading or downloading,
And I think you or Sanjay suggested getting @bitner to look at that query: https://github.com/zerebubuth/openstreetmap-cgimap/blob/204abe4fbf0c6f95a2d26a70b0ddbf1a2584642d/src/backend/apidb/changeset_upload/relation_updater.cpp#L738-L756
It seems odd, however, that we'd see something like Minh's experience, where only "the last API request to close the changeset" fails, unless, I guess, an object relations query from another user is tanking the API at just that moment.
Can you tell me what instance type you are using and what the following settings in the database are?
SHOW effective_cache_size;
SHOW shared_buffers;
SHOW random_page_cost;
SHOW temp_buffers;
SHOW work_mem;
SHOW max_connections;
SHOW maintenance_work_mem;
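(As an aside, the same values can be read in a single query from the pg_settings view; a small convenience sketch, assuming psql access:)
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('effective_cache_size', 'shared_buffers', 'random_page_cost',
               'temp_buffers', 'work_mem', 'max_connections', 'maintenance_work_mem');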
Can you also show me the result of this, but replacing the "$1" with an array of valid ids that this query might typically use:
SELECT MIN(latitude) AS minlat,
MIN(longitude) AS minlon,
MAX(latitude) AS maxlat,
MAX(longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn
ON cn.id = wn.node_id
INNER JOIN current_ways w
ON wn.way_id = w.id
INNER JOIN current_relation_members
ON current_relation_members.member_id = w.id
INNER JOIN current_relations
ON current_relations.id = current_relation_members.relation_id
WHERE current_relations.visible = true
AND current_relation_members.member_type = 'Way'
AND current_relations.id = ANY($1)
It seems odd, however, that we'd see something like Minh's experience, where only "the last API request to close the changeset" fails, unless, I guess, an object relations query from another user is tanking the API at just that moment.
https://www.openhistoricalmap.org/changeset/98901 is an example of such a changeset. iD got a 503 error when attempting to close the changeset, so it’s still open and iD would pile onto it if I were to hit Upload again. (Changesets automatically close an hour after the last upload.)
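(For anyone triaging the changesets left dangling by failed close calls, a sketch of a query against the API database; it assumes direct read access and the standard openstreetmap-website schema, in which an open changeset is simply one whose closed_at timestamp is still in the future:)
-- Changesets that are still open, newest first
SELECT id, user_id, created_at, closed_at, num_changes
FROM changesets
WHERE closed_at > now()
ORDER BY created_at DESC
LIMIT 20;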
@bitner, currently the config is this one 👇
#effective_cache_size = 4GB
shared_buffers = 128MB
#random_page_cost = 4.0 # same scale as above
#temp_buffers = 8MB # min 800kB
#work_mem = 4MB # min 64kB
max_connections = 100 # (change requires restart)
#maintenance_work_mem = 64MB # min 1MB
Only shared_buffers and max_connections are enabled.
About the function, I have been trying to run it, but the error does not show up. $1 = Array[relations_ids], where relations_ids are the IDs of the relations being downloaded or uploaded.
-- Create a temporary table to store the array
CREATE TEMP TABLE temp_array (id integer);
-- Insert the generated series into the temporary table
INSERT INTO temp_array
SELECT generate_series(1, 1000);
-- Your SELECT query, modified to use the temporary table
SELECT MIN(cn.latitude) AS minlat,
MIN(cn.longitude) AS minlon,
MAX(cn.latitude) AS maxlat,
MAX(cn.longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn ON cn.id = wn.node_id
INNER JOIN current_ways w ON wn.way_id = w.id
INNER JOIN current_relation_members ON current_relation_members.member_id = w.id
WHERE current_relation_members.member_type = 'Way'
AND current_relation_members.relation_id = ANY(SELECT id FROM temp_array);
-- Optionally, drop the temporary table if it's no longer needed
DROP TABLE temp_array;
I'm keeping an eye on this issue, but I suppose there isn't much we can do from the cgimap side. It would be good to see an EXPLAIN for the above statement. Maybe PostgreSQL is trying to run queries in parallel, which ends up consuming too many resources in your environment. (https://www.postgresql.org/docs/current/how-parallel-query-works.html)
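(A cheap way to test the parallel-query theory, sketched under the assumption of psql access to the production database: disable parallel plans for one session and re-run the EXPLAIN statement bitner spells out below. If the shared memory error stops appearing, the dynamic shared memory allocated for parallel workers is the likely culprit.)
SHOW max_parallel_workers_per_gather;    -- the default is 2
SET max_parallel_workers_per_gather = 0; -- session-only: forces a non-parallel plan
-- ...then re-run the EXPLAIN (VERBOSE, ANALYZE, BUFFERS) query below with real relation IDs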
@Rub21 This database instance is just using the stock defaults for Postgres settings, which are horrible. What size database instance is this (in particular, how much RAM does it have)? How many concurrent connections do you expect there to be to the database?
In the query I put above, I forgot to prepend it with EXPLAIN (VERBOSE, ANALYZE, BUFFERS)
You will want to find a set of ids that are typical of the query that is taking you so long - not just any ids.
If your ids are [987, 543, 678], then your EXPLAIN query would be
EXPLAIN (VERBOSE, ANALYZE, BUFFERS)
SELECT MIN(cn.latitude) AS minlat,
MIN(cn.longitude) AS minlon,
MAX(cn.latitude) AS maxlat,
MAX(cn.longitude) AS maxlon
FROM current_nodes cn
INNER JOIN current_way_nodes wn ON cn.id = wn.node_id
INNER JOIN current_ways w ON wn.way_id = w.id
INNER JOIN current_relation_members ON current_relation_members.member_id = w.id
WHERE current_relation_members.member_type = 'Way'
AND current_relation_members.relation_id = ANY('{987, 543,678}'::int[]);
For reference, see https://github.com/developmentseed/how/tree/main/dev/postgresql
Try starting with the values below for this system. I've cranked down effective_cache_size and shared_buffers quite a bit from the usual recommendations, since this is running on a machine with a number of other processes. This assumes a max of 150 connections, with some breathing room for things like transactions left idle.
(These could all also be set in the postgresql.conf file)
ALTER SYSTEM SET effective_cache_size = '16GB';
ALTER SYSTEM SET shared_buffers = '6GB';
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET temp_buffers = '512MB';
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET max_connections = 200;
ALTER SYSTEM SET maintenance_work_mem = '256MB';
After setting those settings, either in postgresql.conf or using the above statements, you will need to restart the database for them to take effect.
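(Strictly speaking, of the settings in that list only shared_buffers and max_connections require a restart; the others take effect on a reload. A small sketch for checking what is still pending, assuming psql access:)
SELECT pg_reload_conf();  -- applies the reload-able settings immediately
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('effective_cache_size', 'shared_buffers', 'random_page_cost',
               'temp_buffers', 'work_mem', 'max_connections', 'maintenance_work_mem');
-- rows with pending_restart = true still need the full restart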
@1ec5 @danrademacher @batpad, we updated the database configuration with the values that Bitner suggested. I also added shared memory for the web and database containers (512MB each). I am monitoring how this is going; so far, I have not noticed any issues related to the database or space.
It’s been going pretty well in general. I think it’s safe to close out this ticket if others are also unblocked. I am still seeing the API choke with 503s and timeouts on changesets that modify more than a couple boundary relations at a time – even blocking other, smaller changesets that don’t touch relations. But I think we’re tracking that in #412.
Edit: The changeset below finally did save. It just took a while to show up in history.
I’m starting to see these errors again.
30 out of 30 attempts to save this changeset result in a 503 error.
I belatedly realized that this issue is compounded by a bug in iD: the initial changeset/create call succeeded in creating changeset 101,346, but subsequent calls to changeset/upload failed with 503 errors. Eventually, the empty changeset closed itself automatically, but iD stubbornly kept trying to changeset/upload with the closed changeset ID. Meanwhile, restarting iD caused it to lose my changes, apparently thinking they had already been uploaded. I ended up loading the changes in JOSM and saving changeset 101,364 there. I could see JOSM also struggling, but it automatically retries uploads as many times as needed.
Discussion of this (same) issue is also happening over at https://github.com/OpenHistoricalMap/issues/issues/642
I just pushed changes to address this issue. The main problem was that in some containers the CGIMAP process had died, which caused the API to respond with 503 errors some of the time and with 200 at other times, because we have more than one container serving the site. So we added a new configuration that checks whether Apache and CGIMAP are alive; if they aren't, it restarts the pod, avoiding the 503 errors.
This issue has been solved. Closing here!
Saving any changeset within iD triggers an HTTP 503 error, and no changeset is saved.
In Slack and Discord, others have reported 503 errors when saving in JOSM since yesterday, but the changes appear to be uploaded OK for them.