Planet replication process gets stuck

Rub21 commented 11 months ago

Planet replication files have not been generated for a month in production, and there seems to be an issue with the replication of planet files in staging. This is why Overpass cannot complete the import process.

This issue seems to have arisen with changes to cgimap's changeset saving and/or recent updates to the API, but I'm not entirely sure:

Production: Planet replication gets stuck in the replication process without generating the full planet file.
Staging: Planet replication files are being generated, but there is an issue with Overpass when it tries to import the planet replication file 👇

❯ k logs staging-osm-seed-overpass-api-0 --previous
No database directory. Initializing
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  830M  100  830M    0     0  31.9M      0  0:00:25  0:00:25 --:--:-- 32.6M
Running preprocessing command: mv /db/planet.osm.bz2 /db/planet.osm.pbf && osmium cat -o /db/planet.osm.bz2 /db/planet.osm.pbf && rm /db/planet.osm.pbf
Reading XML file ... elapsed node 246720441. /app/bin/init_osm3s.sh: line 44:    35 Broken pipe             bunzip2 < $PLANET_FILE
        36 Killed                  | $EXEC_DIR/bin/update_database --db-dir=$DB_DIR/ $META $COMPRESSION
Failed to process planet file

A few weeks ago, we imported a production backup into staging to conduct some performance tests on the database. Perhaps this could be the problem with staging. However, in the case of production, it's still not clear. Maybe it just needs an update to the osmosis version. I have opened a ticket with osm-seed to upgrade the version of osmosis. https://github.com/developmentseed/osm-seed/issues/306

cc. @danrademacher @batpad

mmd-osm commented 11 months ago

Regarding staging planet files: can you make them available for download somewhere, maybe?

I suspect it might contain some fairly large objects that cause Overpass to fail with a memory allocation error. The "Killed" message in your log would appear under such conditions. A non-empty database directory also frequently causes issues with initial loads.

It's easier to get a precise error message for the Overpass import when it's not reading data through a pipe. Very large object ids could be an issue, as could files that are not properly sorted by node/way/relation and by increasing object id. The latter two may not be relevant in your case, since the process worked before, but I would still recommend validating the planet file using osmium tools to rule out similar issues.
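To make the ordering requirement concrete, here is a minimal Python sketch of the check, assuming you already have a stream of `(type, id)` pairs from whatever reader you use. The function name `first_out_of_order` and the `TYPE_RANK` table are hypothetical and not part of osmium or Overpass; in practice a dedicated tool would do this validation.

```python
from typing import Iterable, Optional, Tuple

# Conventional OSM file order: all nodes, then all ways, then all relations.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


def first_out_of_order(
    objects: Iterable[Tuple[str, int]]
) -> Optional[Tuple[str, int]]:
    """Return the first (type, id) pair that breaks type-then-id order.

    Returns None if the whole stream is ordered. Repeated ids are allowed,
    since history files carry several versions of the same object.
    """
    previous = None
    for obj_type, obj_id in objects:
        key = (TYPE_RANK[obj_type], obj_id)
        if previous is not None and key < previous:
            return (obj_type, obj_id)
        previous = key
    return None
```

For example, `first_out_of_order([("node", 1), ("way", 3), ("node", 2)])` returns `("node", 2)`, because a node appears after a way.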

Which Overpass version are you using at this time?

Regarding the stuck osmosis process: have you tried to trigger some stack traces? In OSM production we're using tools other than osmosis, both for planet generation and for minutely diffs.

batpad commented 11 months ago

Thank you @mmd-osm

So, we seem to have two problems:

1. On staging, Overpass is not updating / doing a clean import. I'm fairly certain we can figure out how to debug that. It's also possible that we just have much smaller resource allocations set up for the staging Overpass, which we'd now need to bump up. So the next actions here:
   - Share a link to a planet dump from staging on this ticket + details about Overpass version, etc.
   - See if we're able to get better logs / error messages from Overpass
   - Confirm we have an empty directory to start / make sure it's nothing weird with our EBS setup
2. On production, generating the full planet is failing. @Rub21 to confirm: on prod there are no issues with minutely diffs or Overpass updates, it's just generating the full planet using osmosis that's failing? Here, it sounds to me like it might make sense to move to using planet-dump-ng to generate the full planet and history. This is probably a bit of work. As I understand it, planet-dump-ng does not talk directly to the database but works from a db dump file, which does seem better but will involve a few changes in how we generate the planet. The OSM prod chef config is here.

@mmd-osm do you see any red flags with moving to planet-dump-ng to create the planet and history dumps? If not, I'd prefer going that route over trying to debug or upgrade osmosis.

batpad commented 11 months ago

+cc @geohacker

mmd-osm commented 11 months ago

> do you see any red flags with moving to planet-dump-ng to create the planet and history dumps?

planet-dump-ng used to have some issues with very large relations that happened to have lots of versions (https://github.com/zerebubuth/planet-dump-ng/issues/25). Given the way objects are modeled in OHM, I cannot completely rule out that some other, previously unknown issues with block size calculations might be triggered. I'd recommend closely monitoring planet-dump-ng runs for a while and reporting any issues upstream.

Rub21 commented 11 months ago

I think the issue with planet replication in production has been solved. It looks like the process got stuck when the connection to the database went down and never recovered. Recent planet files are now available, for example: https://s3.amazonaws.com/planet.openhistoricalmap.org/planet/planet-240102_0000.osm.pbf, https://planet.openhistoricalmap.org/?prefix=planet/

danrademacher commented 10 months ago

Noted that the minutely files never stopped, but the daily full Planet replication was failing. Will make a new ticket for alerting on those.
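As a starting point for that alerting, here is a rough Python sketch of a staleness check on the daily planet files. It assumes the file names follow the `planet-YYMMDD_HHMM.osm.pbf` pattern seen in the link above, that the timestamps can be compared as naive datetimes, and that 36 hours is an acceptable threshold. The function names and the threshold are hypothetical, and listing the bucket keys is left to whatever client is already in use.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumed naming pattern, inferred from planet-240102_0000.osm.pbf.
PLANET_NAME = re.compile(r"planet-(\d{6}_\d{4})\.osm\.pbf$")


def newest_planet_time(keys: Iterable[str]) -> Optional[datetime]:
    """Return the timestamp of the most recent planet file, or None."""
    times = []
    for key in keys:
        match = PLANET_NAME.search(key)
        if not match:
            continue  # Skip state files, diffs, and anything else.
        try:
            times.append(datetime.strptime(match.group(1), "%y%m%d_%H%M"))
        except ValueError:
            continue  # Skip names whose digits are not a valid date.
    return max(times) if times else None


def planet_is_stale(
    keys: Iterable[str],
    now: datetime,
    max_age: timedelta = timedelta(hours=36),  # Hypothetical threshold.
) -> bool:
    """True if there is no planet file at all or the newest one is too old."""
    newest = newest_planet_time(keys)
    return newest is None or now - newest > max_age
```

A scheduled job could call `planet_is_stale` with the current key listing and raise an alert when it returns `True`, which would have caught the month-long gap while the minutely files kept flowing.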

Planet replication process get stuck #665