OpenHistoricalMap / issues

File your issues here, regardless of repo, until we get all our repos squared away; we don't want to miss anything.
Creative Commons Zero v1.0 Universal

Planet replication process gets stuck #665

Closed: Rub21 closed this issue 10 months ago

Rub21 commented 11 months ago

Planet replication files have not been generated for a month in production, and there seems to be an issue with the replication of planet files in staging. This is why Overpass cannot complete the import process.

This issue may have arisen from the changes to how cgimap saves changesets and/or from recent updates to the API, but I'm not entirely sure. Here is the log from the staging Overpass pod:

```
❯ k logs staging-osm-seed-overpass-api-0 --previous
No database directory. Initializing
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  830M  100  830M    0     0  31.9M      0  0:00:25  0:00:25 --:--:-- 32.6M
Running preprocessing command: mv /db/planet.osm.bz2 /db/planet.osm.pbf && osmium cat -o /db/planet.osm.bz2 /db/planet.osm.pbf && rm /db/planet.osm.pbf
Reading XML file ... elapsed node 246720441. /app/bin/init_osm3s.sh: line 44:    35 Broken pipe             bunzip2 < $PLANET_FILE
        36 Killed                  | $EXEC_DIR/bin/update_database --db-dir=$DB_DIR/ $META $COMPRESSION
Failed to process planet file
```

A few weeks ago, we imported a production backup into staging to run some performance tests on the database; that may be what broke replication in staging. In production, the cause is still unclear. Maybe osmosis just needs to be updated, so I have opened a ticket with osm-seed to upgrade it: https://github.com/developmentseed/osm-seed/issues/306

cc @danrademacher @batpad

mmd-osm commented 11 months ago

Regarding the staging planet files: could you make them available for download somewhere?

I suspect the file might contain some fairly large objects that cause Overpass to fail with a memory allocation error; the "Killed" message would appear under such conditions. A non-empty database directory also frequently causes issues with initial loads.
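In the staging log, bash's job report says `bunzip2` ended with "Broken pipe" and `update_database` with "Killed". Bash prints those words for processes terminated by SIGPIPE and SIGKILL, and reports death by signal N as exit status 128 + N (141 and 137 here). Read that way, `update_database` was killed first (the kernel's out-of-memory killer is a common source of SIGKILL, which fits the memory hypothesis, although the log alone does not prove it), and `bunzip2` then got SIGPIPE because nothing was reading the pipe anymore. A minimal sketch of a helper a wrapper script could use to decode such statuses; the function name is hypothetical:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status (bash reports death by signal N as 128 + N)."""
    if status > 128:
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a signal number known on this platform
    return f"exited with status {status}"
```

For example, `describe_exit_status(137)` yields `"killed by SIGKILL"` and `describe_exit_status(141)` yields `"killed by SIGPIPE"` on Linux.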

It’s easier to get a precise error message from the Overpass import when it isn’t reading data through a pipe. Very large object IDs could be an issue, as could files that are not properly sorted by type (node/way/rel) and by increasing object ID. These two may not be relevant in your case, since the process worked before, but I would still recommend validating the planet file with the osmium tools to rule out similar issues.
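The checks described above can be sketched in Python, assuming the pyosmium bindings (`import osmium`) are installed; osmium-tool itself (for example `osmium fileinfo --extended`) is the more direct route. The sketch flags objects that are not in node, way, relation order with strictly increasing IDs, and records the largest ID per type plus the largest way (node refs) and relation (members). `PlanetChecker` and `check_planet` are hypothetical names, and the strict ordering rule only applies to regular planet files, not full-history dumps where the same ID repeats across versions.

```python
from typing import Dict, Optional, Tuple

# Rank used to check the node -> way -> relation order that Overpass expects.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


class PlanetChecker:
    """Collects ordering and size statistics for a stream of OSM objects."""

    def __init__(self) -> None:
        self.last: Optional[Tuple[int, int]] = None
        self.out_of_order = 0  # objects that do not come strictly after their predecessor
        self.max_id: Dict[str, int] = {kind: 0 for kind in TYPE_RANK}
        # Largest way (node refs) and relation (members) seen so far.
        self.max_size: Dict[str, int] = {"way": 0, "relation": 0}

    def see(self, kind: str, osm_id: int, size: int = 0) -> None:
        key = (TYPE_RANK[kind], osm_id)
        if self.last is not None and key <= self.last:
            self.out_of_order += 1
        self.last = key
        self.max_id[kind] = max(self.max_id[kind], osm_id)
        if kind in self.max_size:
            self.max_size[kind] = max(self.max_size[kind], size)


def check_planet(path: str) -> PlanetChecker:
    """Scan an OSM file (e.g. planet.osm.pbf) with pyosmium and return the statistics."""
    import osmium  # pyosmium; imported here so PlanetChecker works without it

    checker = PlanetChecker()

    class Handler(osmium.SimpleHandler):
        def node(self, n):
            checker.see("node", n.id)

        def way(self, w):
            checker.see("way", w.id, len(w.nodes))

        def relation(self, r):
            checker.see("relation", r.id, len(r.members))

    Handler().apply_file(path)
    return checker
```

A non-zero `out_of_order`, or an unexpectedly huge `max_id` or `max_size` value, would point at the file rather than at Overpass.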

Which Overpass version are you using at this time?

Regarding the stuck osmosis process: have you tried triggering some stack traces? In OSM production, we use tools other than osmosis for both planet generation and minutely diffs.

batpad commented 11 months ago

Thank you @mmd-osm

So, we seem to have two problems:

batpad commented 11 months ago

+cc @geohacker

mmd-osm commented 11 months ago

> do you see any red flags with moving to using planet-dump-ng to create the planet and history dumps?

planet-dump-ng used to have some issues with very large relations that happened to have lots of versions (https://github.com/zerebubuth/planet-dump-ng/issues/25). I cannot completely rule out that, given the way objects are modeled in OHM, some other previously unknown issues with block size calculations might be triggered. I'd recommend closely monitoring planet-dump-ng runs for a while and reporting any issues upstream.

Rub21 commented 11 months ago

I think the issue with planet replication in production has been solved. It looks like the process hung when the connection to the database went down and stayed stuck there. Recent planet files are available again, for example: https://s3.amazonaws.com/planet.openhistoricalmap.org/planet/planet-240102_0000.osm.pbf, https://planet.openhistoricalmap.org/?prefix=planet/
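Since the production job appears to have hung on a lost database connection rather than failing, one option is to give each replication step a hard deadline so a hang turns into a visible error that the scheduler can retry or alert on. A minimal sketch using Python's `subprocess.run` timeout; the helper name and any timeout value are hypothetical, and this is not how osm-seed currently runs the job:

```python
import subprocess
from typing import Sequence


def run_with_deadline(cmd: Sequence[str], timeout_s: float) -> int:
    """Run one replication step and return its exit status.

    If the step is still running after timeout_s seconds, subprocess.run kills it
    and we raise TimeoutError, so the job fails visibly instead of hanging.
    """
    try:
        completed = subprocess.run(list(cmd), timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"{cmd[0]} still running after {timeout_s}s; treating it as stuck") from exc
    return completed.returncode
```

The deadline should sit comfortably above the longest normal run time of the planet dump, so that only genuinely stuck runs are killed.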

danrademacher commented 10 months ago

Note that the minutely files never stopped; only the daily full planet replication was failing. I will open a new ticket for alerting on those failures.
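An alert for the daily planet could be as simple as checking the timestamp encoded in the newest file name. A minimal sketch, assuming keys follow the `planet/planet-240102_0000.osm.pbf` pattern seen above (read here as `YYMMDD_HHMM` in UTC, an assumption), that the bucket is `planet.openhistoricalmap.org` (inferred from the path-style S3 URL), and that boto3 and read access are available; the 36-hour threshold and all function names are hypothetical:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Optional

# Assumed key pattern, taken from e.g. planet/planet-240102_0000.osm.pbf
# (YYMMDD_HHMM, interpreted here as UTC).
PLANET_KEY = re.compile(r"planet-(\d{6}_\d{4})\.osm\.pbf$")


def latest_planet_time(keys: Iterable[str]) -> Optional[datetime]:
    """Return the newest planet timestamp encoded in the object keys, if any."""
    stamps = [
        datetime.strptime(m.group(1), "%y%m%d_%H%M").replace(tzinfo=timezone.utc)
        for m in (PLANET_KEY.search(key) for key in keys)
        if m
    ]
    return max(stamps, default=None)


def planet_is_stale(keys: Iterable[str], now: datetime,
                    max_age: timedelta = timedelta(hours=36)) -> bool:
    """True if no planet file exists or the newest one is older than max_age."""
    newest = latest_planet_time(keys)
    return newest is None or now - newest > max_age


def list_planet_keys(bucket: str = "planet.openhistoricalmap.org",
                     prefix: str = "planet/") -> Iterator[str]:
    """List object keys under the prefix (needs boto3 and read access to the bucket)."""
    import boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

A scheduled job could then call `planet_is_stale(list_planet_keys(), datetime.now(timezone.utc))` and notify someone when it returns `True`; the same idea applies to the minutely diffs with a much shorter `max_age`.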