Regression: overpass db update is damaged

vshcherb commented 4 years ago

Since yesterday, overpass (osm-3s_v0.7.55) has been failing with errors like:

new_implicit_skeletons: Node 7804405626 used in way 836211560 not found.

https://builder.osmand.net:8080/job/OsmLive_FetchAndUpdateOverpass/lastBuild/console

Cloning a new database doesn't help.

Full log https://builder.osmand.net:8080/job/OsmLive_FetchAndUpdateOverpass/lastBuild/console

mmd-osm commented 4 years ago

This might be caused by an upstream issue, i.e. incomplete minutely diff files. Needs more investigation.

Relevant files: https://planet.osm.org/replication/minute/004/146/

[   ] 698.osc.gz               2020-08-11 17:46  179K  
[TXT] 697.state.txt            2020-08-11 17:44  158   
[   ] 697.osc.gz               2020-08-11 17:44  1.2M  
[TXT] 696.state.txt            2020-08-11 17:44  159   
[   ] 696.osc.gz               2020-08-11 17:44  1.1M  
[TXT] 695.state.txt            2020-08-11 17:44  159   
[   ] 695.osc.gz               2020-08-11 17:44  1.3M  
[TXT] 694.state.txt            2020-08-11 17:44  159   
[   ] 694.osc.gz               2020-08-11 17:44  1.1M         
[TXT] 693.state.txt            2020-08-11 17:43  159   
[   ] 693.osc.gz               2020-08-11 17:43  920K     <<<<< replication continued after 1:23 interruption
[TXT] 692.state.txt            2020-08-11 16:20  158   
[   ] 692.osc.gz               2020-08-11 16:20   87K

mmd-osm commented 4 years ago

OK, I have identified the issue: minutely diff file 004/146/693.osc.gz has been written twice (!) by osmosis.

Version 1: 693.osc.gz  Aug 11 16:38  24315 bytes
Version 2: 693.osc.gz  2020-08-11 17:43  920K

Overpass picked up the first file, and now lacks most of this minutely diff. Subsequent updates fail, and eventually the update crashes due to missing nodes.

This is not the first time this has happened. For recovery, you need to delete or rename that faulty osc.gz file, then go back to a clone from a previous day and start reapplying diffs.

vshcherb commented 4 years ago

Could you please make a new database available for cloning, so that we could at least start from a clone taken after that changeset?

tomhughes commented 4 years ago

The first one never completed properly - the diff generator locked up before it had installed the new state.txt file so after I unblocked things it was regenerated.

The problem is that overpass apparently just fetches the diffs without first checking state.txt to see what the current limit is, unlike other diff consumers such as pyosmium.

drolbr commented 4 years ago

I have just made yesterday's database available at the usual place: https://dev.overpass-api.de/clone/ This is the most recent copy I have at the moment.

drolbr commented 4 years ago

@tomhughes First of all, thank you for the quick response. What should I look for in the file as an indicator of failure?

I received https://planet.openstreetmap.org/replication/minute//004/146/693.state.txt at 17:53:00 UTC, and I had the same content as the file has now.

timestamp=2020-08-11T16\:31\:44Z

is pretty credible, and the other data fields are empty or look like PostgreSQL internal fields.
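For reference, the state file is a plain key=value file in which colons in values are escaped with a backslash, as in the timestamp line above. A minimal sketch of reading it, assuming the file also carries a sequenceNumber field as mentioned later in this thread; the sample text and the helper names parse_state and state_timestamp are illustrative only and are not part of Overpass:

```python
from datetime import datetime, timezone

def parse_state(text):
    """Parse key=value state text into a dict, unescaping '\\:' to ':'."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, _, value = line.partition("=")
        fields[key] = value.replace("\\:", ":")
    return fields

def state_timestamp(fields):
    """Return the timestamp field as a timezone-aware UTC datetime."""
    parsed = datetime.strptime(fields["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

# Illustrative sample only: the sequence number is derived from the path
# 004/146/693 and the timestamp is the one quoted above.
SAMPLE = (
    "#illustrative sample\n"
    "sequenceNumber=4146693\n"
    "timestamp=2020-08-11T16\\:31\\:44Z\n"
)

state = parse_state(SAMPLE)
print(int(state["sequenceNumber"]), state_timestamp(state).isoformat())
```

As the following replies explain, the content of the per-diff state file alone does not reveal whether the diff was regenerated; the top-level state.txt is the reliable indicator.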

File timestamps are

-rw-rw-r-- 1 roland roland 941887 Aug 11 17:43 693.osc.gz
-rw-rw-r-- 1 roland roland  24315 Aug 11 16:38 693.osc.suspect.gz
-rw-rw-r-- 1 roland roland    159 Aug 11 17:43 693.state.txt
-rw-rw-r-- 1 roland roland    159 Aug 11 17:43 693.state.txt.suspect

where the files marked suspect are the broken ones.

tomhughes commented 4 years ago

Well, that timestamp would be from the new version: it locked up at about 16:20 UTC and restarted at about 17:43 UTC. That's shown by your timestamp for the suspect version of 693, which is much earlier than either state.txt timestamp.

What pyosmium seems to do (and I think osmosis too) is to fetch https://planet.openstreetmap.org/replication/minute/state.txt and get the sequence number from it, then fetch diffs up to that sequence.
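A rough sketch of that idea, not pyosmium's actual code: given the text of the top-level state.txt and the last sequence number applied locally, only sequence numbers up to and including the published one are eligible for fetching. The function names are hypothetical and the HTTP download of state.txt is left out.

```python
def published_sequence(state_text):
    """Extract sequenceNumber from the text of the top-level state.txt."""
    for line in state_text.splitlines():
        if line.startswith("sequenceNumber="):
            return int(line.split("=", 1)[1])
    raise ValueError("no sequenceNumber in state text")

def sequences_to_fetch(last_applied, state_text):
    """List the sequence numbers after last_applied that are fully published."""
    published = published_sequence(state_text)
    # range() is empty when nothing new has been published yet
    return list(range(last_applied + 1, published + 1))

# During the lock-up the top-level state still pointed at 692,
# so 693 would not have been requested.
print(sequences_to_fetch(4146691, "sequenceNumber=4146692\n"))
print(sequences_to_fetch(4146692, "sequenceNumber=4146692\n"))
print(sequences_to_fetch(4146692, "sequenceNumber=4146693\n"))
```

A consumer written this way waits instead of picking up a diff that exists on disk but has not been committed yet.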

tomhughes commented 4 years ago

What the generation process does, you see, is put the NNN.osc.gz file in place, then the NNN.state.txt file (it locked up at that point last night, with that file still called 693.state.txt.tmp), and then update the top-level state.txt, which becomes the starting point for the next run.

So if the top-level one doesn't update, the next run will generate a new version, and by using the top-level state.txt as the key you make sure you only fetch things that are fully committed.

vshcherb commented 4 years ago

Thanks for the quick response. Will yesterday's db version of Overpass manage to update without installing a patch to the source code?

mmd-osm commented 4 years ago

@vshcherb: A patch would be needed to correct the overall issue. If you can't wait, be sure to remove the old 693.osc.gz with 24315 bytes first.

Both fetch_osc.sh (fetch_minute_diff) and fetch_osc_and_apply.sh (collect_minute_diffs) would need some adjustments: take https://planet.openstreetmap.org/replication/minute/state.txt into account, extract the sequenceNumber, and only if it matches the expected value continue with fetching the osc.gz and NNN.state.txt files, as @tomhughes has outlined earlier.

The overall process is now documented in https://wiki.openstreetmap.org/w/index.php?title=Planet.osm/diffs&curid=34107&diff=2021319&oldid=1887135
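To illustrate the kind of check meant here, the following is a sketch in Python rather than a patch to the shell scripts. It assumes the sequence number is zero-padded to nine digits and split into three directory levels, which matches the 004/146/693 path in this thread. The helper names are hypothetical, and "matches" is read loosely as "the global sequence number has reached the expected one".

```python
BASE_URL = "https://planet.openstreetmap.org/replication/minute"

def diff_urls(seq, base_url=BASE_URL):
    """Return the (osc.gz, state.txt) URLs for one sequence number."""
    padded = f"{seq:09d}"  # e.g. 4146693 -> "004146693"
    path = f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}"
    return f"{base_url}/{path}.osc.gz", f"{base_url}/{path}.state.txt"

def urls_if_committed(expected_seq, global_seq):
    """Return the URLs only if the global state.txt has reached expected_seq."""
    if global_seq < expected_seq:
        return None  # not committed yet: do not descend to the diff files
    return diff_urls(expected_seq)

print(urls_if_committed(4146693, 4146692))
print(urls_if_committed(4146693, 4146693))
```

The first call returns None because the global state has not reached 693 yet; the second returns the two URLs under 004/146/.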

tomhughes commented 4 years ago

Yes, and after stracing osmosis to see what it does: it actually writes directly to the osc file, so fetching that without checking state could in fact get you a partial file if you were unlucky. The state files are written to a tmp name and atomically renamed into place, but the osc isn't.
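For readers unfamiliar with the pattern, this is a generic sketch of "write to a temporary name, then rename into place". It is not osmosis code; the function name and the file content are made up for illustration. A reader of the final path sees either the old file or the complete new one, never a half-written file.

```python
import os
import tempfile

def write_atomically(path, data):
    """Write data to path + '.tmp', then rename it over path in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk first
    os.replace(tmp_path, path)  # atomic when both names are on one filesystem

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "693.state.txt")
    write_atomically(target, b"sequenceNumber=4146693\n")
    print(sorted(os.listdir(workdir)))
```

A file written directly under its final name has no such guarantee, which is why a partially written osc file can be visible to a client.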

mmd-osm commented 4 years ago

I think the current process would always check the local NNN.state.txt file before fetching an osc.gz file. It just doesn’t check the global state.txt contents first before descending to the actual minutely diff files.

vshcherb commented 4 years ago

I will probably wait for a new database dump, if it happens tomorrow.

tomhughes commented 4 years ago

Well that should be safe - in this case 693.state.txt never existed until after the osc had been regenerated. The steps osmosis does are:

1. Create NNN.osc.gz
2. Create NNN.state.txt.tmp
3. Rename NNN.state.txt.tmp to NNN.state.txt
4. Create state.txt.tmp at root
5. Rename state.txt.tmp to state.txt

Yesterday osmosis got stuck between steps 2 and 3, and after an hour I killed and restarted it. That caused a new version of 693 to be generated, and only after that was the state.txt for it published.
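The effect of an interruption can be seen with a small simulation of the five steps listed above. This is a toy model with placeholder file contents, not a description of osmosis internals beyond those steps; it shows which files a client could see if the generator stopped after a given number of completed steps.

```python
# Each step is (kind, first, second):
#   ("create", name, content) or ("rename", old_name, new_name)
STEPS = [
    ("create", "NNN.osc.gz", "diff N"),
    ("create", "NNN.state.txt.tmp", "state N"),
    ("rename", "NNN.state.txt.tmp", "NNN.state.txt"),
    ("create", "state.txt.tmp", "sequence N"),
    ("rename", "state.txt.tmp", "state.txt"),
]

def files_after(completed):
    """Return the visible files after the first `completed` steps have run."""
    files = {"state.txt": "sequence N-1"}  # top-level state from the last run
    for kind, first, second in STEPS[:completed]:
        if kind == "create":
            files[first] = second
        else:
            files[second] = files.pop(first)
    return files

for completed in range(len(STEPS) + 1):
    print(completed, sorted(files_after(completed).items()))
```

After two completed steps the osc file is visible while NNN.state.txt does not exist and the top-level state.txt still names the previous sequence, which is the situation described for 693.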

drolbr commented 4 years ago

A new database dump is now available, currently at https://dev.overpass-api.de/clone//2020-08-13. As always, the entry https://dev.overpass-api.de/clone/latest_dir points to it, and bin/download_clone.sh should pick it.

I made a new release to ensure that apply_osc_to_db.sh follows Tom's advice.

drolbr commented 4 years ago

> I think the current process would always check the local NNN.state.txt file before fetching an osc.gz file. It just doesn’t check the global state.txt contents first before descending to the actual minutely diff files.

The old update process (until yesterday) did not process the XXX.osc.gz before the XXX.state.txt became available, but it did download it before the XXX.state.txt. This has now been changed.

mmd-osm commented 4 years ago

I didn't find anything related to https://planet.openstreetmap.org/replication/minute/state.txt in that new release. Did I miss that part? Did you add that additional file to the overall processing?

So even if you downloaded those two files in the correct order, osmosis might still overwrite both, as in the case where the global state.txt file hasn't been updated. This again will result in data loss.

This is in line with what @tomhughes added to the Wiki page:

Fetching diff files

To fetch changes you should first find the current sequence number by fetching the state for the feed which can be found in the following location:

https://planet.openstreetmap.org/replication/[day|hour|minute]/state.txt

The sequence number should then be extracted from the state and all the required diff files up to and including that sequence number can then be fetched using the naming scheme in the previous section.

Under no circumstances should you attempt to just fetch diffs by incrementing the sequence number as incomplete diffs may be present beyond the one identified in the state file.

drolbr / Overpass-API

Regression: overpass db update is damaged #591