drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0

Performance regression in 0.7.59 #680

Open lonvia opened 1 year ago

lonvia commented 1 year ago

I've updated the OSMF query server from 0.7.57.1 to 0.7.59. The server is running Overpass without attic and meta but with minutely updates for data and areas. This worked fine for 0.7.57.1. On 0.7.59 we now have trouble keeping up with the updates, see https://prometheus.openstreetmap.org/d/_MGuv5SVk/overpass?orgId=1&refresh=1m.

Note that the server is on spinning disks, so rather sensitive to increased IO. I presume that IO is the problem.

lonvia commented 1 year ago

Quoting @mmd-osm from https://github.com/openstreetmap/chef/commit/87ae5946a0509b5ce8a11b8a9150567ed98e76f3#commitcomment-88396649:

The other significant change was that some index files are now also mmap'ed, rather than read from disk and then cached for the duration of a query (I added a second commit to the list above). While this has some advantages for very short queries (<200ms), the new code apparently changes I/O patterns quite a bit.

The interesting question is if the new IO patterns are just an issue with spinning disks. If so, I'm happy for you to close this issue as a "won't fix".

mmd-osm commented 1 year ago

I forgot one other change in the list, which we've previously discussed by email. The block size for ways has changed from 512KB to 128KB. This had some positive effects on NVMe (about 30% according to Roland). —> not relevant, change happened in earlier release.

After raising the increased I/O as an issue a couple of weeks ago, Roland did an analysis on both production instances. The most likely reason seemed to be frequent "Pokemon" queries with heavy attic activity. The other metrics seemed to be somewhat in line with the increased throughput.

Given that gorwen doesn't have attic data and runs completely different queries, it wasn't immediately clear whether this would cause issues for instances running on hard disks. I was a bit skeptical, but didn't have any hard data to prove my point.

drolbr commented 1 year ago

The change in the block size is something that one needs to actively trigger. So I would be surprised if this comes into play here. With a command like

for i in $DB_DIR/*.bin.idx; do { echo $i; hexdump -C <$i | head -n 1; }; done

you can check what block size is used for the database: it is 2^s, where s is the sum of the fifth and sixth byte (10 03 in hex makes a sum of 19, thus 2^19 = 512k is the block size). It is 512k on z.overpass-api.de, as opposed to 128k for nodes.bin and ways.bin on lz4.overpass-api.de.
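
For illustration, a minimal standalone sketch (not part of the Overpass codebase) that decodes the block size from the header bytes; it only assumes the layout described above, i.e. block size = 2^(fifth byte + sixth byte):

```cpp
// Hypothetical helper, not shipped with Overpass: print the block size of a
// *.bin.idx file, assuming block size = 2^(5th byte + 6th byte) as described above.
#include <cstdint>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[])
{
  if (argc < 2)
  {
    std::cerr << "Usage: " << argv[0] << " FILE.bin.idx\n";
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  unsigned char header[8] = {};
  if (!in.read(reinterpret_cast<char*>(header), sizeof(header)))
  {
    std::cerr << "Could not read the header of " << argv[1] << "\n";
    return 1;
  }
  // e.g. 0x10 + 0x03 = 19 -> 2^19 = 512k;  0x0e + 0x03 = 17 -> 2^17 = 128k
  uint64_t block_size = uint64_t(1) << (header[4] + header[5]);
  std::cout << argv[1] << ": block size " << (block_size / 1024) << "k\n";
  return 0;
}
```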

Unfortunately, the code concerned with writing files has also not changed, so it is a surprise if updates are significantly slower. On both lz4 and z the update processes are idle for about 20000 to 50000 seconds per day, so there is no significant change in update speed on the two NVMe machines.

I leave this issue open for the time being, and I will try to rerun a back-to-back test on an HDD machine when I have the time to do so (e.g. on dev.overpass-api.de).

mmd-osm commented 1 year ago

The change in the block size is something that one needs to actively trigger.

I think the database was reloaded from scratch using 0.7.59. The previous 0.7.57 database was facing some inconsistent index errors when 0.7.59 was deployed initially.

so it is a surprise if updates are significantly slower.

This could be a side effect of an I/O system which is already near 100% utilization. Presumably this is caused by read activity, which then also impacts the minutely updates at some point.

(Screenshot: disk utilization on gorwen, see https://munin.openstreetmap.org/openstreetmap.org/gorwen.openstreetmap.org/diskstats_utilization/index.html)

lonvia commented 1 year ago

Yes, I've recreated the database (which also felt like it was taking much longer but I have no numbers to back up that claim). If I read the hexdumps right, then the block size is 512k:

lonvia@gorwen:/srv/query.openstreetmap.org/db$ for i in *.bin.idx; do { echo $i; hexdump -C <$i | head -n 1; }; done
area_blocks.bin.idx
00000000  88 1d 00 00 10 03 02 00  39 25 00 00 03 00 00 00  |........9%......|
areas.bin.idx
00000000  88 1d 00 00 12 03 02 00  7d 00 00 00 01 00 00 00  |........}.......|
area_tags_global.bin.idx
00000000  88 1d 00 00 10 03 02 00  7e 7c 01 00 04 00 00 00  |........~|......|
area_tags_local.bin.idx
00000000  88 1d 00 00 0f 03 02 00  a3 0c 00 00 01 00 00 00  |................|
node_keys.bin.idx
00000000  88 1d 00 00 10 03 00 00  08 00 00 00 05 00 00 00  |................|
nodes.bin.idx
00000000  88 1d 00 00 0e 03 00 00  2f 04 00 00 08 00 00 00  |......../.......|
node_tags_global.bin.idx
00000000  88 1d 00 00 0e 03 00 00  f8 02 00 00 02 00 00 00  |................|
node_tags_local.bin.idx
00000000  88 1d 00 00 0e 03 00 00  00 00 00 00 08 00 00 00  |................|
relation_keys.bin.idx
00000000  88 1d 00 00 10 03 00 00  00 00 00 00 07 00 00 00  |................|
relation_roles.bin.idx
00000000  88 1d 00 00 10 03 00 00  00 00 00 00 05 00 00 00  |................|
relations.bin.idx
00000000  88 1d 00 00 10 03 00 00  64 1d 00 00 01 00 00 00  |........d.......|
relation_tags_global.bin.idx
00000000  88 1d 00 00 0e 03 00 00  da 04 00 00 07 00 00 00  |................|
relation_tags_local.bin.idx
00000000  88 1d 00 00 0e 03 00 00  a7 02 00 00 05 00 00 00  |................|
way_keys.bin.idx
00000000  88 1d 00 00 10 03 00 00  08 00 00 00 05 00 00 00  |................|
ways.bin.idx
00000000  88 1d 00 00 0e 03 00 00  7b 03 00 00 08 00 00 00  |........{.......|
way_tags_global.bin.idx
00000000  88 1d 00 00 0e 03 00 00  ec 11 00 00 07 00 00 00  |................|
way_tags_local.bin.idx
00000000  88 1d 00 00 0e 03 00 00  c0 01 00 00 08 00 00 00  |................|

Note that it's not necessarily the updates which are at fault. We were already close to IO max capacity before the reimport. If queries are a bit more IO intensive, updates would start to suffer.

mmd-osm commented 1 year ago

Block sizes differ a bit by file type. The relevant change here should be for the ways.bin.idx file, which looks like it's 14 + 3 --> 2^17 = 128k ...

This should be in line with the following line in the code: https://github.com/drolbr/Overpass-API/blob/osm3s-v0.7.59/src/overpass_api/core/settings.cc#L130

WAYS(new OSM_File_Properties< Uint31_Index >("ways", 128*1024, 256*1024)),

I don't recall when exactly gorwen was loaded the last time (again, from scratch), but most likely we used 128k back then, as the change to the block size happened back in 2019.

drolbr commented 1 year ago

Thank you for the data on gorwen. Well, it is expected that 128k blocks on a HDD will strain the disk. The size of 512k was found to be the optimum with HDDs a couple of years ago.

It is in principle possible to transcode the file on the fly; the only problem with the current process is the awkward step of changing the settings in the source code. If you are interested in transcoding on gorwen, then I will prepare a patch release so that you can do that with a switch on the command line of the cloning tool, to avoid changing the source code on the live system.

mmd-osm commented 1 year ago

NB: Although I brought up the block size topic, gorwen was running on 0.7.57.1 before, which already had a 128k block size for ways. Every time the Overpass version was updated, the database was loaded again fresh. We're not carrying over the larger block size settings in the idx files from a pre-0.7.57 release.

So the block size can't be the root cause of the performance issues we're seeing now.

What else has changed in 0.7.59? Direct reading from disk using mmap for idx and data comes to mind.


For illustration purposes, and because the stats will only be around for a few weeks, I'm posting a few screenshots of how the database lag looked before/after doing the update on Oct 24.

(Screenshot: database lag before/after the Oct 24 update.)

October 22 + 23 are most relevant here, as both dates clearly show that the restarted area creation doesn't have an impact on db lag.

...and before/after the upgrade in one diagram:

(Screenshot.)

drolbr commented 1 year ago

The intended result of this issue should be a general patch version that helps gorwen, instructions on how to best configure a magnetic-disk system, or both. The first step for that is to figure out what is plausible and what is not. For this, here is a little table of what munin has measured. I've chosen monthly averages, for the sake of simplicity, for the older versions that are still visible in munin, and weekly averages for 0.7.59. "gorw" is gorwen. Disk usage for z and lz4 is the sum of the two SSDs for each instance:

                        req/s OPS/s MB/s
z    0.7.57.1  Dec 2021  19    500    85
lz4  0.7.57.1  Dec 2021  20    380    60
z    0.7.58    Sep 2022  21    480    80
lz4  0.7.58    Sep 2022  18    530    68
gorw 0.7.57.1  Sep 2022   0.7  150    12
z    0.7.59   last week  22    959   114
lz4  0.7.59   last week  23   1890   210
gorw 0.7.59   last week   0.7  350    37

What we see is that z and lz4 differ largely on 0.7.59, and their main difference is that lz4 has nodes.bin and ways.bin on 128k while everything else is 512k; some files have even bigger blocks. The huge difference justifies the working hypothesis that for smaller blocks the actual disk usage increases. This is expected for the number of accesses per second, but not for the transferred amount of data. To make sure that we see the actual disk usage, I've also investigated the app's point of view as well as possible:

ls -l /opt/osm3s/db/*.bin.idx | awk '{ sum += $5; } END { print sum/1024; }'

and a similar command for the lz4 instance show that the sum of the sizes of the index files is about 46 MB on z and 67 MB on lz4. Based on 22 resp. 23 requests per second, this results in 1.0 GB/s requested on z and 1.5 GB/s requested on lz4 for the index files. Given that the actual disk throughput is much lower, these files must mostly be served from the cache.

Likewise, a further investigation with

for i in logs/transactions.2022-10-*; do { echo "$(basename $i .log.gz)  $(gunzip <$i | grep -E 'read_finished.. start' | awk '{ split(substr($0,1,200),a," "); sum += a[7]; } END { print sum/86400; }')"; }; done

yields that there are on average 3000 block reads per second on z for both 0.7.58 and 0.7.59, equating to a further 1.5 GB/s delivered to the app, apparently again mostly from cache. I have not checked that for 0.7.57.1. For lz4, already on 0.7.58 the average is 3900 block reads per second for unknown reasons; this is about 2.0 GB/s. With the transition to 0.7.59 this changed to 5500 block reads per second, for about 30% more processed requests. With the smaller block sizes this is expected to be about 1.4 GB/s.

So the second hypothesis is that there has been no significant change between 0.7.57.1 and 0.7.58, and that anything relevant happened between 0.7.58 and 0.7.59. This is a moderate increase in the amount of data actually read from disk and an increase by a factor of 2 in accesses per second.

This allows for the conclusion that the updates are impeded by the read accesses and that there might simply be no root cause in the update code itself. This aligns well with the fact that there have not been material changes to the code of the update routine.

As for the reason why we see so much worse caching behaviour, I do not have an idea so far. Note that, again, on the app side there has been little change in what data the app sees. My best guess so far is that with read() the OS always reads the block in one go, while with mmap the OS might read the block in chunks, for whatever reason.

A good next step could be to run the 0.7.58 interpreter binary alongside the rest of the 0.7.59 version. If the hypotheses here are right, then this should ease the situation, although it is for sure not a long-term solution. However, it would tell us whether we are following the right trail.

lonvia commented 1 year ago

I've compared the old and new database on gorwen and they use exactly the same block sizes and have approximately the same total sizes. So block size is not the issue here. For completeness: both DBs use lz4 compression.

It might well be that your changes to the query code have changed the access patterns in a way that the active working set of the database now exceeds the 96GB of RAM that gorwen has available. With HDDs, performance is highly dependent on Linux being able to hold the right pages in its buffer cache. If the query code now mmaps slightly more of the file than what was read before, then Linux might find itself obliged to drop more pages and reread them from disk. There is an increase in the number of reads on gorwen, while writes are down a bit:

(Munin graph: sda disk statistics, pinpoint 1666370550,1667421930.)

(Compare before Oct 23rd with after October 30th.)

I'd rather not experiment with running two different versions of overpass at the same time on the production machine, given how that went last time.

And to repeat what I said before: I don't think it's worth optimising overpass for HDDs in this day and age. If more IO does lead to faster queries on SSD/NVME-based systems, then that's fine. I just worry that the performance degradation is present on those systems, too, and you just don't see it because there is enough spare capacity.

drolbr commented 1 year ago

It turned out that with an extra syscall, posix_madvise, the OS can be directed to perform the same disk caching strategy as before. Therefore, I highly recommend installing 0.7.59.1 over 0.7.59 on gorwen to improve the performance here.
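
To sketch the idea (this is an illustration of the mechanism, not the actual patch in 0.7.59.1): after mmap'ing a block of a database file, posix_madvise with POSIX_MADV_WILLNEED asks the kernel to fetch the whole range ahead of the first access, which resembles the old read()-based behaviour of loading a block in one go instead of faulting it in page by page.

```cpp
// Illustrative sketch only; the file path and block layout are examples,
// and this is not the code from the 0.7.59.1 release.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map one block of a database file and hint the kernel to read it in one go.
void* map_block(int fd, off_t block_offset, std::size_t block_size)
{
  void* block = mmap(nullptr, block_size, PROT_READ, MAP_PRIVATE, fd, block_offset);
  if (block == MAP_FAILED)
    return nullptr;

  // Without this hint the pages are faulted in one by one on first access;
  // POSIX_MADV_WILLNEED lets the kernel schedule the whole block as read-ahead.
  posix_madvise(block, block_size, POSIX_MADV_WILLNEED);
  return block;
}

int main()
{
  const std::size_t block_size = 512*1024;                // e.g. a 512k block
  int fd = open("/opt/osm3s/db/ways.bin", O_RDONLY);      // example path
  if (fd < 0)
    return 1;

  void* block = map_block(fd, 0, block_size);             // first block, for illustration
  // ... read index entries / objects from the block ...
  if (block)
    munmap(block, block_size);
  close(fd);
  return 0;
}
```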

You can safely run the interpreter v0.7.59.1 in combination with the updater and dispatcher v0.7.59. Please do not run the update process for two versions in parallel, nor multiple dispatchers on the same $DB_DIR.

mmd-osm commented 1 year ago

Thank you for the update! While the I/O metrics on lz4.overpass-api.de look much better now (very similar to what they looked like before the upgrade to 0.7.59), I currently don't see an improvement on gorwen. The update to 0.7.59.1 happened at 10:37 UTC today:

(Screenshot, see https://prometheus.openstreetmap.org/d/_MGuv5SVk/overpass?orgId=1&from=1667396968010&to=1667667675434&viewPanel=6)

(Screenshot, see https://munin.openstreetmap.org/openstreetmap.org/gorwen.openstreetmap.org/overpass_db_lag.html)

drolbr commented 1 year ago

Thank you for the quick update. I agree that this does not look as if it had solved the right problem on gorwen. The dent in the disk utilization on gorwen between 10 UTC and 11 UTC is remarkable. There is no significant change in the number of processed requests, and I assume that during that hour the updates had been paused. In that case, there must be an independent problem with update_from_dir on gorwen.

drolbr commented 1 year ago

Could you please extract from the transactions.log as far back as possible the numbers for the read blocks per update? I suggest the command

cat $DB_DIR/transactions.log | grep -E 'write_commit.. start'

This should lead to a number of lines of the form

2022-11-05 22:47:04 [17385] write_commit() start 61158 43737 4429 36847 2345
2022-11-05 22:47:57 [17423] write_commit() start 59882 40908 1664 36400 2723
2022-11-05 22:49:03 [17474] write_commit() start 62284 43880 4141 37012 2591
2022-11-05 22:50:01 [17525] write_commit() start 35317 36683 5683 23717 7174

and the first number after "start" is the total number of read blocks. This should give some indication of whether the update process reads abnormally many blocks, or reads a normal amount and is slow for other reasons. The above excerpt is from an all-128-KB database. During catch-up I had around 260k blocks for 10 minutes, i.e. 600k to 700k for 30 minutes would be expected.

lonvia commented 1 year ago

Here are the grepped write-commit lines from the transaction logs on gorwen:

write-commit-log-gowen.log

I don't have the old database from before the 0.7.59 upgrade anymore, so can't go further back for comparison.

And just for completeness: the update delay on Sunday morning is unrelated to the performance problems. We had issues downloading replication diffs there.

drolbr commented 1 year ago

Thank you for the log files.

So from the app's perspective, everything looks fine. During the nights it is able to read up to 1000 blocks per second, but during the day this goes down to less than 80 blocks per second, which would still be reasonable for a hard disk if only a few blocks come from cache.

While on gorwen this has only started with version 0.7.59, on dev the same pattern of catching up during the night and falling back over the day has existed since at least September 2021, and probably earlier. Dev has no read load at all and 512K blocks, making it likely that changing the block size is not going to change the problem. However, on gorwen this pattern clearly did not emerge before version 0.7.59.

OTOH the update mechanism had worked quite reliably in the past on older production machines. I do not believe that HDDs have gotten slower since then, and the database design has only become slimmer, if it has changed at all. I'm still puzzled.

mmd-osm commented 1 year ago

on dev the same pattern of catching up during the night and falling back over the day has existed since at least September 2021, and probably earlier

I need to double check my log files, but iirc, database lag starts increasing over night on the dev instance. My assumption here is that the clone process is the main reason on the dev instance, as it’s causing a fair amount of i/o load (both read and write) and runs on hard disks. The cloning starts at around midnight and runs for most of the day, till around 5pm. Once cloning has finished, the replication lag gets smaller again. In any case, numbers on the dev instance are probably heavily skewed by the cloning process.

I would probably either start bisecting, or revert all mmap code and see if this helps.

drolbr commented 1 year ago

I need to double check my log files, but iirc, database lag starts increasing over night on the dev instance. My assumption here is that the clone process is the main reason on the dev instance, as it’s causing a fair amount of i/o load (both read and write) and runs on hard disks.

The updates run fine on dev until about 06:00 or 07:00 UTC and on some days more or less over the day. This does not match the very uniform process of cloning.

lonvia commented 1 year ago

That would indicate that it is really the update process that is too slow. That is also in line with the observation that gorwen sees the same delays on weekends, when read load is quite a bit lower.

I should point out that gorwen does not use the original update scripts from overpass that download and apply one minutely diff after the other. We don't use attic, so gorwen always downloads all available diffs, simplifies them (only keeping the latest version of each object) and applies the result to the database. As a consequence, the size of the diff files that overpass has to process is much larger than on the main overpass instances.

Maybe 0.7.59 has performance issues when handling larger diff files? I could try reducing the maximum diff size that is applied at once. At the moment the maximum allowed file size is 100MB. I'd have to log the actual sizes produced, but my guess is that they don't exceed 20MB.

lonvia commented 1 year ago

I've now added a monitor for the diff size to the Prometheus stats.

mmd-osm commented 1 year ago

The updates run fine on dev until about 06:00 or 07:00 UTC and on some days more or less over the day. This does not match the very uniform process of cloning.

Yes, my recollection wasn't quite accurate. I only have stats every 3 hours for the dev instance, and it looks like replication lag on the hard disk db is ok from about 19:00 UTC to 06:00 UTC the next day. Throughout the day, a lag from 30 minutes to 90 minutes doesn't seem to be uncommon. My thinking was that cloning adds some additional load (I/O + evicting pages from cache) on the dev machine, which isn't present on gorwen.

Gorwen, on the other hand, frequently shows a similarly large replication lag in the 19:00 - 24:00 UTC time window, which isn't the case on the dev box. In that time window, cloning is no longer running on the dev box.

(Screenshot: replication lag comparison.)

Note: dev box still runs on Overpass API 0.7.57.1 74a55df1

mmd-osm commented 1 year ago

Maybe 0.7.59 has performance issues when handling larger diff files? I could try reducing the maximum diff size that is applied at once. At the moment the maximum allowed file size is 100MB

On hard disks, you can see a frequent pattern where the updater first tries to apply one minutely diff after the other. With a bit more mapping activity, the updater fairly soon can't cope with the load anymore and starts processing multiple diffs at once. This usually helps to limit the db lag.

Processing the same diffs in smaller chunks often means more I/O: the same blocks are likely to be read and written multiple times. When processing several diffs at once, chances are that changes all happen on the same or similar blocks, which then reduces the total I/O load.

By the way, it's at least technically possible to process 1-3 days worth of data in one go on SSD (as daily diffs). Let's see what happens with the hard disks.

Also, I'm wondering if there's any chance to reduce the max disk utilization on gorwen. Maybe we won't see much of a difference as long as we're at 98% disk utilization. Can we maybe stop the area processing for a bit to give the system a bit more breathing space?

mmd-osm commented 1 year ago

I've compared old and new database on gorwen and they use exactly the same block sizes and have approximately the same total sizes. So block sizes is not the issue here. For completeness: both DBs use lz4 compression.

Maybe worth noting: I've double checked the index files in https://github.com/drolbr/Overpass-API/issues/680#issuecomment-1299146651, and my impression is that only area files are using lz4 compression, while everything else is uncompressed.

I've put a marker at offset +6, according to the code in https://github.com/drolbr/Overpass-API/blob/osm3s-v0.7.59.1/src/template_db/file_blocks_index.h#L624

area_tags_local.bin.idx
00000000  88 1d 00 00 0f 03 02 00  a3 0c 00 00 01 00 00 00  |................|
                            ^^
                            LZ4_COMPRESSION

node_keys.bin.idx
00000000  88 1d 00 00 10 03 00 00  08 00 00 00 05 00 00 00  |................|
                            ^^
                            NO_COMPRESSION

Values are defined here: https://github.com/drolbr/Overpass-API/blob/osm3s-v0.7.59.1/src/template_db/types.h#L86-L88
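
The same kind of standalone check can be sketched for this flag as well; the sketch below assumes the byte at offset 6 carries the values seen in these dumps (0 = no compression, 2 = lz4) and does not verify them against types.h:

```cpp
// Hypothetical helper, not part of Overpass: report the compression flag of a
// *.bin.idx file, assuming byte 6 of the header holds it (0 = none, 2 = lz4).
#include <fstream>
#include <iostream>

int main(int argc, char* argv[])
{
  if (argc < 2)
  {
    std::cerr << "Usage: " << argv[0] << " FILE.bin.idx\n";
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  unsigned char header[8] = {};
  if (!in.read(reinterpret_cast<char*>(header), sizeof(header)))
    return 1;
  const char* mode = header[6] == 0 ? "NO_COMPRESSION"
                   : header[6] == 2 ? "LZ4_COMPRESSION"
                   : "other (see template_db/types.h)";
  std::cout << argv[1] << ": " << mode << "\n";
  return 0;
}
```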

I also compared those values with a db on the dev box, which is known to have lz4 both for area and base data. Here I'm seeing the following results:

area_tags_global.bin.idx
00000000  88 1d 00 00 10 03 02 00  9f a1 00 00 03 00 00 00  |................|
area_tags_local.bin.idx
00000000  88 1d 00 00 0f 03 02 00  0c 00 00 00 04 00 00 00  |................|
node_attic_indexes.bin.idx
00000000  88 1d 00 00 0e 03 02 00  cc 02 00 00 04 00 00 00  |................|
node_changepack.bin.idx
00000000  88 1d 00 00 0e 03 02 00  00 00 00 00 05 00 00 00  |................|
node_keys.bin.idx
00000000  88 1d 00 00 10 03 02 00  00 00 00 00 06 00 00 00  |................|
nodes_attic.bin.idx
00000000  88 1d 00 00 0e 03 02 00  00 00 00 00 08 00 00 00  |................|
nodes_attic_undeleted.bin.idx
00000000  88 1d 00 00 0e 03 02 00  00 00 00 00 06 00 00 00  |................|

lonvia commented 1 year ago

Oh dear, you are right. There is an override for the setting on gorwen. I wonder why areas are compressed then.

We probably should try a reimport and see if that fixes stuff. Is it worth trying z compression right away? Or can I just convert the existing database, to be faster?

mmd-osm commented 1 year ago

z compression is somewhat slower and needs more CPU, but the compression ratio is better than lz4's. I don't have exact numbers on how much the disk savings will be, as I'm only using lz4 these days.

A reimport will likely take another week. I would first try to create a clone of the current db with lz4 compression enabled:

Option 1: bin/osm3s_query --clone=/path/to/new_lz4_db --clone-compression=lz4
Option 2: bin/osm3s_query --clone=/path/to/new_lz4_db --clone-compression=lz4 --clone-map-compression=lz4

I think both production instances use option 1; for my DBs, I usually use option 2.

Both options assume that the dispatcher process is running. Updates can continue while cloning. However, as gorwen has quite some I/O load, I would halt area processing and minutely diff updates while cloning the db.

lonvia commented 1 year ago

gorwen has plenty of CPU to spare. I'll give the full z compression a try and report back.

lonvia commented 1 year ago

I went with lz4 in the end because libz was not compiled in. That has pretty much solved the issue and gorwen is now able to keep up with minutelies. You can also nicely see that the load on the server has absolutely no influence on the updates.

So the regression is somewhere with the non-compression code. Not sure if it's worth investigating further. Feel free to close as 'won't fix'.

drolbr commented 1 year ago

Thank you for continuing to try until it has been fixed.

I've still no idea why this makes such a difference, so I will keep this one open with low priority to have a handle on it.

mmd-osm commented 1 year ago

Disk utilization is a bit better now, but we’re still not back at mid July values, when Ubuntu 22.04 was first deployed on gorwen: https://munin.openstreetmap.org/openstreetmap.org/gorwen.openstreetmap.org/diskstats_utilization/sda.html

I’m not exactly sure if the database was uncompressed from July till mid October with 0.7.57 still being used. This would be interesting to know as a reference point.

lonvia commented 1 year ago

The 0.7.57 database was exactly the same: uncompressed except for the area index.

You can't compare the disk utilisation between July and October. Area processing wasn't running then. Only the last two weeks before the 0.7.59 upgrade are comparable and disk use was much higher then than it is now.

mmd-osm commented 1 year ago

Good point. I'm adding another screenshot from Prometheus so it won't get lost in a few weeks time.

Breakdown of the series of events:

So the regression is somewhere with the non-compression code.

Exactly, lz4 compression saved the day; 0.7.59 using an uncompressed db[^1] won't be able to keep the lag in the range of a few minutes on this hardware anymore.

(Screenshot from Prometheus.)

[^1]: uncompressed db means: non-area files without compression, area files used lz4 compression