drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0

Are there ways to reduce SSD wear? #628

Closed jogemu closed 3 years ago

jogemu commented 3 years ago

On a Hyper-V virtual Ubuntu server I observed approximately 1 TB of writes per day while Overpass-API was applying updates. At first I suspected that Ubuntu was detecting the drive as an HDD (rotational is true) and consequently relocating files on the fly to prevent fragmentation, but I read in an article that the rotational flag is unreliable, and I was able to verify that Ubuntu was well aware that the drive supports TRIM. I observed similar wear with dynamic and fixed-size virtual hard disks.
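For anyone trying to reproduce this check, the rotational flag and TRIM support can be inspected from inside the guest roughly like this (a minimal sketch; sda is a placeholder for your actual device):

# 1 = kernel treats the device as rotational (HDD), 0 = non-rotational (SSD)
cat /sys/block/sda/queue/rotational

# non-zero DISC-GRAN/DISC-MAX values indicate discard/TRIM support
lsblk --discard /dev/sda

# check whether periodic TRIM is enabled (Ubuntu enables fstrim.timer by default)
systemctl status fstrim.timer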

Increasing the RAM, on the assumption that swap was the problem, didn't make a substantial difference either. Swap usage stayed at roughly the same low MB level and only occasionally spiked. I have to admit that my RAM is significantly below the 64 GB that would allow the full update to be held temporarily in memory. Most of the time dynamic memory reduces the RAM to 2 GB; increases in memory seem to be linked to big flushes/loads of updates. Even with dynamic memory disabled it does not seem to make much of a difference, though perhaps much more RAM would help.

Can I expect less SSD wear on bare-metal machines, or is this amount of wear expected? The only explanation I can imagine is that the binary files storing the data have insufficient buffers to accommodate new data along the space-filling curve. Subsequent new data (which should be at most 100 GB a day) then moves large parts of the existing data every time the database grows. A simplified way to put it is that an SSD will write the moved old data to a free cell with the fewest write cycles anyway, so the file ends up scattered across the SSD regardless, and the old cell of the file does not have a different seek time than the new one.

My suspicion is that either the binaries or the VHDX file prevents the SSD from reusing the already existing cell at a different position in the file. Is there any way to test whether SSD wear decreases with larger buffers along the space-filling curve?

mmd-osm commented 3 years ago

Fwiw: to my knowledge SSD wear has never been an issue on any of the production Overpass servers, which have been in use for many years. All of them handle millions of requests per day and update data every minute. Not sure about your Hyper-V setup; we're not using any virtualisation technique.

jogemu commented 3 years ago

Thank you for not closing my issue even though I did not respond for more than a week. I had to do some time-intensive testing to rule out possible reasons for the SSD wear. Someone I asked had observed that Samsung's firmware can require substantially more writes than the OS intended. That was not the case here: according to Windows Performance Monitor, the OS itself intended to write approximately 42 GB/hour, and the firmware writes were not significantly higher than that. I assume it is no coincidence that 42 GB/hour times 24 hours/day is 1008 GB/day, which is almost exactly the 1 TB I observed when I looked at the SSD wear.
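On the Linux side, the same kind of write-volume check can be done with standard tools; a minimal sketch, assuming the device is called sda and the sysstat package is installed for iostat:

# cumulative data written per block device since boot (field 10 of /proc/diskstats is sectors written)
awk '$3 == "sda" { print $10 * 512 / 1024^3 " GiB written since boot" }' /proc/diskstats

# or sample the per-device write rate in MB/s over one-minute intervals
iostat -dm 60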

Writing that much for many years should not be an issue for modern server-grade SSDs. However, SSDs have improved a lot over that timespan, and a server-grade drive from years ago might be close to a normal SSD today. My SSD could handle more than 2 years, with some improvement through guest services, but it still seems strange to me that updates smaller than 100 kB can cause writes in the two-digit GB range.

drolbr commented 3 years ago

FWIW: Overpass API writes to its database in large blocks, usually multiples of 128 KB. This may explain why you see substantial write amplification at the operating system level. For an SSD, this kind of amplification usually happens at the firmware level instead.
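To make the amplification concrete, here is a rough back-of-envelope, assuming a changed OSM element occupies on the order of 100 bytes inside a 128 KB database block that has to be rewritten as a whole (the per-element size is an assumption for illustration, not an Overpass-internal figure):

# bytes rewritten per block versus bytes actually changed
echo $(( 128 * 1024 ))          # 131072 bytes written for one block
echo $(( 128 * 1024 / 100 ))    # ~1310x amplification for a ~100-byte change
# so a diff well under 100 kB that touches a few hundred distinct blocks
# can already account for tens of MB of block writes per applied minute diff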

jogemu commented 3 years ago

I am sorry to bother you again. I did some more testing, and dumpe2fs provided some insight into the self-reported lifetime writes of the guest's (Ubuntu Server) ext4 filesystem. I now know that the high lifetime writes roughly match what I previously blamed Hyper-V/VHDX for. The filesystem's write behaviour should not be influenced by the hypervisor - hypervisor inefficiency may turn it into more physical writes, but the filesystem won't sync its ext4 counters with those physical writes - so I have good reason to believe that something in the configuration of the Ubuntu server is causing the issue.

I would like to compare my dumpe2fs output with that of a production Overpass server, but I understand that it contains unique identifiers and seeds that you wouldn't share for a production server, so I would be thankful if you could tell me whether e.g. the block size or the filesystem features differ from my dump. I use LVM, so I ran "sudo dumpe2fs /dev/dm-0 > extdump.txt" instead of using "/dev/sda" with the right partition number. The reason I thought of a production server is that it will have been in use for most of the time between "Filesystem created" and "Last mount time", which makes it easy to divide the "Lifetime writes" by days or months. My server was down for most of the 2 months since the filesystem was created, so I would divide by at most 14 days, but more likely about 9 days, as that would match the 1 TB/day.

Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              39321600
Block count:              157286400
Reserved block count:     7864320
Free blocks:              32660598
Free inodes:              39031387
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Jun  9 00:48:58 2021
Last mount time:          Sat Aug  7 22:39:26 2021
Last write time:          Sat Aug  7 22:39:25 2021
Mount count:              52
Maximum mount count:      -1
Last checked:             Wed Jun  9 00:48:58 2021
Check interval:           0 ()
Lifetime writes:          9 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Journal backup:           inode blocks
Checksum type:            crc32c
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00082dd2

Some of this information isn't updated until the filesystem is unmounted, but a server that has run for years will have been unmounted at least once after some months. The calculation does not need to be recent or accurate; an approximation would be enough.
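For the record, the division can be scripted directly against the dumpe2fs header; a minimal sketch assuming GNU date and an LVM device at /dev/dm-0 (adjust the device path, and note that the lifetime counter is only synced on unmount):

# read creation time and lifetime writes from the superblock header only (-h)
created=$(sudo dumpe2fs -h /dev/dm-0 2>/dev/null | grep 'Filesystem created:' | cut -d: -f2-)
writes=$(sudo dumpe2fs -h /dev/dm-0 2>/dev/null | grep 'Lifetime writes:' | cut -d: -f2-)
days=$(( ( $(date +%s) - $(date -d "$created" +%s) ) / 86400 ))
echo "$writes written over $days days since filesystem creation"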

mmd-osm commented 3 years ago

So even our SSD on the dev instance has written 512 TB in total over the past 6 years, which is roughly 10 GB per hour. This disk wasn't in use all the time, and not always applying minutely diffs, so expect higher values for productive 24/7 use. Extrapolating from a 10-minute timeframe, I would assume at least 1 TB/day.

OSM is high volume; uncompressed daily diffs are typically in the 1 GB range. With the block size mentioned by @drolbr and a spatially organized database of about 400 GB, it's easy to get to 1 TB of disk writes per day. As people are editing everywhere on the planet every day, you're essentially touching large parts of your database all the time. This is pretty much expected. Consumer-grade SSDs won't do here.
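As an order-of-magnitude illustration (the ~200 bytes per changed element is an assumption, not an official figure): ~1 GB of uncompressed daily diffs means millions of changed elements per day, and if most of them land in distinct 128 KB blocks, the block rewrites alone approach the observed volume:

# rough daily estimate: changed elements per day, each triggering one 128 KB block rewrite
elements=$(( 1024 * 1024 * 1024 / 200 ))          # ~5.3 million changed elements per day
echo $(( elements * 128 / 1024 / 1024 )) "GB"     # elements * 128 KB, expressed in GB (~655 GB)
# journaling, index blocks and repeated rewrites of hot blocks push this towards 1 TB/day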

jogemu commented 3 years ago

Ok, thank you for taking the time to check that. Now I know that this write load is not unrealistic and is within expectations; I appreciate being reassured. Even though I will probably look for a NAS or server SSD next time, I should be fine with my consumer-grade SSD, because it is rated for 720 TBW, which would still leave over 200 TB after the 6 years of use you mentioned for the dev instance.

If I don't update 24/7 but do the updates in bulk, e.g. every weekend, I can extend the minimum lifetime from 2 years (which would already depreciate the SSD by less than 10€/month) to 10 years or even more. I won't use the server that much anyway, probably only for testing, and the price seems cheap enough not to care if I end up using it more than expected.

However, maybe you can give me tips to reduce the stress. I know there are hourly, daily, weekly and monthly diff files available too. Is there a way to dynamically switch the diff interval? I would assume that I can use a daily diff even if the day starts at 00:00 and my server is already at 03:12. Even if that does not work, minutely updates could bring the database up to 00:00, then daily updates could get to the start of the next week, followed by weekly updates until the current week is reached. Then daily updates could be used until hourly and finally minutely updates are the only way to get to real time.

This might reduce the write stress to a level where downloading the full database is no longer the better way to catch up on a week of updates compared to applying all the minutely diffs. 400 GB versus 1 TB is kind of crazy, but I understand that the workload is that demanding.

mmd-osm commented 3 years ago

hourly, daily, weekly and monthly diff files available too

Diffs are provided as minutely, hourly, and daily diffs. There are weekly changeset dumps (not relevant for Overpass) and weekly planet dumps (full data, no diffs). I'm not aware of any data on https://planet.openstreetmap.org/ that is provided on a monthly basis.

Is there a way to dynamically switch diff intervall?

While it is certainly possible to load minutely, hourly, and daily diffs as appropriate, we don't provide any out-of-the-box scripts supporting this use case.
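One third-party option that roughly fits this use case is osmupdate from osmctools, which merges minutely, hourly, and daily replication diffs into a single change file covering the gap since a given timestamp; the merged .osc can then be fed to whatever update path is already in use for the Overpass database. This is only a sketch under those assumptions, not an officially supported workflow, and the timestamp and filenames are placeholders:

# install osmctools (Debian/Ubuntu package name)
sudo apt install osmctools

# merge all diffs since the given timestamp into one change file,
# automatically picking daily, hourly, and minutely granularities as needed
osmupdate 2021-08-01T00:00:00Z catchup.osc.gz

# the merged catchup.osc.gz is then applied with the same Overpass update
# tooling used for minutely diffs (e.g. apply_osc_to_db.sh)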