Project-OSRM / osrm-backend

Open Source Routing Machine - C++ backend
http://map.project-osrm.org
BSD 2-Clause "Simplified" License

Implement mmapDataFacade #1947

Closed danpat closed 5 months ago

danpat commented 8 years ago

OSRM currently supports reading data from files into heap memory (InternalDataFacade), or pre-loading data into shared memory using IPC shared memory blocks (SharedDataFacade+osrm-datastore).

We can consolidate the behaviour of both of these by using mmap: instead of reading files into memory explicitly, we should be able to mmap the data files and begin using them immediately.
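
As a rough, hedged sketch of the idea (not OSRM's actual facade code), mapping a file of fixed-size records lets us treat it directly as a read-only array; the `MapEdgeFile` helper and the element type below are hypothetical:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hedged sketch: map a file of fixed-size records read-only and use the
// mapping directly as an array, instead of read()-ing it into heap memory
// first. MapEdgeFile and the element type are hypothetical stand-ins.
struct MappedEdges
{
    const std::uint64_t *data = nullptr;
    std::size_t count = 0;
};

inline MappedEdges MapEdgeFile(const char *path)
{
    const int fd = ::open(path, O_RDONLY);
    if (fd == -1)
        throw std::runtime_error("open failed");

    struct stat sb;
    if (::fstat(fd, &sb) == -1)
    {
        ::close(fd);
        throw std::runtime_error("fstat failed");
    }

    void *addr = ::mmap(nullptr, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    ::close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED)
        throw std::runtime_error("mmap failed");

    return {static_cast<const std::uint64_t *>(addr),
            static_cast<std::size_t>(sb.st_size) / sizeof(std::uint64_t)};
}
```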

There are a few changes that need to be made to get us there:

The main goal here is to minimize double-reads of data. In situations where we are constantly cycling out data sets (in the case of traffic updates), we want to minimize I/O and the number of times any single bit of data gets touched. By using mmap and tmpfs, we can emulate the current shared-memory behavior but avoid an extra pass over the data.

For normal osrm-routed use, we would essentially get lazy loading of data - osrm-routed would start up faster, but queries would initially be slower, since pages are loaded from disk on demand until the data has been touched and lives in the filesystem cache. This initial slowness could be avoided by pre-seeding the data files into the filesystem cache or by mapping with MAP_POPULATE (Linux 2.5.46+), and this could be done in parallel with osrm-routed starting up and answering queries.
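
A minimal sketch of the pre-faulting idea, assuming a POSIX environment; `MapPrefaulted` is a hypothetical helper, and the non-Linux branch substitutes an `madvise(MADV_WILLNEED)` hint for `MAP_POPULATE`:

```cpp
#include <sys/mman.h>

#include <cstddef>

// Hedged sketch of pre-faulting a read-only mapping so early queries don't
// pay page-fault latency. MAP_POPULATE is Linux-only (2.5.46+); on other
// platforms this falls back to an madvise(MADV_WILLNEED) read-ahead hint.
// MapPrefaulted is a hypothetical helper, not an existing OSRM function.
inline void *MapPrefaulted(int fd, std::size_t size)
{
#ifdef __linux__
    return ::mmap(nullptr, size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
#else
    void *addr = ::mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr != MAP_FAILED)
        ::madvise(addr, size, MADV_WILLNEED);
    return addr;
#endif
}
```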

/cc @daniel-j-h @TheMarex

TheMarex commented 8 years ago

Benchmark mmap'd data access vs. heap - what penalty, if any, is there? How does this change when the file we mmap is on a ramdisk?

Benchmarks revealed a 10% slowdown w.r.t. internal memory. We are putting this on ice for the moment.

danpat commented 8 years ago

Did some quick benchmarking on OSX while on the plane this afternoon. Using this test: https://gist.github.com/danpat/67e6ab63836ffbcc4d832e7db509a5b5

On an OSX ramdisk:

RAM access Run1: 21.026866s wall, 20.580000s user + 0.150000s system = 20.730000s CPU (98.6%)
RAM access Run2: 18.135529s wall, 18.070000s user + 0.040000s system = 18.110000s CPU (99.9%)
RAMdisk mmap Run1: 20.520104s wall, 19.460000s user + 0.790000s system = 20.250000s CPU (98.7%)
RAMdisk mmap Run2: 19.265660s wall, 18.490000s user + 0.730000s system = 19.220000s CPU (99.8%)

On the regular OSX filesystem:

RAM access Run1: 17.700162s wall, 17.650000s user + 0.030000s system = 17.680000s CPU (99.9%)
RAM access Run2: 17.893318s wall, 17.820000s user + 0.040000s system = 17.860000s CPU (99.8%)
Disk mmap Run1: 19.178829s wall, 18.200000s user + 0.740000s system = 18.940000s CPU (98.8%)
Disk mmap Run2: 19.359454s wall, 18.440000s user + 0.780000s system = 19.220000s CPU (99.3%)

I'm not quite sure what this is telling me; I suspect I need to run more samples. I played with a few different madvise values. MADV_RANDOM added about a 25% slowdown to the mmap calls when enabled, but had no effect on the direct RAM access.
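
For context, the hint is a one-liner applied per mapping; this is just a sketch, with `addr`/`length` assumed to be whatever the earlier mmap call returned:

```cpp
#include <sys/mman.h>

#include <cstddef>

// Hedged sketch: MADV_RANDOM disables kernel read-ahead for the mapping;
// in the runs above it added roughly a 25% slowdown to the mmap case.
// addr/length are assumed to come from the mmap call shown earlier.
inline void HintRandomAccess(void *addr, std::size_t length)
{
    ::madvise(addr, length, MADV_RANDOM);
}
```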

My machine has 16GB of RAM and I have plenty free, so I'm fairly confident that filesystem caching was in full effect and nothing got swapped out. OSX also performs memory compression when things get tight, but I didn't see that kick in either.

/cc @daniel-j-h

danpat commented 8 years ago

I took a look at some logs from my previous tests, and I think I might've been paging some stuff to swap after all. I halved the data size (4GB to 2GB) and shrank the ramdisk a bit. I also removed std::rand() and just used i * BIGPRIME % ARRAYSIZE to access elements during the loop. Although I was seeding with std::srand(), and std::rand() should be consistent for a given seed, I'm not 100% clear on what's happening under the covers, so I removed it as a possible variable.
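
A minimal sketch of that deterministic access pattern, with illustrative constants rather than the values from the gist:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Hedged sketch of the access pattern described above: stride through the
// array with a large odd constant modulo its (power-of-two) size, so every
// run touches the same pseudo-random-looking but deterministic sequence.
int main()
{
    constexpr std::size_t kArraySize = 1ull << 28;   // 2 GB of uint64_t
    constexpr std::size_t kBigPrime = 1000000007ull; // odd, so the stride
                                                     // visits every index

    std::vector<std::uint64_t> data(kArraySize, 1);

    const auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < kArraySize; ++i)
        sum += data[(i * kBigPrime) % kArraySize];
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "sum=" << sum << " elapsed="
              << std::chrono::duration<double>(stop - start).count() << "s\n";
}
```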

Results now look like this. Tests on the ramdisk volume:

RAM access Run1: 11.017670s wall, 10.960000s user + 0.030000s system = 10.990000s CPU (99.7%)
RAM access Run2: 11.398677s wall, 11.330000s user + 0.030000s system = 11.360000s CPU (99.7%)
RAMdisk mmap Run1: 11.630367s wall, 11.240000s user + 0.360000s system = 11.600000s CPU (99.7%)
RAMdisk mmap Run2: 11.878009s wall, 11.480000s user + 0.370000s system = 11.850000s CPU (99.8%)

Tests on the regular filesystem:

RAM access Run1: 11.302447s wall, 11.080000s user + 0.050000s system = 11.130000s CPU (98.5%)
RAM access Run2: 10.781652s wall, 10.730000s user + 0.030000s system = 10.760000s CPU (99.8%)
Disk mmap Run1: 12.049692s wall, 11.430000s user + 0.460000s system = 11.890000s CPU (98.7%)
Disk mmap Run2: 12.164826s wall, 11.710000s user + 0.380000s system = 12.090000s CPU (99.4%)

Overall, ¯\_(ツ)_/¯. Seems like mmap on the regular filesystem on OSX is a bit slower (~10%). On OSX's ramdisk (e.g. diskutil erasevolume HFS+ 'RAM Disk' $(hdiutil attach -nomount ram://8485760) for a 4GB disk), we do see some speedup that brings it pretty close to direct RAM access.

@daniel-j-h do you have details of how you tested this on Linux?

daniel-j-h commented 8 years ago

https://github.com/mapbox/tmpfs-mmap-zero-copy

daniel-j-h commented 7 years ago

@danpat can you have a look - you refactored the data facades.

Is this ticket still relevant and actionable?

danpat commented 7 years ago

We could still do this - in fact, things are slowly getting easier as we refactor the I/O handling.

Let's keep this open as a feature request - one day, down the road, somebody might implement it :-) Keeping this history will be useful.

TheMarex commented 6 years ago

First step towards this was done in #4881. For further gains we would need to mmap every input file separately.

danpat commented 6 years ago

mmap-ing individual files has been done in https://github.com/Project-OSRM/osrm-backend/pull/5242

The only thing that PR doesn't complete from our original list is: