Closed by danpat 5 months ago
Benchmark `mmap`d data access vs heap - what, if any, penalty is there? How does this change when the file we `mmap` is on a ramdisk?
Benchmarks revealed a ~10% slowdown relative to internal (heap) memory. We are putting this on ice for the moment.
Did some quick benchmarking on OSX while on the plane this afternoon. Using this test: https://gist.github.com/danpat/67e6ab63836ffbcc4d832e7db509a5b5
On an OSX ramdisk:
RAM access Run1: 21.026866s wall, 20.580000s user + 0.150000s system = 20.730000s CPU (98.6%)
RAM access Run2: 18.135529s wall, 18.070000s user + 0.040000s system = 18.110000s CPU (99.9%)
RAMdisk mmap Run1: 20.520104s wall, 19.460000s user + 0.790000s system = 20.250000s CPU (98.7%)
RAMdisk mmap Run2: 19.265660s wall, 18.490000s user + 0.730000s system = 19.220000s CPU (99.8%)
On the regular OSX filesystem:
RAM access Run1: 17.700162s wall, 17.650000s user + 0.030000s system = 17.680000s CPU (99.9%)
RAM access Run2: 17.893318s wall, 17.820000s user + 0.040000s system = 17.860000s CPU (99.8%)
Disk mmap Run1: 19.178829s wall, 18.200000s user + 0.740000s system = 18.940000s CPU (98.8%)
Disk mmap Run2: 19.359454s wall, 18.440000s user + 0.780000s system = 19.220000s CPU (99.3%)
I'm not quite sure what this is telling me; I suspect I need to run more samples. I played with a few different `madvise` values: `MADV_RANDOM` added about a 25% slowdown to the `mmap` runs when enabled, but had no effect on the direct RAM access.
My machine has 16GB of RAM and I have plenty free, so I'm fairly confident that filesystem caching was in full effect and nothing got swapped out. OSX also performs memory compression when things get tight, but I didn't see that kick in either.
/cc @daniel-j-h
I took a look at some logs from my previous tests, and I think I might've been paging some stuff to swap after all. I halved the data size (4GB to 2GB) and shrank the ramdisk a bit.
I also removed `std::rand()` and just used `i * BIGPRIME % ARRAYSIZE` to access elements during the loop. While I was seeding with `std::srand()`, and `std::rand()` should be consistent when seeded, I'm not 100% clear what's happening under the covers, so I removed it as a possible variable.
Results now look like this.

Tests on the ramdisk volume:
RAM access Run1: 11.017670s wall, 10.960000s user + 0.030000s system = 10.990000s CPU (99.7%)
RAM access Run2: 11.398677s wall, 11.330000s user + 0.030000s system = 11.360000s CPU (99.7%)
RAMdisk mmap Run1: 11.630367s wall, 11.240000s user + 0.360000s system = 11.600000s CPU (99.7%)
RAMdisk mmap Run2: 11.878009s wall, 11.480000s user + 0.370000s system = 11.850000s CPU (99.8%)
Tests on the regular filesystem:
RAM access Run1: 11.302447s wall, 11.080000s user + 0.050000s system = 11.130000s CPU (98.5%)
RAM access Run2: 10.781652s wall, 10.730000s user + 0.030000s system = 10.760000s CPU (99.8%)
Disk mmap Run1: 12.049692s wall, 11.430000s user + 0.460000s system = 11.890000s CPU (98.7%)
Disk mmap Run2: 12.164826s wall, 11.710000s user + 0.380000s system = 12.090000s CPU (99.4%)
Overall, ¯\_(ツ)_/¯. Seems like `mmap` on the regular filesystem on OSX is a bit slower (~10%). On OSX's ramdisk (e.g. `diskutil erasevolume HFS+ 'RAM Disk' $(hdiutil attach -nomount ram://8485760)` for a 4GB disk), we do see some speedup that brings it pretty close to direct RAM access.
@daniel-j-h do you have details of how you tested this on Linux?
@danpat can you have a look - you refactored the data facades.
Is this ticket still relevant and actionable?
We could still do this - in fact, things are slowly getting easier as we refactor the I/O handling.
Let's keep this open as a feature request - one day, down the road, somebody might implement it :-) Keeping this history will be useful.
First step towards this was done in #4881. For further gains we would need to mmap every input file separately.
`mmap`-ing individual files has been done in https://github.com/Project-OSRM/osrm-backend/pull/5242
The only thing that PR doesn't complete from our original list is:
OSRM currently supports reading data from files into heap memory (`InternalDataFacade`), or pre-loading data into shared memory using IPC shared memory blocks (`SharedDataFacade` + `osrm-datastore`). We can consolidate the behaviour of both of these by using `mmap`. Instead of reading files into memory explicitly, we should be able to `mmap` the data files and immediately begin using them. There are a few changes that need to be made to get us there:

- Benchmark `mmap`d data access vs heap - what, if any, penalty is there? How does this change when the file we `mmap` is on a ramdisk?
- […] `mmap`ed and fix them - basically anything in `osrm-datastore` (`src/storage/storage.cpp`) that isn't just loaded into memory in one big blob. The problem here is `vector<bool>` and its proxy behavior; we need a contiguous container we can `memcpy` to.
- […] `SharedDataFacade` and perform similar `.swap` operations against `mmap`ed memory addresses rather than `shm` addresses.
- […] `mmap`ed files on-the-fly
- […] `mmap` instead of explicit `read` of disk files for leaf nodes in the StaticRTree to boost performance (coordinate lookups represent the largest part of any given routing query because of the I/O in the rtree).

The main goal here is to minimize double-reads of data. In situations where we are constantly cycling out data sets (as in the case of traffic updates), we want to minimize I/O and the number of times any single bit of data gets touched. By using `mmap` and `tmpfs`, we can emulate the current shared-memory behavior but avoid an extra pass over the data.

For normal `osrm-routed` use, we would essentially get lazy loading of data - `osrm-routed` would start up faster, but queries would be slower, since pages are loaded from disk on demand until the data has been touched and lives in the filesystem cache. This initial slowness could be avoided by pre-seeding the data files into the filesystem cache, or via `MAP_POPULATE` (Linux 2.5.46+), and this could be done in parallel to `osrm-routed` already starting up and answering queries.

/cc @daniel-j-h @TheMarex