TheMarex opened this issue 6 years ago
I've been thinking a little bit about how we could evaluate the cache hit rate for CH and MLD with the least amount of effort. The easiest thing I could come up with is using point-to-point shortest path queries (the `routing_algorithms::search(..)` functions) to implement a "dummy" table plugin that runs a query for every entry in the matrix. That is useful since the normal point-to-point search makes it easy to run unpacking on the resulting path. By adding a dummy cache to the unpacking that just keeps track of whether it has seen a pair of node ids before (and does not store actual data), we can easily compute the cache hit/miss rate for a given number of queries.
One problem with having a cache is that we need to invalidate it every time we change the weighting. This is potentially a problem for MLD due to traffic updates every few minutes. So an important factor we need to find out is how many queries it takes on average until the cache is warm (i.e. has the desired hit/miss rate).
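As a concrete sketch, such a statistics-only "cache" could look like the following; the class name matches the `UnpackingStatistics` helper mentioned in the braindump below, but the implementation details are illustrative assumptions, not the actual OSRM code:

```cpp
// A minimal sketch of the proposed "dummy cache": it stores no payload, only
// whether a (start, target) pair of node IDs was seen before, and counts
// hits/misses so we can measure the warm-up behaviour.
#include <cstddef>
#include <cstdint>
#include <unordered_set>

using NodeID = std::uint32_t;

struct UnpackingStatistics
{
    std::size_t hits = 0;
    std::size_t misses = 0;

    // Record one unpacking of the shortcut (start, target).
    void Record(NodeID start, NodeID target)
    {
        const auto key = (std::uint64_t{start} << 32) | target;
        if (seen.insert(key).second)
            ++misses; // first time we unpack this pair
        else
            ++hits; // a real cache would return a stored duration here
    }

    double HitRate() const
    {
        const auto total = hits + misses;
        return total == 0 ? 0.0 : static_cast<double>(hits) / total;
    }

  private:
    std::unordered_set<std::uint64_t> seen;
};
```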
Braindump on how to implement this:

- Iterate over `column_index` and `row_index` here
- Look up the source `phantom_nodes[source_indices[row_index]]` and destination `phantom_nodes[destination_indices[column_index]]`
- Run a bidirectional `ShortestPath` query for each pair. Use `InternalRouteResult::duration()` to return the duration and fill out the entries in the tables.
- Implement an `UnpackingStatistics` helper that stores `NodeID, NodeID` tuples
- Call `UnpackingStatistics` from here with the current start and target node id.

My use case: computing large matrices like 2000x2000, up to 10000x10000. For such matrices MLD is impractical; we use CH. Our main reason for choosing OSRM was its ability to compute such large matrices quickly. With the change you are talking about, do you think the speed on such large matrices will stay the same?
For the record: for us, increasing memory usage by adding other values (e.g. distance) alongside weight and time is acceptable if the speed stays the same.
@frodrigo I don't think we will drop support for CH any time soon, for the reason you pointed out.
This issue is about switching the computation of these values from pre-processing to the query phase. If you don't limit the cache size, that should eventually yield similar performance, since every edge will have an entry in the cache. Whether there is a penalty, and how big it is, is something we would need to find out; we could still keep pre-computation as an option, or add some sort of cache warming.
Okay, so I ran two sets of requests to test out how a cache would perform:
1) 5 coordinates, 1000 requests, on Berlin
2) 25 coordinates, 1000 requests, on Berlin
Results for Run 1
The numbers look optimistic! Out of a total of 442,623 edges to unpack, the cache would have 349,508 hits and 93,115 misses after 1000 random requests of 5 coordinates each in Berlin. The cache would "warm up" at around 87,000 lookups.
Results for Run 2
Out of a total of 10,942,902 edges to unpack, the cache would have 10,758,094 hits and 184,808 misses after 1000 random requests of 25 coordinates each in Berlin. The cache would "warm up" at around 44,000 lookups.
So things to note:
The next thing I'm going to run is random requests on North America, with 5 coordinates to get a better sense of performance for the above proposed cache.
@ghoshkaj Can you also track the number of entries in the cache here? This will be important, because the cache will need to be limited in size.
If the cache can be infinitely large, then eventually all hits will be cache hits :-)
@danpat I ran a couple of requests and the cache size is always equal to the number of misses. I'm using the cache.size() function.
@ghoshkaj that makes sense, and the shape of the `misses` line matches what we should expect as the most common edges get cached.
Is it easy for you to regenerate this for, say, 10k or 50k requests? It would be good to confirm that `misses` continues to flatten.
@danpat @TheMarex here is what it looks like for 5 coordinates and 10,000 requests on the US-continent dataset.
20 coordinates and 10,000 requests on the US-continent dataset:
Those are both for CH. However, for MLD it's a bit different. This is what 5 coordinates with 100 requests on the Berlin dataset looks like, which is a bit worrisome because the misses are so much higher than the hits.
Here is what 20 coordinates, 100 requests in Berlin looks like. The hits eventually catch up to the misses and surpass them.
It looks like the cache benefits CH more than MLD matrix calculations. I'm repeating these for 10k requests and 5k requests on the extract for the US and will post them ASAP.
I tried using smaller cell sizes:
./osrm-partition berlin-latest.osrm --max-cell-sizes=64,512,2048,8192,32768,131072
100 requests, 5 coordinates, within Berlin.
Also fewer but larger cells:
./osrm-partition berlin-latest.osrm --max-cell-sizes=2000,32000,1000000
100 requests, 5 coordinates, within Berlin.
There don't seem to be significant differences.
Okay I had a mistake in my implementation for finding stats for MLD.
Here is a collection of all of the stats that I have now collected about CH and MLD:
Crosses over at around 4000 accesses
Crosses over at around 10 cache accesses
Crosses over at around 3700 cache accesses
Crosses over at around 5 cache accesses
As @TheMarex expected, MLD performs far better with the cache! Also, these numbers look really good!
Okay, we're going ahead with implementing a cache for the annotations computation in PR #4876, following the steps in the original comment above for CH.
I've been reading through the steps and the code and I think the following is my plan of action. These are in the form of questions, and I would really like to know the answers to numbers 2, 4 and 6. The others I'm simply looking for a confirmation for (unless I've understood the concept incorrectly of course!) @TheMarex would you please 👍 or 👎 and/or explain the bits I've got wrong? 😸
- Don't save the duration in the heap but only the parent node (e.g. here). This means we can remove the special `ManyToManyHeap` from `SearchEngineData`, which also saves us some memory.
1) To confirm, this would mean that at the end of implementing this cache and method of computing annotations, we would stop storing the durations data in the graph edges themselves (and stop summing the durations while the shortest-path algorithm runs to populate the durations table), and the way of calculating durations would also use this cache approach, right?
- Add a matrix that stores the NodeID of every middle node for every source/target pair
2) Is this the cache then? Or simply a data structure to hold the paths, which we would then unpack and compute the total durations for within the many_to_many_ch.cpp file/table plugin?
3) Seeking confirmation on this context-loading question/documentation: relaxOutgoingEdge() within the table plugin file refers to the concept of going through all the unexplored edges from a node and updating the shortest path (given shorter weight and duration) to all connected nodes, correct? That's what the code says, and it's confirmed by this Masters thesis I found on the internet; I just want to make sure that's all it refers to.
Look! Tense and relaxed 😸 :
4) However, within relaxOutgoingEdges() there is this function stallAtNode(), and I'm not quite sure what it does. What does it do? It seems to be looking through all the adjacent edges of a node and checking if reversing the directions is necessary, and then what is the rest doing? Is "stall" a term of art, similar to how "relax"-ing the edges is a term of art?
5) The direction-reversing bit in stallAtNode() brings me to the concept of reversing the direction and, possibly related, this function: backwardRoutingStep(). To confirm, this is the backward step of a bidirectional Dijkstra, yes?
- After we have calculated the full weight row here, we run over the row and unpack the paths `start, middle` and `middle, target`, which yields a list of `EdgeBasedNode` ids, for each of which we extract the segment durations and sum them up.
6) So when you say "unpack the path [start, middle] and [middle, target]", do you mean we go through the middle_nodes matrix from step 3 (of your list of steps)? Or are we actually calling unpack_path manually at this point? I understand that the whole path is available at this point, but I'm not sure how to access it!
7) Over voice you suggested that I use the same function signature as `extractRoute(facade, weight, phantom_nodes, unpacked_nodes, unpacked_edges)`. I see how I can access the `facade`, the `weight`, maybe even the `phantom_nodes`, but I'm unclear on how to access the `unpacked_nodes` and `unpacked_edges`.
8) To confirm: we talked about using the portion of [extractRoute()](https://github.com/Project-OSRM/osrm-backend/blob/5531cace7fb41972117fcb9b1b2455946b73d90a/include/engine/routing_algorithms/routing_base.hpp#L371) that simply looks for the durations and adds them up, and it looks like the actual heavy lifting happens in annotatePath(). So my current plan of action is to look at the annotatePath function and figure out which pieces to pick out from there.
I think once I've implemented number 8, I'll have the answers to number 7. However, number 2/6 is pretty puzzling to me still. And I would really like to know the answer to 4 as a side curiosity!
To confirm, this would mean that at the end of implementing this cache and method of computing annotations, we would stop storing the durations data in the graph edges themselves (and stop summing the durations while the shortest-path algorithm runs to populate the durations table), and the way of calculating durations would also use this cache approach, right?
Yep.
Is this the cache then? Or simply a data structure to hold the paths that we would then unpack and compute the total durations for within the many_to_many_ch.cpp file/table plugin?
No, the cache works the same way as the statistics logger you already implemented on the unpacked path. The node matrix is for storing the middle node, which is the meeting point of the forward and backward search. We need the middle point to extract the packed path. Since we only store the ID of the parent in the heap, we need to go back from the meeting point in the forward/backward direction until we hit the source/target.
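As a sketch, that back-walk could look like the following, assuming the heap entries carry a `parent` field and the source node is its own parent (which terminates the walk); the accessor names are illustrative, not the exact OSRM signatures:

```cpp
// Walk from the meeting point back to the source by following parent
// pointers stored in the forward heap, then reverse to get source -> middle.
#include <algorithm>
#include <cstdint>
#include <vector>

using NodeID = std::uint32_t;

template <typename Heap>
std::vector<NodeID> retrievePackedPath(const Heap &forward_heap, const NodeID middle)
{
    std::vector<NodeID> packed_path;
    NodeID node = middle;
    while (forward_heap.GetData(node).parent != node) // source is its own parent
    {
        packed_path.push_back(node);
        node = forward_heap.GetData(node).parent;
    }
    packed_path.push_back(node); // finally add the source itself
    std::reverse(packed_path.begin(), packed_path.end());
    return packed_path;
}
```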
Seeking confirmation on this context loading question/documentation: relaxOutgoingEdge() within the table_plugin file refers to the concept of going through all the unexplored edges from a node and updating the shortest path (given shorter weight and duration) to all connected nodes, correct?
Exactly.
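For reference, the relax step being confirmed here looks roughly like this in sketch form; the graph/heap accessors are assumptions for illustration, not the exact OSRM signatures:

```cpp
// Relaxing outgoing edges: try to improve the tentative weight of every
// neighbour of `node`, recording `node` as the parent on improvement.
template <typename Graph, typename Heap, typename NodeID, typename EdgeWeight>
void relaxOutgoingEdges(const Graph &graph, const NodeID node, const EdgeWeight weight, Heap &heap)
{
    for (const auto edge : graph.GetAdjacentEdgeRange(node))
    {
        const NodeID to = graph.GetTarget(edge);
        const EdgeWeight to_weight = weight + graph.GetEdgeData(edge).weight;

        if (!heap.WasInserted(to))
        {
            heap.Insert(to, to_weight, {node}); // first time we reach `to`
        }
        else if (to_weight < heap.GetKey(to))
        {
            heap.GetData(to).parent = node; // found a shorter way to `to`
            heap.DecreaseKey(to, to_weight);
        }
    }
}
```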
However, within the relaxOutgoingEdges(), there is this function stallAtNode() I'm not quite sure what this does. What does it do?
It tries to figure out whether this node can be on the fastest path. The precise reason why this works is a little complicated and intimately connected to how Contraction Hierarchies work.
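"Stall" is indeed a term of art: "stall-on-demand" from the Contraction Hierarchies literature. In sketch form (accessors again illustrative), the check asks whether `node` could be reached more cheaply through an edge that only the opposite search direction may use; if so, `node` cannot lie on a shortest path and its outgoing edges need not be relaxed:

```cpp
// Stall-on-demand: if some neighbour `to`, reachable from `node` only via a
// reverse-direction edge, already has a settled weight such that
// weight(to) + edge weight < weight(node), then `node`'s tentative weight is
// not optimal, so we "stall" `node` and skip relaxing its outgoing edges.
template <typename Graph, typename Heap, typename NodeID, typename EdgeWeight>
bool stallAtNode(const Graph &graph, const NodeID node, const EdgeWeight weight, const Heap &heap)
{
    for (const auto edge : graph.GetAdjacentEdgeRange(node))
    {
        const auto &data = graph.GetEdgeData(edge);
        if (data.backward) // edge usable in the opposite search direction
        {
            const NodeID to = graph.GetTarget(edge);
            if (heap.WasInserted(to) && heap.GetKey(to) + data.weight < weight)
            {
                return true; // a provably shorter path to `node` exists
            }
        }
    }
    return false;
}
```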
This is the backward step of a bidirectional Dijkstra, yes?
Yep, exactly.
So when you say and unpack the path [start, middle] and [middle, target] do you mean we go through the middle_nodes matrix in step 3 (of your list of steps)?
If we want to compute the entry `duration[i][j]` we need to call this function. You might notice the `middle_node` parameter: we need to provide it with the meeting point of the forward and backward search, `middle_nodes[i][j]`, to extract the packed path, which we can then unpack.
but I'm unclear on how to access the unpacked_nodes and unpacked_edges
We first extract the packed path as mentioned above using the `middle` node, then call `unpackPath` like here to get the unpacked path. Finally, we could use the `extractRoute`-like signature to compute the path duration.
So my current plan of action is to look at the annotatePath function and figure out which pieces to pick out from there.
Sounds good. 👍
@TheMarex recording what I understand so far. Okay so:
1) The steps in the first comment are to accomplish the following: not storing `duration` in the graph and calculating it during the search
2) We implement the cache in the `unpackPath` method
3) We then calculate durations after finishing the backward + forward search steps here
4) To calculate the durations table as denoted here:
```cpp
// 1) get the packed path from the above loops
// void retrievePackedPathFromHeap(const SearchEngineData<Algorithm>::QueryHeap &forward_heap,
//                                 const SearchEngineData<Algorithm>::QueryHeap &reverse_heap,
//                                 const NodeID middle_node_id,
//                                 std::vector<NodeID> &packed_path);
// 2) call unpackPath inside this file, in here
// void unpackPath(const DataFacade<Algorithm> &facade,
//                 BidirectionalIterator packed_path_begin,
//                 BidirectionalIterator packed_path_end,
//                 Callback &&callback)
// 3) calculate the duration using the new method that you will write:
// extractDurations(facade, weight, phantom_nodes, unpacked_nodes, unpacked_edges)
```
A question: `retrievePackedPathFromHeap` takes a forward heap and a backward heap; however, the current manyToManySearch() uses one heap for both the backward and forward steps. So can I use retrievePackedPathFromSingleHeap() instead?
Voiced with @TheMarex. Turns out the original approach to calculating durations is 🙅‍♂️ because manyToManySearch() uses one heap for both the backward and forward search. This is a problem because we clear the heap out after one shortest path to a target is found, so we can't go back to the heap after all the paths have been found and pick out the packed_path and unpack it. To still be able to use this method we would need one heap per target, and that's inefficient memory-wise. Actually, I'm not sure that this is the exact reason why ^. I will re-write this once I understand it completely.
Alternative Approach that I'll prototype, maybe @TheMarex comes up with a better solution before it's done though:
1) Still implement a cache in the unpackPath method similarly to the dummy cache PR.
2) Still remove storing the duration on the edge.
3) Do the durations calculation while exploring the search space. This happens in relaxOutgoingEdges(). Instead of getting the duration value from the edge (`const auto edge_duration = data.duration;`), get it from:
This second approach doesn't give us the memory saving from getting rid of the specialized `ManyToManyHeap`s, though.
@ghoshkaj Sorry about the confusion, the original idea works, I just forgot an important "detail":
We need a new function `retrievePackedPathFromSearchSpace` that takes the middle node, the target node and the search space. It iteratively needs to look up the parent of the middle node, the parent of the parent, and so forth, but not using the `ManyToManySearchHeap` but the `search_space_with_buckets`. Given a `middle` node, we can find the parent node by running this code to extract a list of `NodeBucket`s and finding the parent with the smallest weight to `middle`. For the path from `start` to `middle` we run `retrievePackedPathFromSingleHeap` on the forward heap.

Updates
I have implemented the cache and I have two versions of calculating the durations working at the moment for CH. I have taken measurements and here is how that looks right now:
I have run these measurements for the following dataset and variables:
- Compute duration during finding shortest route: 4.5 ms
- Compute duration after finding shortest route and unpacking the path: 8.2 ms
- Compute duration after finding shortest route and unpacking the path, with caching: 7.37 ms
I am also going to collect the following baseline measurements and improvements:
Here are some request time measurements: They are random requests on the top part of California. The numbers in the table are the average over the 20000 requests. The headings in the table indicate how the durations are being computed.
| | during shortest-path search | after search while unpacking path with cache | after search while unpacking path (no cache) |
|---|---|---|---|
| 25 x 25 matrix | 20.76 ms | 78.22 ms | 141.18 ms |
| 100 x 100 matrix | 97.20 ms | 261.97 ms | 1697.10 ms |
There's a 78.22 / 20.76 ≈ 3.8× slowdown with the cache and a 141.18 / 20.76 ≈ 6.8× slowdown without the cache for a 25x25 matrix. There's a 261.97 / 97.20 ≈ 2.7× slowdown with the cache and a 1697.10 / 97.20 ≈ 17.5× slowdown without the cache for a 100x100 matrix.
This means that the worst-case roundtrip request time (even with the cache, supposing we never get repeat requests and therefore have to do the calculations from scratch) is 1697.10 ms, which is about 1.7 seconds for a 100x100 matrix! In the best case the roundtrip time will be about 0.26 s for a 100x100 matrix. For a 25x25 matrix: best case 0.078 seconds, worst case 0.141 seconds.
What the measurements above mean is that we pretty much definitely need a cache. But it would be nice if the worst case scenario (large matrices with no cache) would take less time.
So, with a goal of improving the roundtrip time for the worst case scenario, these are the immediate next steps:
Capturing voice conversation with @TheMarex @oxidase about cache design considerations that need to be addressed before this cache is considered complete.
How to handle different kinds of requests? For example, if there are exclude flags then the durations will be different (e.g. exclude highways).
Data replacement strategy: when there's new data such as new OSM data or traffic data, the cache needs to return the new data and not the old data. The possible solutions here are:
Shared cache vs. thread-local cache: my current implementation is thread-local, since I've been working in proof-of-concept style. Eventually we will have to choose between the two. But how to choose? This depends on the cache size.
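To make the replacement-strategy and thread-local points concrete, here is a rough sketch of a bounded, LRU-evicting cache keyed by (start, target, exclude_index); all names, the entry-count capacity, and the eviction policy are illustrative assumptions, not the final design:

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <tuple>
#include <unordered_map>
#include <utility>

using NodeID = std::uint32_t;
using EdgeDuration = std::int32_t;

// Key: (start, target, exclude_index), as proposed in the issue description.
using CacheKey = std::tuple<NodeID, NodeID, std::size_t>;

struct CacheKeyHash
{
    std::size_t operator()(const CacheKey &k) const
    {
        std::size_t h = std::get<0>(k);
        h = h * 31 + std::get<1>(k);
        return h * 31 + std::get<2>(k);
    }
};

class UnpackingCache
{
    // Front of the list = most recently used; the map points into the list.
    std::list<std::pair<CacheKey, EdgeDuration>> lru;
    std::unordered_map<CacheKey, std::list<std::pair<CacheKey, EdgeDuration>>::iterator, CacheKeyHash> index;
    std::size_t capacity; // in entries; a byte budget would divide by the entry footprint

  public:
    explicit UnpackingCache(std::size_t capacity_) : capacity(capacity_) {}

    bool TryGet(const CacheKey &key, EdgeDuration &out)
    {
        const auto it = index.find(key);
        if (it == index.end())
            return false;
        lru.splice(lru.begin(), lru, it->second); // mark as recently used
        out = it->second->second;
        return true;
    }

    // Callers are expected to check TryGet first; duplicates are not handled.
    void Insert(const CacheKey &key, const EdgeDuration duration)
    {
        if (index.size() >= capacity) // evict the least recently used entry
        {
            index.erase(lru.back().first);
            lru.pop_back();
        }
        lru.emplace_front(key, duration);
        index[key] = lru.begin();
    }
};

// The current prototype is thread-local: one instance per worker thread,
// no locking needed, at the cost of duplicated entries across threads.
thread_local UnpackingCache unpacking_cache{1 << 20};
```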
@ghoshkaj, @TheMarex I have a question about this change. I know there are people who are also interested in distances, not only durations. I read in the comment from 15 Feb that one step in the new processing is to implement a new data structure that will hold the duration between the current node and the next node. Would it be possible to also store the distance in the new data structure, making it possible to calculate the total distance belonging to the shortest duration for all items in the matrix?
For fast table queries we need to be able to quickly determine the duration of a shortcut (in the case of CH) or overlay (in the case of MLD) edge. This is why we save the weight and duration for every CH edge, and have two sets of matrices and an additional value on the base graph for MLD.
All this data is only needed for the table plugin. All other plugins use path unpacking to compute the duration/distance value during the query. Removing this precomputed data could reduce the memory we need considerably: the overall memory consumption for CH could be reduced by 25%, for MLD by 20% (approximately; we might run into problems with bit packing for the forward/backward flags).
To make this work we need to change the many to many algorithm in the following way:
1) Don't save the `duration` in the heap but only the parent node (e.g. here). This means we can remove the special `ManyToManyHeap` from `SearchEngineData`, which also saves us some memory.
2) Store the parent node in the `SearchSpaceWithBuckets` here.
3) Add a matrix that stores the `NodeID` of every middle node for every source/target pair.
4) Set the middle node whenever we update the `weight_table` here.
5) After we have calculated the full weight row here, we run over the row and unpack the paths `start, middle` and `middle, target`, which yields a list of `EdgeBasedNode` ids, for each of which we extract the segment durations and sum them up.

The performance-critical step is the last one: we need to unpack a path for every entry in the matrix and calculate the duration value from the segments of the path. These are a lot of operations that would most likely dominate the running time, which would get us something closer to running `n x m` point-to-point queries. To get back to our original performance characteristics, we would need to implement a cache that yields amortized constant time for computing the duration of `start, target`. While recursively unpacking a path, we can check for each edge if there is an entry in the cache; if not, we compute an entry and insert it. (See the sketch after this paragraph.)

We could implement a thread-local cache that is persistent between queries and based on an `unordered_map` which uses the key `start, target, exclude_index`. This cache just stores the duration of the path, but we could have multiple caches, for example for storing the distance value to enable #1353. This cache needs to be configurable in size, e.g. max `500 MB`, to actually yield any memory reduction over the precomputation of all annotations. As a result, we need to implement a replacement strategy like LRU for the cache.
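To illustrate the cache idea above, here is a sketch of recursive shortcut unpacking with a cache check per edge; the facade accessors, the `shortcut`/`middle` fields, and the pair-keyed cache (the exclude_index component is omitted for brevity) are assumptions in the spirit of the CH data structures, not exact OSRM signatures:

```cpp
#include <cstdint>

using NodeID = std::uint32_t;
using EdgeDuration = std::int32_t;

// Recursively compute the duration of the CH edge (start, target). On a
// cache hit we return immediately; on a miss we expand a shortcut through
// its middle node, sum the two halves, and store the result. For a base
// (non-shortcut) edge the duration is computed from the segment data, which
// is what replaces the precomputed per-edge duration.
template <typename Facade, typename Cache>
EdgeDuration unpackedDuration(const Facade &facade, Cache &cache, const NodeID start, const NodeID target)
{
    EdgeDuration duration;
    if (cache.TryGet({start, target}, duration))
        return duration; // amortized constant time once the cache is warm

    const auto edge = facade.FindSmallestEdge(start, target);
    const auto &data = facade.GetEdgeData(edge);
    if (data.shortcut)
    {
        const NodeID middle = data.middle; // the node the shortcut skips over
        duration = unpackedDuration(facade, cache, start, middle) +
                   unpackedDuration(facade, cache, middle, target);
    }
    else
    {
        duration = facade.GetSegmentDuration(edge); // sum of the base segments
    }
    cache.Insert({start, target}, duration);
    return duration;
}
```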