amillb / pgMapMatch

map-matching of GPS traces
MIT License
74 stars 20 forks source link

Poor performance scaling with national road networks #14

Open Dolzen opened 4 years ago

Dolzen commented 4 years ago

The method performs very poorly when used on state or national level networks. I've tried to improve the performance to levels which might make it usable but I'm not sure if I'm moving in the right direction, modifying line 223 seemed to help.

cmd = '''SELECT %(streetIdCol)s, %(source)s::integer, %(target)s::integer, %(cost)s::real, %(reverse_cost)s::real, %(km)s::real, %(kmh)s::real %(fwayCols)s FROM %(streetsTable)s''' % self.cmdDict

This could be resolved by using a WHERE geom && ST_BUFFER() clause, or using WITH filter_table AS ... at the moment it just seems to be downloading the entire network.

The dijkstra query is a bit more challenging, as it takes a text string and a temporary table cannot be passed in. This also means you can't pass a GeomFromText bounding box or similiar. The WHERE geom && ST_BUFFER() method would only work with qouteless postgres queries. Any ideas? Would you even expect any improvement in performance from sub sampling the dijkstra query?

amillb commented 4 years ago

Yes, the pgrouting query will not scale well if you use an entire state or national network. It's designed to use a city- or regional-scale network.

You are right that using a WHERE clause is likely to help. However, it would likely be most effective if added to line 659, which is the pgr_dijkstra query.

To overcome the issue that the dijkstra query takes a text string, the best approach is to do an initial query to find the bounding box of your GPS trace. Put those in self.cmdDict as xmin, xmax, ymin, ymax. You'll also need to add the SRID.

Then, you can modify the dijkstra query:

pgr_dijkstra('SELECT %(streetIdCol)s, %(source)s, %(target)s, %(cost)s, %(reverse_cost)s FROM %(streetsTable)s 
WHERE ST_DWithin(%(streetGeomCol)s, St_SetSrid(ST_MakeEnvelope(%(xmin)s, %(ymin)s, %(xmax)s, %(ymax)s), %(streets_srid)s, 10000)',

Where 10000 is 10km if your projection is in meters.

Note that ST_DWithin is almost always more efficient than ST_Buffer.

There is a similar query in line 713.

For most use cases, this approach is not likely to be efficient. The time spent in the WHERE query is likely to outweigh the pgrouting speed increase. But that won't be the case if you have such a large street network.

An alternate approach would be to create a view of the relevant portion of your streets table, and feed that into pgMapMatch.

If you figure out a way to make this restriction only apply where it would be helpful, feel free to submit a pull request.

Dolzen commented 4 years ago

Thanks heaps for geting back to me :) I agree a filter should only apply when it actually improves overall performance, when the gains are so small that they are discounted by the extra sql then there isn't much point.

I don't think creating dynamic views per map matching session is consistent with the design principles for postgresql. I will think about other ways around it. I did see improved performance when using filtered views though.

I've implemented the filters for getPts and the dijkstra query using essentially your stated method, however I'm still not satisfied with the performance. I will continue to investigate optimizing the sql queries and will submit a pull request when I'm happy with the performance and have figured out a way to make the filter dynamic.

amillb commented 4 years ago

I've implemented the filters for getPts and the dijkstra query using essentially your stated method, however I'm still not satisfied with the performance.

How long is it taking to match a trace? pgMapMatch is always going to take much longer than simpler algorithms that are less robust to difficult data.

Also take a look at some of the config settings, particularly maxNodes and max_skip.

I will continue to investigate optimizing the sql queries and will submit a pull request when I'm happy with the performance and have figured out a way to make the filter dynamic.

Thank you!

Dolzen commented 4 years ago

I've had a look at my entire environment and how to best improve performance, I overlooked my lack of indexing on the network tables and efficient use of postgres resources.

Since improving my server config I'm getting pretty good performance, however creating a filtered bounding box can result in network islands and the algorithm failing at the pgr_dijkstra many to many call. Setting the filter to be a bounding box with a 1km buffer around it seems prevent islands in most cases.

I'm also exploring using the algorithm for linestrings with no temporal weights, aka other road networks. I've set the temporal values to zero, but I'm unsure if the script actually uses temporal and topological scores. Can you fill me in a bit on how this line is involved in calculating the best match? self.LL = distLLs + [temporalLLarray.mean(), temporalLLarray.min(), topologicalLLarray.mean(), topologicalLLarray.min()]

I have added some functions which allow pgmapmatch to support generic lists of tuples as input, I might have to re-read the paper to make sure I'm correctly using the match scores to get the best matches.

Feel free to check my fork if it's confusing, this is very much a WIP.

amillb commented 4 years ago

I'm glad you figured out the performance issues and the right bounding box.

Can you fill me in a bit on how this line is involved in calculating the best match? self.LL = distLLs + [temporalLLarray.mean(), temporalLLarray.min(), topologicalLLarray.mean(), topologicalLLarray.min()]

This line comes after the best route (i.e., best match) has been chosen. It stores the log likelihoods for use in other parts of the code (in particular, to calculate the match score). It's a list of various summary statistics (e.g. means, minima) that are inputs to the match score function.

If you want to set all the temporal values to zero, then just set self.temporalScores to zero before calling self.viterbi(), which is the function where the temporal and topological scores are used to calculate the best match. Alternatively, you could set temporalLL() to return a zero value or list or array of zeros.

Having said that, if you aren't going to use the temporal information, you might be better off with a simpler algorithm (e.g. graphhopper) that will give you better performance.