GIScience / openrouteservice

🌍 The open source route planner api with plenty of features.
https://openrouteservice.org
GNU General Public License v3.0
1.33k stars 379 forks source link

Concurrency issue with RouteExtraInfo #571

Closed samcrawford closed 4 years ago

samcrawford commented 4 years ago

Here's what I did

I am writing an application that leverages ORS, and one of the things the application does is calculate a number of possible routes for the user to take (typically ~10). Running these in series is quite slow, but running them in parallel across multiple cores is very quick.

Here is a snippet of the code I'm using to perform route calculation in parallel:

ExecutorService WORKER_THREAD_POOL = Executors.newFixedThreadPool(8);
CountDownLatch latch = new CountDownLatch(candidateRoutes.coordinates.size());
for (Coordinate[] candidateCoords : candidateRoutes.coordinates) {
    WORKER_THREAD_POOL.submit(() -> {
        try {
            RoutingRequest req = createRequest(candidateCoords);
            RouteResult res = RoutingProfileManager.getInstance().computeRoute(req);
            scoredRouteResults.add(new ScoredRouteResult(req, res));
            latch.countDown();
        } catch (Exception e) {
            e.printStackTrace();
            Thread.currentThread().interrupt();
        }
    });
}
latch.await();

The createRequest function includes this:

req.setExtraInfo(RouteExtraInfoFlag.getFromString("green"));

In the constructor for ScoredRouteResult we call (vastly simplified code):

// Retrieve the distance for the route
routeResult.getSummary().getDistance()

// Retrieve the green-ness of the route
routeResult.getExtraInfo().get(0).getSummary(
    DistanceUnit.Meters,
    routeResult.getSummary().getDistance(),
    false
);

It's the retrieval of the extra green info (concurrently) that triggers the issues.

Here's what I got

If I run the code in series (i.e. just a straight for-loop instead of the parallelised code pasted above), everything works fine, and is consistently fine, but takes a while (~6 seconds).

If I run the code in parallel using the ExecutorService (using the code above), it completes a lot quicker (~1 second), but I see the following issues:

  1. Sometimes everything appears to work successfully. The .getDistance() call for each route matches between the parallel version and the serial version - so I'm happy route calculation is parallelisable. However, the getExtraInfo(...) call to retrieve the greenness of the route returns erratic values. Summing the green values for the same route produces different totals each time with the parallelised version.

  2. Sometimes an exception is thrown, which can be one of the following two:

    java.lang.NullPointerException
        at heigit.ors.routing.RouteExtraInfo.getSummary(RouteExtraInfo.java:95)
        ...
    java.util.ConcurrentModificationException
        at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
        at java.util.ArrayList$Itr.next(ArrayList.java:859)
        at heigit.ors.routing.RouteExtraInfo.getSummary(RouteExtraInfo.java:93)
        ...

This suggests to me there is some concurrency issue going on with the Extra Info Processor, but after a couple of hours debugging, I can't immediately see where. Hence I'm seeking assistance in tracking down this issue.

Here's what I was expecting

I was expecting the results of the parallel and serial execution to be the same, and I wasn't expecting to receive exceptions.

Here's what I think could be improved

I suspect there is a concurrency issue with the handling of ExtraInfo on routes, and I think we should try to make this threadsafe.

samcrawford commented 4 years ago

I've had no further time to look into the code that's causing this, but I plan to debug it further myself tonight.

To satisfy myself that it wasn't just my faulty code that was causing this, I tried to reproduce this against api.openrouteservice.org with a few requests - and I can confirm the same issue is seen there. Sometimes the extras response element is missing or contains erroneous data.

Steps to reproduce:
for i in `seq 1 3`; do 

curl 'https://api.openrouteservice.org/pdirections?api_key=KEY_HIDDEN&attributes=detourfactor%7Cpercentage&coordinates=9.437256,52.113252%7C10.447998,47.465236&elevation=true&extra_info=steepness%7Cwaytype%7Csurface&geometry=true&geometry_format=geojson&instructions=true&instructions_format=html&language=en-US&preference=fastest&profile=foot-walking&units=m' -H 'Accept: application/json, text/plain, */*' -H 'Referer: https://maps.openrouteservice.org/directions?n1=49.777717&n2=8.728638&n3=7&a=52.113252,9.437256,47.465236,10.447998&b=0&c=0&k1=en-US&k2=km' -H 'Origin: https://maps.openrouteservice.org' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36' -s -o series_${i}.json

done

List the results...

[sksam@dev1 ors]$ ll | head -4
total 1708
-rw-rw-r-- 1 sksam sksam 592692 Jul 31 12:20 series_1.json
-rw-rw-r-- 1 sksam sksam 558167 Jul 31 12:20 series_2.json
-rw-rw-r-- 1 sksam sksam 592692 Jul 31 12:20 series_3.json

Observe that series_2.json is smaller than the others. Presumably this is because my 2nd API request happened to collide with some other real user running a request.

Now observe that the differences between the files is the contents of the extras node:

# Good
[sksam@dev1 ors]$ grep -o -E "extra.{120}" series_1.json
extras":{"surface":{"values":[[0,93,3],[93,95,2],[95,131,4],[131,165,3],[165,167,14],[167,174,1],[174,184,0],[184,186,3],[186

# Bad - presumably an exception was thrown whilst building the extras, so it's empty here
[sksam@dcs-dev ors]$ grep -o -E "extra.{120}" series_2.json
extras":{},"bbox":[9.327165,47.4637,10.452246,52.114111]}],"bbox":[9.327165,47.4637,10.452246,52.114111],"info":{"attribution

If anyone has any thoughts in the meantime feel free to share them. Otherwise I will share a fix when I find it.

samcrawford commented 4 years ago

I've found the cause of this.

'Extra' information is added into responses using ExtraInfoProcessor, which is injected into GraphHopper's normal processing flow in ORSGraphHopper. This injection is done in a bit of a hacky way (as the comments acknowledge) via the TranslationMap.

The issue is that ORSGraphHopper is a singleton, and the TranslationMap hash map is defined at the class level. This is a problem because the following lines are found within calcPaths(...) and operate on a per-request basis (potentially concurrently):

Translation tr = getTranslationMap().getWithFallBack(locale);
...
if(tr instanceof TranslationMap.ORSTranslationHashMapWithExtendedInfo){
    ((TranslationMap.ORSTranslationHashMapWithExtendedInfo) tr).init(encoder, weighting, request.getPathProcessor());
}

So when you have multiple routes being plotted concurrently, you end up calling init() passing in different encoder, weighting and ExtraInfoProcessor instances to a single shared ORSTranslationHashMapWithExtendedInfo instance. This results in data getting very muddled between simultaneous requests!

I have fixed this in my own local instance by changing:

if(tr instanceof TranslationMap.ORSTranslationHashMapWithExtendedInfo){
    ((TranslationMap.ORSTranslationHashMapWithExtendedInfo) tr).init(encoder, weighting, request.getPathProcessor());
}

to:

if(tr instanceof TranslationMap.ORSTranslationHashMapWithExtendedInfo){
    tr = new TranslationMap.ORSTranslationHashMapWithExtendedInfo(tr.getLocale());
    ((TranslationMap.ORSTranslationHashMapWithExtendedInfo) tr).init(encoder, weighting, request.getPathProcessor());
}

I have tested this by running multiple concurrent routes (using the same code I put in my original report), and I now see the same correct values and no exceptions being thrown.

I will raise a PR for this shortly, but I suspect there might be a more elegant fix possible.

sfendrich commented 4 years ago

Thank you very much for reporting this issue and for the thorough investigation! We are currently in the process of upgrading our core to a newer GraphHopper version. As we have restructured the way how the extra-info is passed, we have to re-evaluate how your fix fits in there.

samcrawford commented 4 years ago

No problem! And looking at the development-v6 branch, I agree that it looks like the changes there would make this change redundant (see https://github.com/GIScience/openrouteservice/blob/development-v6/openrouteservice/src/main/java/heigit/ors/routing/graphhopper/extensions/ORSGraphHopper.java)