Am I overloading my local OSRM server? (The OSRM server returned an error: Recv failure: Connection reset by peer)

cds275 commented 4 years ago

I am periodically receiving the following message: "The OSRM server returned an error: Error in function (type, msg, asError = TRUE) : Empty reply from server"

while at other times I receive: "The OSRM server returned an error: Error in function (type, msg, asError = TRUE) : Recv failure: Connection reset by peer"

I believe that I could be getting this from exceeding the HTTP/1.0 request limitation.

Some context: I'm working within a Red Hat Openstack Platform using an instance that runs Ubuntu 20.04 and has 28 cores and 240 GB of RAM. On this instance, I am running the osrm-backend docker image on US OSM data. I am using the Table service to query distance and duration for nearly a billion observations - from within the R environment using the osrm package. All the observations are grouped by the origin coordinate's Census Tract. These groups are placed in separate files.

My current workflow is to:

Run GNU parallel over the Census Tract files (which contain all its respective origin-destination coordinates).
Use the R function mclapply to parallelize my queries by Census Block. I call the OSRM Table service for each Census Block's observations (an average of 1,000 observations per Block).
Combine all Census Blocks within Census Tract and write out the updated file to a mounted directory.

I realize that by the time I actually call the OSRM Table service I have parallelized twice (once for the tract files and once over the blocks). I believe this means that I could be making up to 28*28 = 784 calls at any given moment. According to here the HTTP/1.0 server can handle at maximum 512 requests per connection. At first I thought I needed to increase --max-table-size so I set it to 999999999. No luck. As second idea, I started two more OSRM engines (http://127.0.0.1:5001/ and http://127.0.0.1:5002/) and am having the files go round robin calling a OSRM engine - DISCLAIMER: I am doing this manually and am not currently using the nginx load balancer. Unfortunately, I am still getting the periodic error. I know that I can just keep trying the server until it goes through but it would be nice that have a more elegant solution. How do I know how much each OSRM engine can handle? How should I integrate that with my workflow?

Below are some system and server performance indicators.

top free -m

danpat commented 4 years ago

Couple of notes:

The 512 "requests per connection" limit applies to https://en.wikipedia.org/wiki/Keepalive - one TCP socket is opened to the server, then multiple HTTP request/response can be performed over that socket. It has nothing to do with the --max-table-size, which adjusts the maximum size of a /table query that OSRM will allow you to make before returning a "request too large" error.
1. I don't believe HTTP keepalive is being implemented by your code - https://github.com/rCarto/osrm/blob/master/R/osrmTable.R#L135 only seems to be making a simple HTTP call via RCurl - I do not know whether RCurl pools connections to re-use existing sockets for subsequent requests.

The kind of errors you're getting are likely more operating-system level problems than OSRM issues. My guess is that you're exceeding the operating connection queue by firing off lots of parallel requests faster than OSRM can respond to them. When you do that, some part of the system has to keep track of the backlog of requests (hint: it's the operating system, not OSRM), and eventually buffers fill up, and you start to see things like "connection reset" errors due to the new request queue being full.

The fix is generally to apply some backpressure. What you really want to be doing here is having concurrency, but then have each concurrent worker behave serially - make a request, wait for the response, make another request. Doing this ensures that you have concurrent work, but new requests sit in a queue in your code, where you have control of the work queue.

You should look through your code and make sure the "make request, wait for response, make another request" serialization occurs. What it sounds like is happening is that your code is simply making HTTP requests as fast as it can, hoping the operating system can queue up all the pending TCP connections while OSRM dutifully answers each one as fast as it can. You need to move that queuing into your own code, make sure you only have N requests in-flight at any given time. You can then crank up N until you stop seeing an increase in throughput.

danpat commented 4 years ago

Also I'll note that if your machine has 28 cores - and you're running the client on the same machine, then you should probably think about reducing your parallelism down to closer to the core count - 784 concurrent calls means that each core would be trying to handle 28 concurrent requests. That can lead to a lot of thread context switching, slowing everything down overall, in addition to possibly causing some of the overload symptoms you're seeing. Don't forget that the client code needs processing time as well, so some of the cores will be used for OSRM, some for running your R code.

cds275 commented 4 years ago

Re: " You need to move that queuing into your own code, make sure you only have N requests in-flight at any given time." I see. That makes way more sense. I will give it a try!

Re: "think about reducing your parallelism" Yes, I agree about reducing parallelism. I reduced it to 14 and saw the error rate drop dramatically.

Thanks for the advice.

Project-OSRM / osrm-backend

Am I overloading my local OSRM server? (The OSRM server returned an error: Recv failure: Connection reset by peer) #5775