cmullaparthi / ibrowse

Erlang HTTP client

Ibrowse Timeouts #118

Closed spc16670 closed 10 years ago

spc16670 commented 10 years ago

I have a webservice that, for every HTTP POST it receives, makes three ibrowse HTTP POST requests to other webservices, waits for them to return, processes the responses and then replies.

Everything seems to work fine at first, but after a day or so, especially when the service has to deal with a burst of requests, I start getting many ibrowse timeouts.

We are not talking about huge volumes here. The service processes 50 req/s on average, with peaks of up to 200 req/s. Whenever a peak happens, the timeout issue starts showing up.
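For reference, the per-request fan-out looks roughly like this (a minimal sketch; the URLs, headers and timeout are placeholders, not the production values):

```erlang
%% Sketch of the per-request fan-out: three POSTs in parallel, then
%% collect the three responses. URLs/headers/timeout are placeholders.
fan_out(Body) ->
    Urls = ["https://abc:443/svc", "https://def:443/svc", "https://ghi:443/svc"],
    Parent = self(),
    Pids = [spawn_link(fun() ->
                Res = ibrowse:send_req(Url, [{"Content-Type", "text/xml"}],
                                       post, Body, [], 30000),
                Parent ! {self(), Res}
            end) || Url <- Urls],
    [receive {Pid, Res} -> Res end || Pid <- Pids].
```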

Here is some load balancing information when everything works fine:

Server:port | ETS   | Num conns | LB Pid
abc:443     | 93394 | 45        | <0.795.0>
def3:443    | 90936 | 10        | <0.787.0>
ghi:443     | 92575 | 20        | <0.792.0>

Connections ESTABLISHED: 87, connections TIME_WAIT: 316

And this is when we are experiencing the issue:

Server:port | ETS   | Num conns | LB Pid
abc:443     | 15348 | 182       | <0.23014.250>
def:443     | 90936 | 193       | <0.787.0>
ghi:443     | 16143 | 129       | <0.19076.250>

Connections ESTABLISHED: 551, connections WAITING: 25

The waiting count shows that when the issue occurs, connections reach the ESTABLISHED state and then hang there too long, which eventually produces the timeouts.

I am using 5000 for both max_sessions and max_pipeline_size. Perhaps tweaking those values would help. The only way to get back on track is to restart the webservice.
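The limits are passed per request, along these lines (values as in production; the URL, headers and body are placeholders):

```erlang
%% How the limits are set on each call; 5000 for both, as noted above.
ibrowse:send_req(Url, Headers, post, Body,
                 [{max_sessions, 5000}, {max_pipeline_size, 5000}]).
```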

Is anyone aware of any known limitations of ibrowse or of any timeout-related issues?

cmullaparthi commented 10 years ago

Thanks for the bug report.

I've just finished redesigning the pipelining, as another user of ibrowse had similar issues, and it works much better. I'll send you some patches later today. I'd appreciate it if you could try them and tell me whether they work for you.


spc16670 commented 10 years ago

That would be much appreciated. I am desperate to find a solution, and I hesitate to move to a different HTTP client in case it suffers from the same issue. Unfortunately I will not be able to let you know immediately whether the patch worked, as the issue only reappears some time after a restart. Can you post a link to the issue reported by the other user? I have read some bug reports from the Yokozuna project; do you know if they decided to move away from ibrowse?

cmullaparthi commented 10 years ago

This issue wasn't reported publicly; it was over email. But the fix definitely works, as the timeouts were caused by a bottleneck in the load-balancing process within ibrowse. You should have a fix to try by close of play today.

cmullaparthi commented 10 years ago

Hi there, can you please try out the version in the new_pipeline branch? It passes all tests locally and does well in load tests too.

cmullaparthi commented 10 years ago

And oh, it is backwards compatible from an API point of view, but you will need to restart ibrowse after loading the patches.
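If you run ibrowse as an OTP application, something along these lines should do (a sketch, assuming a plain application restart is acceptable in your setup):

```erlang
%% After loading the new beams, bounce the application so the
%% load-balancer processes are recreated with the new code.
application:stop(ibrowse),
application:start(ibrowse).
```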

spc16670 commented 10 years ago

That is awesome! Thank you for looking into this. I will clone the updated project first thing tomorrow. I will not be able to give feedback immediately, though; I will have to wait until the new release using the updated ibrowse ends up in production. I am hoping this will happen within the next few days.

Could you possibly explain what the issue was? What did you change and what was causing the problem?

And by the way, very good job on the project in general. We have been running your code in production, and until we started getting bigger loads everything ran very smoothly. Thanks again.

cmullaparthi commented 10 years ago

Basically, for every host/port endpoint there is a single ibrowse process (implemented in ibrowse_lb) which regulates max_sessions and max_pipeline_size for that endpoint. It figured out the best connection to use by walking the entire ETS table. For a large number of connections this can be slow, and the process built up message queues.

A small tweak means that all it has to do now is one ETS lookup to figure out the best connection. Another user has stress-tested this under pretty heavy conditions and it worked well, so I'm pretty confident it will work for you too. I'm quite excited by this; it always nagged me that there was a problem here, and I finally had a reason to fix it :-)
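To illustrate the idea, here is a sketch of the technique (not the actual ibrowse_lb code; the module name and helpers are illustrative): keep an ordered_set keyed on {PipelineDepth, ConnPid}, so the least-loaded connection is always ets:first/1.

```erlang
-module(lb_sketch).
-export([new/0, add_conn/2, checkout/1, checkin/2]).

%% ordered_set keyed on {Depth, ConnPid}: ets:first/1 returns the
%% connection with the smallest pipeline, so no table walk is needed.
new() ->
    ets:new(conns, [ordered_set, public]).

add_conn(Tab, ConnPid) ->
    true = ets:insert(Tab, {{0, ConnPid}}).

%% Pick the least-loaded connection and bump its pipeline depth.
checkout(Tab) ->
    case ets:first(Tab) of
        '$end_of_table' ->
            {error, no_connections};
        {Depth, ConnPid} = Key ->
            true = ets:delete(Tab, Key),
            true = ets:insert(Tab, {{Depth + 1, ConnPid}}),
            {ok, ConnPid}
    end.

%% Called when a response completes; a real implementation would track
%% the current depth per connection instead of matching for it.
checkin(Tab, ConnPid) ->
    case ets:match(Tab, {{'$1', ConnPid}}) of
        [[Depth]] ->
            true = ets:delete(Tab, {Depth, ConnPid}),
            true = ets:insert(Tab, {{Depth - 1, ConnPid}});
        [] ->
            ok
    end.
```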

Glad you like ibrowse.

spc16670 commented 10 years ago

Been doing some testing. This is the error I am getting now:

{error,{'EXIT',{{{badmatch,{error,{already_started,<0.2157.0>}}},[{ibrowse,do_get_connection,2,[{file,"src/ibrowse.erl"},{line,919}]},{ibrowse,handle_call,3,[{file,"src/ibrowse.erl"},{line,809}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,580}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_server,call,[<0.2220.0>,{send_req,{{url...

Some requests work fine; for others I am getting the above. I am not 100% sure yet whether it is the library or my code. Can you advise?

cmullaparthi commented 10 years ago

Just pushed a fix. Can you please try again?


spc16670 commented 10 years ago

I have not done any proper testing yet, but from what I have seen so far it is looking promising. I will give you an update in the next few days. Thank you for providing the fix so fast.

spc16670 commented 10 years ago

BTW, can you advise what would be the best way of getting information about the pipeline sizes for each connection to a domain, just to see how the load balancing is doing?

cmullaparthi commented 10 years ago

ibrowse:get_metrics(Host::string(), Port::integer()).
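For example, for one of the endpoints you listed:

```erlang
%% Returns per-endpoint load-balancer metrics; see the printouts
%% below for what the output looks like in practice.
ibrowse:get_metrics("abc", 443).
```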


spc16670 commented 10 years ago

I am still testing the patch. In the meantime, I thought I would share the printout of get_metrics(), which confirms the issue:

[{"abc",443,<0.8187.1001>,71192527543,62},
 {"def",443,<0.787.0>,909365,10},
 {"ghi",443,<0.9249.1001>,71192347294,32}]

cmullaparthi commented 10 years ago

Thanks. Just to clarify, I'm assuming that this is prior to applying the patch?


spc16670 commented 10 years ago

Yes. The get_metrics function of the patched version prints:

[{<0.854.0>,{message_queue_len,0},6946860,2,{{0,0},{0,0}}}]

Please note this is a test environment.

spc16670 commented 10 years ago

This is still not working. I am getting something like:

{gen_server,call,[<0.920.0>,{spawn_connection,{url,"https://abc","abc",443,undefined,undefined,"/abc",https,hostname},5000,100,{[],true},[]}]}]

Then ibrowse is unresponsive.

cmullaparthi commented 10 years ago

Yes, this has been fixed in a local release. Will push these changes later tonight.

cmullaparthi commented 10 years ago

Can you try the latest version in the new_pipeline branch please?

spc16670 commented 10 years ago

Thank you for responding to queries so fast. I will not have a chance to continue with testing any time soon. If I do, you will be the first to know.

cmullaparthi commented 10 years ago

Okay, no problem. I'll be closing this issue for now. Please reopen if it still doesn't work for you when you do try it again.