essiene / smpp34

An smpp34 library in Erlang. Built on top of smpp34pdu PDU parsing library

Crash During Sending #4

Closed chakhedik closed 13 years ago

chakhedik commented 13 years ago

Hi,

Today I tried to send 5k SMS, but it crashed after a little over 3k with the following error:

[error] 2011-04-06 15:42:41.698
* Generic server <0.120.0> terminating
* Last message in was {<0.121.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}
* When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,#Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,<0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
* Reason for termination ==
* {normal,{gen_server,call, [<0.117.0>, {deliver,<0.120.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}]}}

* Reason for termination ==
* {{normal, {gen_server,call, [<0.117.0>, {deliver,<0.120.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}]}}, {gen_server,call, [<0.120.0>, {<0.121.0>, {pdu,26,2147483652,0,71038063,{submit_sm_resp,"600931801"}}}]}}

[error] 2011-04-06 15:42:41.708 Error in process <0.128.0> on node 'bulk@192.168.1.110' with exit value: {badarg,[{gen_esme34,transmit_pdu,5},{sms,chilledpush,10}]}

Any idea?

chakhedik commented 13 years ago

Never mind, this happens because gen_server:call is a normal blocking call. I just put a queue, fed through gen_server:cast, in front of it.
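For readers following along, a workaround of this kind might look roughly like the sketch below. It is only an illustration and not the code actually used in this thread: the module name send_queue, the functions start/1, enqueue/2 and drain/1, and the SendFun argument are all made up here. Callers hand messages over with a non-blocking cast, and the queue process performs the blocking send one message at a time.

```erlang
-module(send_queue).
-behaviour(gen_server).

-export([start/1, enqueue/2, drain/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% SendFun is a hypothetical fun/1 that performs the real (blocking) send,
%% for example a wrapper around echo_esme:sendsms/3.
start(SendFun) ->
    gen_server:start(?MODULE, SendFun, []).

%% Non-blocking: the caller returns immediately.
enqueue(Pid, Msg) ->
    gen_server:cast(Pid, {send, Msg}).

%% Blocks until everything enqueued before this call has been sent,
%% then returns the number of messages sent so far.
drain(Pid) ->
    gen_server:call(Pid, drain, infinity).

init(SendFun) ->
    {ok, {SendFun, 0}}.

%% The process mailbox acts as the queue; sends happen one at a time.
handle_cast({send, Msg}, {SendFun, Sent}) ->
    _ = SendFun(Msg),
    {noreply, {SendFun, Sent + 1}}.

handle_call(drain, _From, {_SendFun, Sent} = State) ->
    {reply, Sent, State}.

%% Enqueue 5000 dummy messages and wait for them to be processed.
demo() ->
    {ok, Pid} = start(fun(_Msg) -> ok end),
    lists:foreach(fun(N) -> enqueue(Pid, N) end, lists:seq(1, 5000)),
    Sent = drain(Pid),
    ok = gen_server:stop(Pid),
    Sent.
```

Nothing here bounds the queue process's mailbox, so a fast producer can still enqueue messages faster than SendFun can drain them.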

essiene commented 13 years ago

On Wed, Apr 6, 2011 at 8:53 AM, chakhedik reply@reply.github.com wrote:

[error] 2011-04-06 15:42:41.698
* Generic server <0.120.0> terminating
* Last message in was {<0.121.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}
* When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,#Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,<0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
* Reason for termination ==
* {normal,{gen_server,call, [<0.117.0>, {deliver,<0.120.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}]}}

Looks like a submit_sm had just succeeded here...

* Reason for termination ==
* {{normal, {gen_server,call, [<0.117.0>, {deliver,<0.120.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}]}}, {gen_server,call, [<0.120.0>, {<0.121.0>, {pdu,26,2147483652,0,71038063,{submit_sm_resp,"600931801"}}}]}}

[error] 2011-04-06 15:42:41.708 Error in process <0.128.0> on node 'bulk@192.168.1.110' with exit value: {badarg,[{gen_esme34,transmit_pdu,5},{sms,chilledpush,10}]}

Looks like there's a badarg somewhere in there... do you have a small snippet of what you're trying to do?

chakhedik commented 13 years ago

I just call echo_esme:sendsms/3 in a loop for the 5k SMS. Something like:

    %% Send Msg from Src to every destination in the list.
    do([H|T], Src, Msg) ->
        echo_esme:sendsms(Src, H, Msg),
        do(T, Src, Msg);
    do([], _Src, _Msg) ->
        ok.

Then I modified handle_rx to differentiate the responses:

    handle_rx(P, St) ->
        PDU = tuple_to_list(P),
        %% Element 6 of the pdu tuple is the body, element 5 the sequence number.
        case lists:nth(6, PDU) of
            {submit_sm_resp, MessageId} ->
                SequenceNumber = lists:nth(5, PDU),
                smppque:update_outbox(integer_to_list(SequenceNumber), MessageId);
            _ ->
                log4erl:log(info, "RX --> ~p", [P])
        end,
        {noreply, St}.

By the way, there's another error during that time in echo_esme.log :

gen_esme34: Terminating with reason: {timeout, {gen_server,call, [<0.117.0>, {tx,0,74541127, {submit_sm,[],0,0,

and

gen_esme34: Terminating with reason: {function_clause, [{bulk_esme,handle_tx, [{'EXIT', {timeout, {gen_server,call, [<0.118.0>, {send,0,45693033, {submit_sm,[],0,0,

I think that badarg came from gen_esme34:transmit_pdu/5 (the one newly added in dev)...

Or maybe I need to check is_tuple(P) before calling tuple_to_list(P)...

chakhedik commented 13 years ago

OK, now I think the main problem is the call timeout. The badarg appeared only when another request came in after gen_esme34 had already terminated. Why don't you use gen_server cast instead of gen_server call? At least there would be no timeout issue when there are too many requests. It was meant to be asynchronous, right? Just a thought...

How about submit_multi? :-)

essiene commented 13 years ago

On Thu, Apr 7, 2011 at 7:02 AM, chakhedik reply@reply.github.com wrote:

Ahh I see.

Actually, I initially made it a gen_server cast, then during load testing, I noticed that gen_esme34's mailbox could easily get filled up if the throughput to the SMSC wasn't high enough and then gen_esme34 would slow to a crawl and crash. Basically, gen_esme34 was receiving messages faster than it was pushing out to the network.

The alternate design is supposed to apply some kind of flow control so that transmission throughput is limited by how fast an actual transmit happens on the network, and gen_esme34 does not blindly fill up its mailbox and then die unceremoniously.

What I actually need to do is handle that condition as a real system limit. Basically, I'm thinking I should allow the timeout for transmit_pdu to be configurable, and when an actual timeout occurs, the system returns '{error, etoobusy}'. That way, the caller knows to back off a bit and then try again when things have calmed down.
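As a rough illustration of that idea (not the eventual smpp34 implementation), a call timeout can be caught and turned into a return value. In the sketch below the module name busy_call and the function call/3 are hypothetical, and the tiny slow server exists only so the demo can run on its own.

```erlang
-module(busy_call).
-behaviour(gen_server).

-export([call/3, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Wrap gen_server:call/3 so that a timeout becomes {error, etoobusy}
%% instead of an exit in the calling process. Timeout is configurable.
call(Server, Request, Timeout) ->
    try
        gen_server:call(Server, Request, Timeout)
    catch
        exit:{timeout, _} ->
            {error, etoobusy}
    end.

%% Minimal stand-in server for the demo: it sleeps for the requested time.
init([]) ->
    {ok, nostate}.

handle_call({sleep, Ms}, _From, State) ->
    timer:sleep(Ms),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% One call that answers in time and one that exceeds its timeout.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    Fast = call(Pid, {sleep, 0}, 1000),
    Slow = call(Pid, {sleep, 300}, 50),
    ok = gen_server:stop(Pid),
    {Fast, Slow}.
```

A caller that gets {error, etoobusy} can sleep and retry. Note that the server still finishes the timed-out request, so a retry may result in the same PDU being sent twice.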

btw, are you trying to build a full fledged SMPP gateway? Have a look at http://github.com/essiene/mmyn

chakhedik commented 13 years ago

-I'm thinking I should allow the timeout for transmit_pdu to be configurable, and when an actual timeout occurs, the system returns '{error, etoobusy}'

That sounds good. Can't wait :-)

P.S. I'm using smpp34 from the dev branch because I need transmit_pdu/5.

-btw, are you trying to build a full fledged SMPP gateway? Have a look at http://github.com/essiene/mmyn

Interesting. A little how-to would help me understand it better :-)

essiene commented 13 years ago

On Thu, Apr 7, 2011 at 11:27 AM, chakhedik reply@reply.github.com wrote:

;)

Will do.. will do... will do... Now that someone else is actually trying to use it apart from me deploying it, this is now top priority. I'll put up some preliminary docs to help get started and then expand them from there.

essiene commented 13 years ago

On Thu, Apr 7, 2011 at 12:06 PM, Essien Essien essiene@gmail.com wrote:

Been busy coding and load-testing ;)

What I've done is introduce different synchronous (transmit_pdu/2,3) and asynchronous (async_transmit_pdu/2,3) apis for sending PDUs. I've pushed them to dev branch now, along with some other changes even to the underlying smpp34pdu parsing library.

  1. The synchronous api is still limited in throughput by the actual transmission on the wire.
  2. I noticed that if you turn off the error_logger, the entire system stays up for longer periods when pushing crazy traffic into it. This is of course because error_logger's mailbox grows faster than it can write out to file. Resources hogging up the system then become a problem; I have a plan to deal with that using the os_mon memsup and cpusup applications or something similar.
  3. The cool new api is the asynchronous api. It introduces a new gen_smpp34 option 'max_async_transmits' (the default value is infinity for backward compatibility). When using the asynchronous api, gen_smpp34 tracks how many actual PDUs have been successfully sent off to the network. If the client is very fast and the unsent PDUs build up to the value of max_async_transmits, then all other new transmits are not attempted, but are rather sent as warnings to handle_tx/3.

My only problem is that for as long as the system is overloaded, it sends a warning for "each" attempted transmission. I'm thinking instead that it should just send "one" warning, and then, when the load finally falls back to a reasonable level, send an "ok" message. But then handle_tx/3 may not be the proper place to be sending these messages, and I'm wary about introducing yet another callback to gen_smpp34.

If you have the time, can you play around with this api and tell me your gut feeling? I'm leaning more towards introducing a handle_overload/2 and handle_overload_recover/2 which will get messages like:

handle_overload(cpu, St) -> {noreply, St};
handle_overload(memory, St) -> {noreply, St};
handle_overload(transmit_overflow, St) -> {noreply, St};

etc.
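To make the "one warning, then one ok" idea concrete, here is a small sketch of how the state tracking could work. It is just one possible shape, not what gen_smpp34 does: the module name overload_gate, the High and Low watermarks, and the event atoms overload, recovered and none are all assumptions made for illustration.

```erlang
-module(overload_gate).

-export([new/2, update/2, demo/0]).

%% High is the pending-PDU count at which we declare overload (this could be
%% max_async_transmits); Low is the count at which we declare recovery.
new(High, Low) when Low < High ->
    {normal, High, Low}.

%% update(Pending, Gate) -> {Event, NewGate}
%% Event is overload exactly once when Pending reaches High, recovered
%% exactly once when it drops back to Low, and none otherwise.
update(Pending, {normal, High, Low}) when Pending >= High ->
    {overload, {overloaded, High, Low}};
update(Pending, {overloaded, High, Low}) when Pending =< Low ->
    {recovered, {normal, High, Low}};
update(_Pending, Gate) ->
    {none, Gate}.

%% Feed a series of pending counts through the gate and collect the events.
demo() ->
    Counts = [10, 100, 150, 120, 20, 5, 100],
    {Events, _Gate} = lists:mapfoldl(fun update/2, new(100, 20), Counts),
    Events.
```

Only the overload and recovered events would be forwarded to a callback such as handle_overload/2 or handle_overload_recover/2. Using two separate watermarks keeps the callbacks from flapping when the pending count hovers around a single limit.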

I've started work on the preliminary mmyn docs; I should have something palatable by Wednesday :)

essiene commented 13 years ago

On Wed, Apr 6, 2011 at 11:21 AM, Essien Essien essiene@gmail.com wrote:

On Wed, Apr 6, 2011 at 8:53 AM, chakhedik reply@reply.github.com wrote:

[error] 2011-04-06 15:42:41.698
* Generic server <0.120.0> terminating
* Last message in was {<0.121.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}
* When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,#Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,<0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
* Reason for termination ==
* {normal,{gen_server,call, [<0.117.0>, {deliver,<0.120.0>, {pdu,26,2147483652,0,71038063, {submit_sm_resp,"600931801"}}}]}}

I've consistently gotten smpp34_rx crashing when memory runs out. This happens when another rogue system is consuming memory and not releasing it fast enough like the error_logger in the examples. For this case, I have pushed a new branch where I'm testing a new idea:

  1. Catching all timeouts and returning an {error, timeout} tuple all the way back to the tcprx module.
  2. When timeouts occur, I suspend network receive and back off (for now a hard-coded time of 5 seconds is set; this will be configurable before I merge this into dev and eventually into master).
  3. After the backoff, the receiver keeps trying until it stops getting the timeout error, then it resumes full network receive.

This work is happening on the 'throttled_network_rx' branch.
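The retry-with-backoff loop in those three steps can be sketched roughly as follows. This is a simplified illustration, not the code on that branch: the module name rx_backoff, the function deliver/3, and the Deliver fun are hypothetical, and the real receiver also has to suspend reading from the socket, which is left out here.

```erlang
-module(rx_backoff).

-export([deliver/3, demo/0]).

%% Deliver is a hypothetical fun/1 that hands a PDU upstream and returns
%% either {error, timeout} or any other result. On timeout we pause for
%% BackoffMs milliseconds and try the same PDU again.
deliver(Deliver, Pdu, BackoffMs) ->
    case Deliver(Pdu) of
        {error, timeout} ->
            timer:sleep(BackoffMs),
            deliver(Deliver, Pdu, BackoffMs);
        Other ->
            Other
    end.

%% Demo with a consumer that times out twice before accepting the PDU.
%% The process dictionary is used only to count attempts in this demo.
demo() ->
    put(attempts, 0),
    Flaky = fun(_Pdu) ->
                    N = get(attempts) + 1,
                    put(attempts, N),
                    if
                        N < 3 -> {error, timeout};
                        true -> ok
                    end
            end,
    Result = deliver(Flaky, fake_pdu, 10),
    {Result, erase(attempts)}.
```

The demo uses a 10 ms pause so it finishes quickly; the 5 seconds mentioned above would simply be passed as BackoffMs.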