If the publisher connection flaps, nginx rtmp needs a reboot to work well

liluxdev commented 12 years ago

How to try to reproduce:

Publish to the stream with Adobe Flash Live Media Encoder with the default auto-reconnect mode enabled
Turn off for 10 seconds the wifi or eth of the encoder machine in order to simulate a little network flap
Issue comes here: Adobe FLME will reconnect to nginx rtmp, but no data packet will be streamed (accepted by nginx) anymore until a nginx reboot.

Please try because maybe is just a problem of mine not a bug, but i need help to understand what happens to try to fix it.

Description of my case:

I'm sending audio stream to nginx rtmp via Adobe Flash Media Live Encoder situed in a machine with DLS connection. The DLS connection is a little unstable these days: it goes down for few seconds two times a day, but i realized that when a publisher connection in not properly closed my nginx rtmp instance becomes unstable, and useless: i've to reboot nginx! Adoble Flash Media Live Encoder succesfull reconnects to nginx rtmp, and it says that is going but no data packet is streamed anymore: i see the encoder connected in stats.xls I have a

<publishing/>

like the working situation, but

<avsync>
number
</avsync>

remains at the last numeric value before the flap until a reboot (of course also real playback don't works anymore) and nginx debug log says following:


2012/09/29 18:23:51 [debug] 19350#0: *88 RTMP recv audio (8) csid=4 timestamp=487506 mlen=262 msid=1 nbufs=1
2012/09/29 18:23:51 [debug] 19350#0: *88 nhandlers: 4
2012/09/29 18:23:51 [debug] 19350#0: *88 calling handler 0
2012/09/29 18:23:51 [debug] 19350#0: *88 record: day frame: mlen=262
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 25, 00007FFFF77A2FA0, 11, 985856
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 25, 0000000000D17974, 262, 985867
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 25, 00007FFFF77A2FA0, 4, 986129
2012/09/29 18:23:51 [debug] 19350#0: *88 record: hour frame: mlen=262
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 26, 00007FFFF77A2FA0, 11, 985856
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 26, 0000000000D17974, 262, 985867
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 26, 00007FFFF77A2FA0, 4, 986129
2012/09/29 18:23:51 [debug] 19350#0: *88 record: five frame: mlen=262
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 27, 00007FFFF77A2FA0, 11, 985856
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 27, 0000000000D17974, 262, 985867
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 27, 00007FFFF77A2FA0, 4, 986129
2012/09/29 18:23:51 [debug] 19350#0: *88 record: min frame: mlen=262
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 11, 00007FFFF77A2FA0, 11, 349587
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 11, 0000000000D17974, 262, 349598
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 11, 00007FFFF77A2FA0, 4, 349860
2012/09/29 18:23:51 [debug] 19350#0: *88 record: default frame: mlen=262
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 29, 00007FFFF77A2FA0, 11, 985856
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 29, 0000000000D17974, 262, 985867
2012/09/29 18:23:51 [debug] 19350#0: *88 write: 29, 00007FFFF77A2FA0, 4, 986129
2012/09/29 18:23:51 [debug] 19350#0: *88 calling handler 1
2012/09/29 18:23:51 [debug] 19350#0: *88 calling handler 2
2012/09/29 18:23:51 [debug] 19350#0: *88 calling handler 3
2012/09/29 18:23:51 [debug] 19350#0: *88 recv: fd:18 -1 of 280
2012/09/29 18:23:51 [debug] 19350#0: *88 recv() not ready (11: Resource temporarily unavailable)
2012/09/29 18:23:51 [debug] 19350#0: timer delta: 5
2012/09/29 18:23:51 [debug] 19350#0: posted events 0000000000000000
2012/09/29 18:23:51 [debug] 19350#0: worker cycle
2012/09/29 18:23:51 [debug] 19350#0: epoll timer: -1
```sh

liluxdev commented 12 years ago

P.S. but also when all is working well I get this debug message log about 1 time at second:

2012/09/29 18:55:53 [debug] 20356#0: *2 recv() not ready (11: Resource temporarily unavailable)

this is the "all working well" case log

2012/09/29 18:55:53 [debug] 20356#0: *2 RTMP recv audio (8) csid=4 timestamp=141                                                                                        3492 mlen=262 msid=1 nbufs=1
2012/09/29 18:55:53 [debug] 20356#0: *2 nhandlers: 4
2012/09/29 18:55:53 [debug] 20356#0: *2 calling handler 0
2012/09/29 18:55:53 [debug] 20356#0: *2 record: day frame: mlen=262
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 13, 00007FFF6AAE1FA0, 11, 1456938                                                                                        2
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 13, 0000000000D108F1, 262, 145693                                                                                        93
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 13, 00007FFF6AAE1FA0, 4, 14569655
2012/09/29 18:55:53 [debug] 20356#0: *2 record: hour frame: mlen=262
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 14, 00007FFF6AAE1FA0, 11, 1456938                                                                                        2
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 14, 0000000000D108F1, 262, 145693                                                                                        93
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 14, 00007FFF6AAE1FA0, 4, 14569655
2012/09/29 18:55:53 [debug] 20356#0: *2 record: five frame: mlen=262
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 3, 00007FFF6AAE1FA0, 11, 1841232
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 3, 0000000000D108F1, 262, 1841243
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 3, 00007FFF6AAE1FA0, 4, 1841505
2012/09/29 18:55:53 [debug] 20356#0: *2 record: min frame: mlen=262
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 15, 00007FFF6AAE1FA0, 11, 552628
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 15, 0000000000D108F1, 262, 552639
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 15, 00007FFF6AAE1FA0, 4, 552901
2012/09/29 18:55:53 [debug] 20356#0: *2 record: default frame: mlen=262
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 17, 00007FFF6AAE1FA0, 11, 1456938                                                                                        2
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 17, 0000000000D108F1, 262, 145693                                                                                        93
2012/09/29 18:55:53 [debug] 20356#0: *2 write: 17, 00007FFF6AAE1FA0, 4, 14569655
2012/09/29 18:55:53 [debug] 20356#0: *2 calling handler 1
2012/09/29 18:55:53 [debug] 20356#0: *2 live: av: audio timestamp=1413492
2012/09/29 18:55:53 [debug] 20356#0: *2 RTMP prep audio (8) fmt=1 csid=6 timesta                                                                                        mp=26 mlen=262 msid=1 nbufs=1
2012/09/29 18:55:53 [debug] 20356#0: *596 RTMP send nmsg=1, priority=1 #42
2012/09/29 18:55:53 [debug] 20356#0: *596 send: fd:20 270 of 270
2012/09/29 18:55:53 [debug] 20356#0: posted event 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: *264 RTMP send nmsg=1, priority=1 #9
2012/09/29 18:55:53 [debug] 20356#0: *264 send: fd:19 270 of 270
2012/09/29 18:55:53 [debug] 20356#0: posted event 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: *253 RTMP send nmsg=1, priority=1 #217
2012/09/29 18:55:53 [debug] 20356#0: *253 send: fd:12 270 of 270
2012/09/29 18:55:53 [debug] 20356#0: posted event 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: *135 RTMP send nmsg=1, priority=1 #50
2012/09/29 18:55:53 [debug] 20356#0: *135 send: fd:16 270 of 270
2012/09/29 18:55:53 [debug] 20356#0: posted event 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: *59 RTMP send nmsg=1, priority=1 #208
2012/09/29 18:55:53 [debug] 20356#0: *59 send: fd:18 270 of 270
2012/09/29 18:55:53 [debug] 20356#0: posted event 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: *2 calling handler 2
2012/09/29 18:55:53 [debug] 20356#0: *2 calling handler 3
2012/09/29 18:55:53 [debug] 20356#0: *2 recv: fd:11 -1 of 280
2012/09/29 18:55:53 [debug] 20356#0: *2 recv() not ready (11: Resource temporari                                                                                        ly unavailable)
2012/09/29 18:55:53 [debug] 20356#0: timer delta: 4
2012/09/29 18:55:53 [debug] 20356#0: posted events 0000000000000000
2012/09/29 18:55:53 [debug] 20356#0: worker cycle
2012/09/29 18:55:53 [debug] 20356#0: epoll timer: -1

liluxdev commented 12 years ago

PS. to reproduce is important to wait the FMLE timeout, so try to keep network down on the encoder for about 30 seconds not 10 as i said before

kghost commented 12 years ago

have you tested with ping/ping timeout set on the server side ?

liluxdev commented 12 years ago

@kghost how to set this?

liluxdev commented 12 years ago

my nginx-rtmp is become completley unusable! now no one can push also across a reboot... and yesterday i launched it in production :(

liluxdev commented 12 years ago

Current log is:

2012/10/02 15:16:14 [debug] 18040#0: *58 AMF cmd 'receiveaudio' no handler
2012/10/02 15:16:14 [debug] 18040#0: *58 reusing formerly read data: 11
2012/10/02 15:16:14 [debug] 18040#0: *58 RTMP bheader fmt=3 csid=2
2012/10/02 15:16:14 [debug] 18040#0: *58 RTMP mheader fmt=3 user (4) time=16309451+0 mlen=10 len=0 msid=0
2012/10/02 15:16:14 [debug] 18040#0: *58 RTMP recv user (4) csid=2 timestamp=16309451 mlen=10 msid=0 nbufs=1
2012/10/02 15:16:14 [debug] 18040#0: *58 nhandlers: 1
2012/10/02 15:16:14 [debug] 18040#0: *58 calling handler 0
2012/10/02 15:16:14 [debug] 18040#0: *58 RTMP recv user evt set_buflen (3)
2012/10/02 15:16:14 [debug] 18040#0: *58 msid=1 buflen: 3000 (msec)
2012/10/02 15:16:14 [debug] 18040#0: *58 recv: fd:13 -1 of 146
2012/10/02 15:16:14 [debug] 18040#0: *58 recv() not ready (11: Resource temporarily unavailable)
2012/10/02 15:16:14 [debug] 18040#0: timer delta: 498
2012/10/02 15:16:14 [debug] 18040#0: posted events 0000000000000000
2012/10/02 15:16:14 [debug] 18040#0: worker cycle
2012/10/02 15:16:14 [debug] 18040#0: epoll timer: -1

PLEASE HELP ME I DON'T TOUCHED THE CONF BUT SINCE 2 HOURS AGO ALSO IF I REBOOT EVERYTHING THE STREAMING SERVER DOESN'T WORKS, I CANT UNDERSTAND WHAT'S HAPPENING

**UPDATE: it' wasnt realated to this bug: it's related to wrong on_publish url

arut commented 12 years ago

Sorry for late answer. You should use ping or tcp keepalive.

Ping example (within server{} block)

server {
    ...
    ping 10s;
    ...
}

10 seconds can be changed to any other value of your choise.

keepalive works the same way as nginx/http bit I think it's not needed.

When network gets disconnected the way you have described tcp connection usually hangs for a long time until keepalive/retransmission timeout expires. RTMP ping fixes that.

liluxdev commented 12 years ago

Thanks, very useful i set it now to 5s or you think this is too short?

PS. i solved my last problem : today I migrated the domain of my customer site to a production one, so i had:

old: beta.radioabc.domain new: radioabc.domain

redirected with a "moved permanetly" http redirect!

but i had in nginx rtmp:

on_publish http://beta.radioabc.domain/pub.php on_play http://beta.radioabc.domain/play.php

and the bad http response (redirect, moved) code blocks everyting from starting up, so first i had blocked silenly new "plays" and when i rebooted desperatley i had blocked the entire publishing

But the important is that now i understood :)

arut commented 12 years ago

I'm using 1 minute ping. 5 seconds is too low value

SkeLLLa commented 12 years ago

Hello. I've got similar issue with Flash Live Encoder. I think it happens when network on encoder goes down for a few seconds. After that FLE tryes to reconnect, but fails. Here is part of FLE logs:

Fri Nov 16 2012 10:05:54 : Primary - More than 50 % of the specified threshold values (Max buffer size 512000 KB or Max buffer duration 3600 sec) of RTMP buffer reached. Current network conditions are unfavorable to publish data at the specified bitrate. RTMP buffer will be flushed and streams will be re-published when the threshold value is reached. 
Fri Nov 16 2012 10:17:54 : Primary - More than 70 % of the specified threshold values (Max buffer size 512000 KB or Max buffer duration 3600 sec) of RTMP buffer reached. Current network conditions are unfavorable to publish data at the specified bitrate. RTMP buffer will be flushed and streams will be re-published when the threshold value is reached. 
Fri Nov 16 2012 10:29:54 : Primary - More than 90 % of the specified threshold values (Max buffer size 512000 KB or Max buffer duration 3600 sec) of RTMP buffer reached. Current network conditions are unfavorable to publish data at the specified bitrate. RTMP buffer will be flushed and streams will be re-published when the threshold value is reached. 
Fri Nov 16 2012 10:35:54 : Primary - RTMP buffer exceeded the specified threshold values (Max buffer size 512000 KB or Max buffer duration 3600 sec). Flushing the RTMP buffer  and re-publishing the streams.
Fri Nov 16 2012 10:35:59 : Primary - Network Status: NetConnection.Connect.Closed status
Fri Nov 16 2012 10:35:59 : Primary - Disconnected
Fri Nov 16 2012 10:36:00 : Primary - Connected to FMS/3,0,1,123
Fri Nov 16 2012 10:36:00 : Primary - Stream[0] Status: Success
Fri Nov 16 2012 10:36:01 : Primary - Stream[0] Status: NetStream.Publish.Start
Fri Nov 16 2012 10:36:01 : Primary - Network Status: NetConnection.Connect.Closed status
Fri Nov 16 2012 10:36:07 : Primary - Disconnected
Fri Nov 16 2012 10:36:09 : Primary - Re-establishing connection, attempt 1
Fri Nov 16 2012 10:36:09 : Primary - Reconnected to FMS/3,0,1,123
Fri Nov 16 2012 10:36:09 : Primary - Stream[0] Status: Success
Fri Nov 16 2012 10:36:09 : Primary - Stream[0] Status: NetStream.Publish.Start
Fri Nov 16 2012 10:36:12 : Primary - Network Status: NetConnection.Connect.Closed status
Fri Nov 16 2012 10:36:18 : Primary - Disconnected

And this part repeats untill manual stop and start FLE.

Fri Nov 16 2012 10:36:09 : Primary - Re-establishing connection, attempt 1
Fri Nov 16 2012 10:36:09 : Primary - Reconnected to FMS/3,0,1,123
Fri Nov 16 2012 10:36:09 : Primary - Stream[0] Status: Success
Fri Nov 16 2012 10:36:09 : Primary - Stream[0] Status: NetStream.Publish.Start
Fri Nov 16 2012 10:36:12 : Primary - Network Status: NetConnection.Connect.Closed status
Fri Nov 16 2012 10:36:18 : Primary - Disconnected

Only after manual restart nginx starts receiving stream, Nginx version 1.3.7, rtmp module git version from 15.11.2012. Also tried to set ping values from 10s to 3m, nothing helped.

liluxdev commented 12 years ago

Exactley same log in my case! My problem wasn't solved with ping directive: i've to do a manual restart of FLME (Manual restart of nginx doesn't solve anything in my case, so now i think is a FLME bug)

SkeLLLa commented 12 years ago

Maybe it's FMLE bug, but when it streams to Adobe FMS it reconnects normally.

liluxdev commented 12 years ago

Uhm... i'm asking if https://github.com/arut/nginx-rtmp-module/issues/20 fix can help us.... Do you tried the new-live3 branch?

SkeLLLa commented 12 years ago

Just compiled new-live3 branch and set drop_idle_publisher param. I'll test it and say if it helps.

SkeLLLa commented 11 years ago

Module compiled from new-live3 branch with drop_idle_publisher param not helped.

arut commented 11 years ago

Could you provide full debug log of nginx? Both before and after the network problem.

SkeLLLa commented 11 years ago

It will be quiet difficult because it appears after sevral hours of usage. I tried manually reproduce by unplugging nework cable for some seconds, but it not appeared. I'll try more with enabled debugging, but I don't know if I catch it. Also in error logs I've found string "Dropping idle publisher ... ", just after disconnect. So according to this, drop_idle_publisher param is working, but result is the same.

arut commented 11 years ago

It's ok for me to receive big log.

SkeLLLa commented 11 years ago

Check this one http://live.peopleair.tv/live/error_19112012. I'm not 100% shure that "before" period is present here (connection lost in the same time when I enable log), but "after" preiod is present.

liluxdev commented 11 years ago

I confirm I cannot reproduce the bug disconnecting network cable too: it only happens only after 5-7 hours of publishing after first small network down ( I can confirm the net down because I've a ping logger with google.com and it also goes down for 2-3 secs when publisher loses the connection)

arut commented 11 years ago

Here's the error in log 'RTMP in chunk stream too big: 64 >= 32'. Because of that error connection is closed by nginx and started again by the other side (FMS). That repeats again and again.

Max chunk stream number in nginx-rtmp is 32 by default. It can be increased with max_streams directive. However I'm not sure that will help (however you can try).

SkeLLLa if you could sniff network (with tcpdump or wireshark) at that moment (when the above mentioned error appears in log) that would be just amazing! You just have to wait until that error persists in log and sniff of 1-5 minutes of traffic.

arut commented 11 years ago

There's another solution you can try. Write this in server{} (not application{}!) block.

publish_time_fix off
play_time_fix off;

SkeLLLa commented 11 years ago

I've set RTMP chunk to 128, also set publish_time_fix and play_time_fix as you recommended. But stream still after several hours drops. I've got new errors (repeats many times after stream drop):

2012/11/22 09:05:39 [info] 28919#0: *2 createStream, client: 176.36.227.154, server: 0.0.0.0:1935
2012/11/22 09:05:39 [info] 28919#0: *2 publish: name='0' args='videoKeyframeFrequency=5&totalDatarate=906' type=live silent=0, client: 176.36.227.154, server: 0.0.0.0:1935
2012/11/22 09:05:40 [info] 28919#0: *2 too big message: 1048576, client: 176.36.227.154, server: 0.0.0.0:1935
2012/11/22 09:05:40 [info] 28919#0: *2 disconnect, client: 176.36.227.154, server: 0.0.0.0:1935
2012/11/22 09:05:40 [info] 28919#0: *2 deleteStream, client: 176.36.227.154, server: 0.0.0.0:1935

Here is dumped log http://live.peopleair.tv/live/tcpdump_nginx

arut commented 11 years ago

I've fixed that. Please 1) check out fmle-time-fix branch 2) remove publish_time_fix/play_time_fix/max_streams lines from config

if something goes wrong please report (+dump). if it works ok I will merge that branch into master

SkeLLLa commented 11 years ago

Builded new nginx from fmle-time-fix, seems that FMLE autoreconnected successfully (before it was in "down" state). I'll test it for some days to be shure that fix works. Many thanks.

SkeLLLa commented 11 years ago

After a day of testing FMLE autoreconnects normally. Can't test any more, because we moved to linux+ffmpeg.

arut commented 11 years ago

That's great, thanks for testing.

liluxdev commented 11 years ago

Arut, please let me know when you merge that into master and i will update and test it for a long period in my "production" and #77 issue affected environment (i only have a production one on this project, so i want update only from the master branch!)

arut commented 11 years ago

OK, I'll report here

liluxdev commented 11 years ago

Thanks

liluxdev commented 11 years ago

I didn't resisted to wait :) from now i'm testing the fmle-time-fix branch directly in my high load environment, I'll report the result.

liluxdev commented 11 years ago

@arut The fix is awesome: for the first time i had no downtime and no require for manual FMLE restart over 3 days (still running with no downtime: testing it in my prod. environment with a minimum of 100 connected clients and over 6000 listeners per day)

arut commented 11 years ago

Great news. Thanks for your report. I'll merge it.

matiaspl commented 11 years ago

The discussion might sugest, that if you use FMLE with main branch (fmle-time-fix seems to have been merged into the main branch), you should not use publish_time_fix/play_time_fix/max_streams in the config. We found that this is not the case, and for FMLE to work correctly with nginx-rtmp you still have to include publish_time_fix. We also use drop_idle_publisher, just in case.

arut commented 11 years ago

I'll check that later (I'm out of city now and I only have tablet with me). It's still possible that the directive is needed for fmle but not needed for ffmeg which is the default and most wanted.

fsram commented 11 years ago

I use "flme-time-fix" ...

/home/root/nginx-rtmp-module-fmle-time-fix/ngx_rtmp.c: In function ‘ngx_rtmp_add_addrs’: /home/root/nginx-rtmp-module-fmle-time-fix/ngx_rtmp.c:681: error: passing argument 2 of ‘ngx_sock_ntop’ makes integer from pointer without a cast src/core/ngx_inet.h:110: note: expected ‘socklen_t’ but argument is of type ‘uchar ’ /home/root/nginx-rtmp-module-fmle-time-fix/ngx_rtmp.c:681: error: passing argument 3 of ‘ngx_sock_ntop’ makes pointer from integer without a cast src/core/ngx_inet.h:110: note: expected ‘u_char ’ but argument is of type ‘long unsigned int’ /home/root/nginx-rtmp-module-fmle-time-fix/ngx_rtmp.c:681: error: too few arguments to function ‘ngx_sock_ntop’ make[1]: _* [objs/addon/nginx-rtmp-module-fmle-time-fix/ngx_rtmp.o] Error 1 make[1]: Leaving directory `/usr/local/nginx' make: *\ [build] Error 2

my system: centos 6.4

arut commented 11 years ago

It's pretty old. All the changes are now merged into master.

arut commented 10 years ago

FMLE problems https://github.com/arut/nginx-rtmp-module/issues/354

jonlatorre commented 9 years ago

Hi. Same problem here. If the emiter connection breaks up is imposible to transmit again until nginx restar. I've upload teh debug log:

http://webminar.kernet.es:8080/error_reconectar_debug

And the conf is worker_processes 14; rtmp_auto_push on; rtmp_auto_push_reconnect 1s;

rtmp { server { access_log /var/log/nginx/rtmp-access.log; listen 1935; ping 2s; ping_timeout 1s; notify_method get; max_streams 128; chunk_size 128; publish_time_fix off; application recepcion { live on; meta copy; allow publish all; allow play all; } }

anas10s commented 7 years ago

hello sir ,

when i restart my internet connection i lose connection to the server rtmp incomming and i will have to reset my ngnix server manually is what to fix this anomaly my config

rtmp_auto_push on; rtmp { server {

listen 8001; push_reconnect 3s; interleave on; wait_video on; idle_streams off; chunk_size 4096; notify_method get; ping 10s; ping_timeout 600s;

thank you in advance for your help

arut / nginx-rtmp-module

If the publisher connection flaps, nginx rtmp needs a reboot to work well #77