I tried to track down this issue too. The CPU load is unnaturally high. I did
profiling on siprtmp. The load is located in the multitasked socket reading. In my
humble opinion the problem is in the p2p-sip package, in file rfc3550.py, class
Network, function receiveRTP, at the line:
data, remote = yield multitask.recvfrom(sock, self.maxsize)
The RTP reading there is unbuffered and multitasking is firing reads much too
often. I thought that writing a buffer for RTP, like it is done in the RTMP part,
would fix the problem, but it seems impossible to make that work well in the
current architecture due to these problems (see the sketch after the list):
- In multitasking, reading more than one DATAGRAM packet per task, when packets
arrive roughly every 20 ms, would block the task machine and cause other RTP
read tasks to hang, which would lead to bad audio quality in the live stream,
building up quickly with the number of connections.
- If one would somehow solve the multitask problem, a new issue appears.
Python sockets lack a read-exactly-one-packet function (I found a plugin that can
work around this problem). So it is possible to read a fragmented packet while
buffering (given some funny audio-quality issues I had, I suspect that it
happens even now without the buffering, but I haven't had time to prove it). An
option is to write a payload parser, but it would be dependent on the stream
payload type and would be complex.
- One could think this can be fixed by not firing the read task more often
than every X milliseconds, where X is lower than 20, by checking the delay since
the last read. But this strategy will fail due to the above problems with
fragmented packets, which should be fixed first.
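For illustration, a minimal sketch of the kind of unbuffered, per-datagram read
loop described above; the function and callback names here are only illustrative,
the real code is Network.receiveRTP in rfc3550.py:

import multitask

MAX_SIZE = 1500  # illustrative datagram size limit

def receive_rtp(sock, on_packet):
    # Sketch only: one scheduler wake-up per datagram. With one 20 ms audio
    # frame per packet this resumes the task about 50 times per second per
    # stream, which is the read frequency complained about above.
    try:
        while True:
            data, remote = yield multitask.recvfrom(sock, MAX_SIZE)
            if data:
                on_packet(data, remote)  # hypothetical handler
    except GeneratorExit:
        pass  # the task is cancelled when the session closes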
Original comment by Lukasz.P...@gmail.com
on 16 Mar 2011 at 12:18
It seems to be hard to resolve, since if you buffer audio it automatically
creates latency, and in VoIP latency is very problematic.
Maybe converting the Python to C or C++ would reduce the CPU
(http://shed-skin.blogspot.com can be an alternative, but it lacks important
features); maybe Kundan will find a way to resolve this issue.
Original comment by pratn...@gmail.com
on 16 Mar 2011 at 4:32
There is maybe something even more interesting:
http://cython.org/
It helps use C extensions in Python.
Original comment by pratn...@gmail.com
on 16 Mar 2011 at 5:03
Yes, buffering is a pain due to latency. In my tests (VideoPhone two-way speex-16kHz
conversation) switching off RTP reading reduced the load 3 times. So RTP reading
is probably responsible for about 66% of the load. It would be nice if somebody
was able to criticize my hypothesis about the source of the CPU load.
Original comment by Lukasz.P...@gmail.com
on 17 Mar 2011 at 12:21
I reread my notes from the profiling a moment ago. During the profiling I noted that
select.select on the RTP sockets in the multitasking library (multitask.py) is
responsible for the CPU load when reading RTP.
Original comment by Lukasz.P...@gmail.com
on 17 Mar 2011 at 1:02
Nice, so did you find an alternative to select.select?
I'm not a Python expert, sorry...
Original comment by pratn...@gmail.com
on 17 Mar 2011 at 6:45
It seems that too many calls to select.select due to improper multitasking could
lead to a similar CPU overload too. Multitasking could behave differently with
datagrams, or RTP reading triggers a bug in the multitasking, such as poll-like
select.select behavior when the timeout is 0.0. The difference is sketched below.
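To make the claim concrete, here is a small self-contained experiment (not siprtmp
code) contrasting a 0.0 select timeout with a blocking timeout:

import select
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))  # nothing will ever be sent to this socket

# Poll-like behavior: a 0.0 timeout makes select.select return immediately,
# so a loop around it spins the CPU even though no datagram arrives.
end = time.time() + 0.1
polls = 0
while time.time() < end:
    readable, _, _ = select.select([sock], [], [], 0.0)
    polls += 1
print("non-blocking select calls in 100 ms: %d" % polls)  # typically thousands

# Blocking with a real timeout lets the process sleep between packets.
readable, _, _ = select.select([sock], [], [], 0.02)
print("readable after a 20 ms wait: %r" % readable)  # [] here, but no spinning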
Original comment by Lukasz.P...@gmail.com
on 18 Mar 2011 at 9:39
I am not a Python expert either, but maybe together we will work out a solution in
this discussion. All my attempts to reduce the load have failed. The only thing I
can recommend for now is to reduce RTP reading as much as possible.
Original comment by Lukasz.P...@gmail.com
on 18 Mar 2011 at 9:45
OK, I think you're right, since you have some proof of what causes the load.
I will try to google for a select.select alternative. Hopefully Kundan, the
author of rtmplite, will also find something.
I will post here if I find something interesting.
Good luck.
Original comment by pratn...@gmail.com
on 18 Mar 2011 at 5:19
Found this recipe with a Python chat application:
http://code.activestate.com/recipes/531824-chat-server-client-using-selectselect/
In the comments someone complains of 60% CPU load for this simple Python chat.
I think we have enough proof to say that select.select is the problem.
Original comment by pratn...@gmail.com
on 18 Mar 2011 at 5:28
This link is also interesting:
http://twistedmatrix.com/documents/current/core/howto/choosing-reactor.html#auto1
Original comment by pratn...@gmail.com
on 18 Mar 2011 at 5:47
Maybe the solution is here:
http://www.artima.com/weblogs/viewpost.jsp?thread=230001
Original comment by pratn...@gmail.com
on 18 Mar 2011 at 6:11
It seems that the Twisted modules are the solution for asynchronous I/O tasks in Python:
http://wiki.python.org/moin/Twisted-Examples
Only 5 lines of code to create a proxy server, impressive!
Original comment by pratn...@gmail.com
on 18 Mar 2011 at 6:34
I feel one good solution would be to move the media transport processing into a
separate C/C++ module and load it from Python for high-performance servers. It
could then use other techniques (epoll, poll, etc.) and keep the media path in
C/C++. This would also give an opportunity to implement multi-threaded media
transport processing, which currently is not easily possible with multitask.py.
If I get support for working on this, I can attempt it. Otherwise, it will have to
wait until I find someone/a student to work on it. A friend suggested
CPython.
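If it helps the discussion, a bare-bones sketch of loading such a native module
from Python via ctypes; the library name and function are hypothetical
placeholders, not an existing component of this project:

import ctypes

# Hypothetical native library; nothing with this name exists in the project.
lib = ctypes.CDLL("./libmediapath.so")
lib.forward_rtp_to_rtmp.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
lib.forward_rtp_to_rtmp.restype = ctypes.c_int

def forward(packet):
    # The C/C++ side could run its own poll/epoll loop and worker threads,
    # keeping the per-packet work out of the Python scheduler entirely.
    return lib.forward_rtp_to_rtmp(packet, len(packet))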
Original comment by kundan10
on 20 Mar 2011 at 2:41
After reading your post above I checked the poll/epoll docs. poll/epoll is more
efficient than select.select due to better scaling with the number of sockets.
It seems possible to add a select.poll/select.epoll-based implementation to the
pure-Python multitask.py library; this kind of approach is described in the Python
documentation. Is it wise to try to implement it that way to gain some CPU load
improvement? Should I try? What is your experience with select.poll/select.epoll
in Python? Please advise. A rough sketch of the idea follows.
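A rough, Linux-only sketch of what the proposed epoll-based wait could look like;
the bookkeeping and scheduling hooks here are illustrative and not taken from
multitask.py:

import select

poller = select.epoll()
waiting = {}  # fd -> task waiting for that descriptor to become readable

def register_read(sock, task):
    fd = sock.fileno()
    waiting[fd] = task
    poller.register(fd, select.EPOLLIN)

def handle_io(timeout):
    # epoll's cost scales with the number of *ready* descriptors, while
    # select.select rescans every watched descriptor on each call.
    ready = []
    for fd, events in poller.poll(timeout):
        poller.unregister(fd)
        ready.append(waiting.pop(fd))
    return ready  # the scheduler would resume these tasks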
Original comment by Lukasz.P...@gmail.com
on 20 Mar 2011 at 9:17
Yes, poll/epoll are more efficient, but that means choosing between Unix and Windows... ;)
epoll works only on Unix, but for me that's not a problem. I think for now changing
select.select to epoll is the right instant solution... Kundan?
Original comment by pratn...@gmail.com
on 20 Mar 2011 at 9:34
Multitask registers/unregisters sockets per read/write request, so the socket
would need to be registered/unregistered with poll on every read/write request,
which could make it slower than expected. Can you estimate this additional
load?
Original comment by Lukasz.P...@gmail.com
on 20 Mar 2011 at 5:51
If it reduces the CPU load, yes.
I have never compared them.
Original comment by pratn...@gmail.com
on 20 Mar 2011 at 6:04
I finished my second look at the CPU load. Tests show that the select.select()
strategy isn't the source of the problem. I was able to reduce the select.select()
impact to 25% (66% previously), but it reduced the load by only 25%. I did it by
altering multitask to check for I/O no more than 100 times per second. The decrease
was only 25% because the time spent in delays increased to 39% of CPU (many short
0.01 s waits due to multitask.sleep(0.01) in rtmp.py). Moreover, I noticed that
without the multitask modification, adding more waits does not change the load or
quality. It seems that the whole multitasking is unbalanced in siprtmp/multitask.
I suspect that multitask does not work as expected, and that siprtmp was built on
assumptions which are not met by multitask.
Original comment by Lukasz.P...@gmail.com
on 24 Mar 2011 at 2:20
Interesting, so it means that select.select is one problem among others...
Anyway, the multitask library has not been updated since 2007 and its author (I
wrote to him) has definitely abandoned maintenance of the project. So I think
changing multitask to another async/multithreaded solution (Twisted Matrix?) would
be a priority if siprtmp is to evolve.
Original comment by pratn...@gmail.com
on 24 Mar 2011 at 2:37
I remember from my PHP experience with multithreaded sockets and CPU that I used a
socket function that worked very well, with less than 1% CPU load, when I
used NULL as the timeout parameter, and if I used any number instead the CPU was at
99% all the time...
Maybe it can help.
Original comment by pratn...@gmail.com
on 24 Mar 2011 at 2:41
In my opinion decreasing the elementary select.select execution time (like a poll
implementation) will not help, because the gained time will be consumed by other
components, as in my tests.
Original comment by Lukasz.P...@gmail.com
on 24 Mar 2011 at 5:03
OK, how about NULL for the timeout? Will Python accept that value?
Original comment by pratn...@gmail.com
on 24 Mar 2011 at 5:08
I did some fixes in the code in SVN r60 to reduce single-call CPU by about 55%.
The changes are tested only for a siprtmp.py call, so they may break other stuff.
Following are the changes and the corresponding improvements:
0. A single call was taking 10-12% CPU. Even without a call it took 1.7-1.8%.
1. Changed rtmp.py to use multitask.Queue instead of waiting for 0.01 in the write
method. This was earlier marked as a TODO item. This saves idle-mode CPU, so when
not in a call it now takes 0.0%. In a call it also drops to about 8.5%.
2. Changed multitask.py to handle all the tasks before going back to the io-waits,
which does the select() call. This reduces the number of select calls. After this
it takes 4.6-4.8% CPU per call. Now every 20 ms the select gets called about
twice instead of 10-20 times, which is right.
If I ignore incoming RTP packets after receiving them, a call takes 2.5-2.6%, so
about 2.1-2.3% is taken by processing the RTP->RTMP media path. It doesn't look like
we can do more optimization in Python code using multitask. Some alternatives
are:
1. Move the media path into a C module. Given the way rtmp.py is written,
this is tricky.
2. I found gevent to be pretty easy to use, so perhaps multitask can be
replaced by gevent (after a lot of modification, of course).
A sketch of the queue-based write path from change 1 is below.
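For readers following along, a sketch of the queue-based write path that change 1
describes; it assumes the coroutine-style yield queue.put()/queue.get() usage
implied by the framework, and assumes multitask.send exists alongside the
multitask.recvfrom quoted earlier (check rtmplite/multitask.py for the exact API):

import multitask

class QueuedWriter(object):
    def __init__(self, sock):
        self.sock = sock
        self.queue = multitask.Queue()

    def write(self, data):
        # Producer side: enqueue and return; no more 0.01 s polling sleep.
        yield self.queue.put(data)

    def writer_task(self):
        # Consumer side: parked in the scheduler until data arrives, so an
        # idle connection costs (close to) 0% CPU.
        while True:
            data = yield self.queue.get()
            yield multitask.send(self.sock, data)  # assumed API, see lead-in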
Original comment by kundan10
on 28 Mar 2011 at 9:35
Great, I reviewed it a moment ago. I will test it and give feedback here.
I suspect that change 2 leads to a freeze bug if you have non-I/O generators,
because there will always be some new task as a result of the task-queue
processing (I suspect that the 'for' was there to prevent this from happening). So
I propose to fire handle_io_waits at least once per some threshold (e.g. 0.01 s)
when the calculated timeout is equal to 0.0 (i.e. when there are runnable tasks);
a rough sketch follows below.
I suspect that, in general, the parseMessages generator in rtmp.py yields to the
task manager too often, which leads to task overload in the task manager. In
general, deeper profiling is tricky because time.sleep() and select.select()
somehow cheat the cProfile profiler into reporting the waits as CPU consumption.
RTP I/O generates 0.01 s waits due to the lack of RTP buffering.
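A rough sketch of the threshold idea; handle_io_waits is the name used in the
comment above, while the surrounding loop, the bookkeeping, and the values are
illustrative only:

import time

MIN_IO_INTERVAL = 0.01  # the threshold suggested above
last_io_check = [0.0]

def schedule_io(task_manager, have_runnable_tasks):
    # When runnable tasks keep the computed timeout at 0.0, still make sure
    # the io-waits (the select call) run at least every MIN_IO_INTERVAL,
    # instead of either spinning on select or never reaching it at all.
    timeout = 0.0 if have_runnable_tasks else MIN_IO_INTERVAL
    now = time.time()
    if timeout > 0.0 or now - last_io_check[0] >= MIN_IO_INTERVAL:
        task_manager.handle_io_waits(timeout)  # name taken from the comment
        last_io_check[0] = now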
Original comment by Lukasz.P...@gmail.com
on 28 Mar 2011 at 10:08
Excellent Kundan, I will try your changes today. As Lukasz said, do you think
that creating a very small buffer (e.g. 20 ms) would decrease the CPU processing?
Original comment by pratn...@gmail.com
on 28 Mar 2011 at 3:32
[deleted comment]
[deleted comment]
I updated siprtmp.py, rtmp.py and tested siprtmp.py. No errors were found.
Original comment by Lukasz.P...@gmail.com
on 28 Mar 2011 at 3:51
I updated and tested the CPU: excellent work Kundan! 0.0% without a call, and
one call now represents 2% max, which is amazing. BUT, unfortunately I get
no audio on either side. However, I can see the RTP and RTMP flows in the debug
output without errors, and there is no error on the RTMP side either, no exception
in siprtmp. I'm stuck :(
Original comment by pratn...@gmail.com
on 29 Mar 2011 at 7:40
I tested the changes in multitask.py too. The cumulative decrease of the load in my
test is only about 25%. The approach I proposed in Comment 25 gives the same
results. I tested with two bidirectional connections from VideoPhone.swf
(speex 16 kHz) on one PC to a FreeSWITCH conference on a remote server (network
ping was about 1 ms; voice was provided and received by headphones with a mic).
I did cProfile profiling; the profiling results are attached to this comment.
Maybe it can help here.
Original comment by Lukasz.P...@gmail.com
on 29 Mar 2011 at 10:11
Attachments:
In the test I used an RTP stream with one 20 ms frame per packet.
Original comment by Lukasz.P...@gmail.com
on 29 Mar 2011 at 10:25
I analyzed the load from the attachment. Another global load improvement would be
to reduce yielding to a minimum. Every 'yield' generates a call into multitask,
causing task-processing overhead. FYI: cProfile measures time spent in a function,
and that is not equal to CPU load.
Original comment by Lukasz.P...@gmail.com
on 29 Mar 2011 at 12:31
Apparently Kundan added new yields, especially for the call() RTMP-side functions.
BTW, the result is better, so I'm not sure we can reduce the CPU much more in Python.
On my side I'm still stuck trying to make audio work with a server-side
NetConnection stream.
No error message that helps me...
Original comment by pratn...@gmail.com
on 29 Mar 2011 at 4:17
Hi pratn...@gmail.com, are there other changes in your version of
rtmplite/p2p-sip? Could you send a zip of your files to kundan10@gmail.com
so that I can try it out too?
Hi Lukasz.P...@gmail.com, yes, I added more yields because after removing the sleep
of 0.01 in rtmp.py and using multitask.Queue, a lot of other functions needed to be
changed into generators.
Also, just make sure that rtmplite comes before p2p-sip in PYTHONPATH,
otherwise it will pick up the wrong multitask.py. The rtmplite/multitask.py is the
right one, but there is another multitask.py in p2p-sip.
Original comment by kundan10
on 29 Mar 2011 at 5:53
Yes, the 'yield's introduced to the code with multitask.Queue must stay. As I said
before, I suspect that the parseMessages generator (in rtmp.py) yields to the task
manager too often, which leads to task overload in the task manager. If there is
data in the RTMP buffer, then yield should not be invoked; a sketch of this idea is
below. I am testing a workaround for it and will send details tomorrow.
Unfortunately the logs show that I am using rtmplite/multitask.py.
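A sketch of the "don't yield while buffered data remains" idea; all the argument
names here are hypothetical stand-ins, not the actual rtmp.py internals:

def parse_messages(buffer, try_parse_one, recv_more, handle):
    # Keep parsing complete messages from the buffer without yielding; only
    # go back to the scheduler when the buffer is exhausted.
    while True:
        msg, buffer = try_parse_one(buffer)
        if msg is not None:
            handle(msg)
        else:
            data = yield recv_more()  # buffer exhausted: wait for more bytes
            if not data:
                break
            buffer += data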
Original comment by Lukasz.P...@gmail.com
on 29 Mar 2011 at 6:39
Kundan,
- does it mean that PYTHONPATH has to contain the path of both rtmplite AND p2p-sip?
- OK, I'll send you my hacks; I needed them to make siprtmp compatible with the
server-side NetConnection object, which is slightly different from the client one.
I don't know if it is useful to implement yet, since I get no audio on the latest
revision (r60); I need to figure that out.
Lukasz, it seems that you know the problem deeply and your help on this project
would be precious...
Original comment by pratn...@gmail.com
on 29 Mar 2011 at 11:57
OK, I completely messed up my own update: I used an old revision.
After inserting the changes manually, it finally works again.
Without calls siprtmp is at 0.0%; with one call the max CPU is at 5.6%, which is,
like Lukasz found, about 25% better than the previous version.
I noticed more stable audio quality with fewer glitches.
Original comment by pratn...@gmail.com
on 30 Mar 2011 at 8:43
There was no time for the 'yield' issue today; I will get back to it soon.
Original comment by Lukasz.P...@gmail.com
on 30 Mar 2011 at 6:27
OK, I wrote a patch for rtmp.py trunk, attached to this message, which reduces the
Reactor usage when parsing the RTMP stream. It is only a quick fix. In general the
RTMP parsing approach should be reorganized to reduce reactor usage in a more
elegant way.
Original comment by Lukasz.P...@gmail.com
on 3 Apr 2011 at 3:44
Attachments:
Thanks Lukasz, I'm trying it now.
Original comment by in...@boophone.com
on 3 Apr 2011 at 4:20
I had limited resources for tests so sorry for any typo :)
Original comment by Lukasz.P...@gmail.com
on 3 Apr 2011 at 4:26
My approach to multitask.py is a little different than the one in trunk, and it is
attached as a patch to this message. The 0.01 constant is equal to 0.01 and not
more, because reading more than one RTP packet at once can cause the RTP parsing to
fail, since the read handler cannot cope with more than one packet per read.
Original comment by Lukasz.P...@gmail.com
on 3 Apr 2011 at 4:32
Attachments:
Great. Did you see a performance difference between r68 and your patches?
Original comment by in...@boophone.com
on 3 Apr 2011 at 5:04
I just tried r68 with and without Lukasz's patch.
The CPU for one call is around 10-11%, which is 2 times more than the previous
version (5.9-7%).
Thanks.
Original comment by in...@boophone.com
on 3 Apr 2011 at 7:01
[deleted comment]
Sorry, a correction: r68 is 2 to 3% higher than r60.
Original comment by in...@boophone.com
on 3 Apr 2011 at 8:02
OK. Is there any performance change for you comparing r68 and r68 + my patches?
My patches for the 2 files are independent, so you can apply one at a time and
test. What are the results?
Original comment by Lukasz.P...@gmail.com
on 4 Apr 2011 at 6:30
It seems that the unpatched version consumes 1-2% less.
I will try one file at a time tomorrow.
Thanks.
Original comment by in...@boophone.com
on 4 Apr 2011 at 6:37
Hi Lukasz,
Results:
- the multitask.py patch increases the load by 1% and there's a hangup problem
(calls don't hang up)
- the rtmp.py patch gives almost the same CPU result as unpatched (maybe
sometimes 0.3% less with your patch), and no hangup problem at all.
Original comment by in...@boophone.com
on 4 Apr 2011 at 6:24
Original issue reported on code.google.com by
Boophone...@gmail.com
on 8 Feb 2011 at 11:08