ijsf opened this issue 9 years ago
Is the OpenWebRTC app Bowser or a native app?
Bowser. Will definitely test this with the native app.
I would be very interested in any performance comparison that you do. Thanks!
NativeDemo (with stefanalund's framework):
NativeDemo (with my current framework):
@ijsf sorry but the framework I sent you before is not the one I got working. Here is the one that I have used to get audio and video working in both directions (owr<->Chrome, owr<->owr): http://static.verkstad.net/OpenWebRTC.framework_old.zip
This framework build is from Feb 28 and still requires a lot of optimizations. I was testing on iPhone 6+ and iPhone 5s (both on iOS 8.2).
Let me know if you also get video in both directions using this older framework.
I'm seeing 1 frame of data, then the iOS app crashes with an EXC_BAD_ACCESS in one of the VP8 functions (variance16x16_neon_loop). This may be an old libvpx issue, e.g. https://github.com/EricssonResearch/cerbero/commit/174661c6ebd8fe796d604280fd3d4ddcdba371a2 or https://code.google.com/p/webm/issues/detail?id=737
Just tested with a clean checkout of OpenWebRTC, some partially good news:
(This indicates that the green video issue I posted before may be caused by my orc changes.)
Wouldn't I have hit the same problem if there was a problem in libvpx? Unless it happens under higher CPU load.
Trying to find out more information.
In the meantime, I'm also getting GL related errors and no video output on iOS. I was also having issues with this in pererikb's GL Bowser patches:
** Message: Got remote source: Remote video stream
0:00:32.689136000 779 0x15ac4068 WARN gldisplay gstgldisplay.c:188:gst_gl_display_new: Could not create display. user specified (NULL) (platform: (NULL)), creating dummy
@stefanalund
First of all, it seems I didn't test the green video bug correctly with/without my orc changes. I found a critical bug in orc's current NEON implementation causing all sorts of nasty issues, which I fixed in my fork at https://github.com/ijsf/OpenWebRTC-orc/commit/5e8ad19de2445d9d3611e241f317927ba54e8e56. The green video seems to have been caused by this bug. With the fix (and thus working orc NEON assembly implementation) the performance (even without any compiler optimization) is starting to become very good.
Secondly, the local iOS video display is still not working due to the above GL issue. Not sure what's going on there. Any idea?
UPDATE:
Actually, could this be caused by the EAGL code in gstreamer using OpenGL ES 3.x instead of 2.0? This would explain why it is not working on my device (not OpenGL ES 3.x compatible) but is on newer iPhones (OpenGL ES 3.x compatible).
I just tested with OpenGL ES Analyzer, and found out various OpenGL calls are failing in the gstreamer GL plugin:
180 0x1774acd0 0x75d37 0x0 [0x1774acd0 renderbufferStorage:GL_RENDERBUFFER fromDrawable:NULL]
181 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_WIDTH, {86278144})
182 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_HEIGHT, {86278144})
183 0x1774acd0 0x75d37 0x0 glBindRenderbuffer(GL_RENDERBUFFER, 0u)
184 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16, 86278144, 86278144)
It seems that the EAGL layers passed into glBindRenderbuffer and fromDrawable: are NULL, somehow.
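For comparison, a minimal sketch (the AttachColorRenderbuffer helper and its names are hypothetical, not the gstreamer GL code) of what a working EAGL renderbuffer setup looks like; a nil drawable at this point leaves the renderbuffer without storage, which matches the zero sizes and GL_INVALID_OPERATION errors in the trace above:
#import <OpenGLES/EAGL.h>
#import <OpenGLES/EAGLDrawable.h>
#import <OpenGLES/ES2/gl.h>
#import <QuartzCore/QuartzCore.h>
// Hypothetical helper: backs a color renderbuffer with a CAEAGLLayer.
static GLuint AttachColorRenderbuffer(EAGLContext *context, CAEAGLLayer *layer)
{
    GLuint rbo = 0;
    glGenRenderbuffers(1, &rbo);
    glBindRenderbuffer(GL_RENDERBUFFER, rbo);
    // Allocates storage from the layer; with a nil layer this returns NO and the
    // renderbuffer is left without storage, as seen in the trace above.
    [context renderbufferStorage:GL_RENDERBUFFER fromDrawable:layer];
    GLint width = 0, height = 0;
    glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_WIDTH, &width);
    glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_HEIGHT, &height);
    return rbo;
}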
UPDATE:
Tracked down the OpenGL issue, see: https://github.com/EricssonResearch/openwebrtc/issues/283 - with the appropriate fix (+ my orc changes) the received video on iOS seems to be very smooth.
\o/ Great! So, what's next? Adding armv8-a support or whatever it is called in Xcode?
Maybe you want to close this issue and use a specific one for tracking the orc work. Maybe it's an idea to submit that work to upstream orc before continuing with arm64 work? Then you can get it off your plate so to speak. :)
AWESOME work! Can't wait to have this land in GStreamer and eventually OpenWebRTC.
Regarding:
WARN gldisplay gstgldisplay.c:188:gst_gl_display_new: Could not create display. user specified (NULL) (platform: (NULL)), creating dummy
This is a bogus warning. The EAGL implementation just does not use gst_gl_display. @sdroege - this is a warning that always happens in the normal course of operation. It should probably be eliminated, no?
Update: Filed Bug 746933 - gldisplay: EAGL "Could not create display"
Well, there still seems to be one significant open issue that keeps video on iOS from being useful for video conferencing:
In short, the latency is still very bad, and this is not due to real-time or performance issues: CPU utilization is actually <100% (video only, no audio) and I am testing with @stefanalund's NativeDemo. I am not seeing any gst WARNs during calling, except for two right after initializing the call as we've seen earlier in https://github.com/EricssonResearch/openwebrtc/issues/236:
0:00:17.694641000 1004 0x15253638 WARN dtlssrtpdemux gstdtlssrtpdemux.c:137:sink_chain:<dtls-srtp-demux> received invalid buffer: 1
0:00:21.654848000 1004 0x15253638 WARN rtpjitterbuffer rtpjitterbuffer.c:449:calculate_skew: resync to time 0:00:07.190769000, rtptime 3:36:37.835966666
What's going on here? Maybe this is causing the latency? Maybe @sdroege knows more about this particular issue?
@superdump: Don't really want to go through the administrative hoops of submitting stuff upstream outside of Github right now, if you don't mind ;) Will get to that at a later point. As far as orc is concerned, 64-bit NEON support would be a high priority issue. Don't really have the arm64/AArch64 hardware, so I am hoping somebody else could look at this asap.
The DTLS invalid buffer is due to packets received before handshaking is complete I think.
The clock skew recalculation is probably just an initial skew calculation I would guess but not sure.
Can you check what resolution, framerate and bitrate you're using? Of course 5 second latency is crazy, hopefully it's something pretty obviously wrong. I wonder if it's just something like queuing up video until the DTLS handshake is complete and then sending data or something like that. Except that it never catches up.
@sdroege - is there a good way to get GST_DEBUG=6 logs from iOS devices with minimal performance impact? I.e. Writing to a file and copying it over somehow?
@superdump I got the following information from NativeDemo:
2015-03-28 17:46:08.188 NativeDemo[1004:509913] Video settings: {
bitrate = 1024000;
framerate = 25;
height = 480;
mode = VGA;
width = 640;
}
Video is encoded with VP8, clockrate 90000.
I've made those parameters configurable; go to Settings -> NativeDemo to modify them.
Would be interesting to see if a really low resolution (or frame rate) helps with the latency issue.
QCIF and QVGA are unusable and result in an oddly shaped black frame on the receiving end in Chrome.
CIF works and reduces the latency to 0.5 s or so initially, however latency gradually increases to several seconds and longer.
(Latency on the receiving end in iOS is initially great but also gradually increases to several seconds and more.)
As there is a dependence on resolution, it sounds like some thread is not able to complete the processing of all its elements in real time. For real-time processing, each thread must complete all processing for one frame within at most one frame duration. We can split the processing pipelines into more threads using queue elements. What we need to know now is how long each processing element spends on each buffer.
Can you make a log with the following environment variables and paste it into a gist then link it here?
GST_DEBUG=GST_PADS:6,GST_ELEMENT_PADS:6,GST_SCHEDULING:7 GST_DEBUG_NO_COLOR=1
The GST_*PADS stuff is to figure out how the elements in the pipelines are connected together. The GST_SCHEDULING part prints a message every time a buffer is passed to the next element's chain function or so, which allows you to see when a buffer enters and exits an element. With this one can figure out where the bottlenecks are, what needs optimizing and where queues may be needed.
Insertion of queues causes the chain before the queue and after it to run in separate threads. This does not in itself decrease latency but it does allow more throughput by parallelising portions of the processing.
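To illustrate, a minimal sketch (the insert_queue helper and element names are hypothetical, not OpenWebRTC code) of splitting two already-linked elements into separate streaming threads with a queue:
#include <gst/gst.h>
/* Illustrative only: "producer" and "consumer" are assumed to already be in the
 * bin and linked. After this, the chain upstream of the queue and the chain
 * downstream of it each run in their own streaming thread. */
static void
insert_queue (GstBin *bin, GstElement *producer, GstElement *consumer)
{
  GstElement *queue = gst_element_factory_make ("queue", NULL);
  gst_bin_add (bin, queue);
  gst_element_unlink (producer, consumer);
  gst_element_link_many (producer, queue, consumer, NULL);
  gst_element_sync_state_with_parent (queue);
}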
@superdump Unfortunately this makes the app unusable due to the sheer volume of log output.
Right, the rendering of the log output in Xcode is really slow. What normally works is to be able to write the log output to a file. Is there some writable path on iOS? I think you used to be able to do that. If so, you can also set GST_DEBUG_FILE to the path of the file you want the gst debug log to be written and then copy it over afterwards.
Use something like:
setenv("GST_DEBUG_FILE", [[NSTemporaryDirectory() stringByAppendingPathComponent:@"foo.txt"] cStringUsingEncoding:NSUTF8StringEncoding], 1);
Then retrieve (the whole app container) like this: http://stackoverflow.com/a/28161494/130066
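Putting both suggestions together, a hedged sketch of what could be set before GStreamer/OpenWebRTC is initialized (the log filename is arbitrary; the categories are the ones suggested above):
#import <Foundation/Foundation.h>
#include <stdlib.h>
// Illustrative only: must run before gst_init() so the variables are picked up.
setenv("GST_DEBUG", "GST_PADS:6,GST_ELEMENT_PADS:6,GST_SCHEDULING:7", 1);
setenv("GST_DEBUG_NO_COLOR", "1", 1);
setenv("GST_DEBUG_FILE",
       [[NSTemporaryDirectory() stringByAppendingPathComponent:@"gst.log"]
        cStringUsingEncoding:NSUTF8StringEncoding], 1);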
Good to know, thanks! @sdroege ^
@ijsf - maybe it's easier to dump pipeline graphs then to see how stuff is linked together. GST_DEBUG_DUMP_DOT_DIR should be set to the NSTemporaryDirectory
I'll use the opportunity to plug my Python thing for lldb: https://gist.github.com/ikonst/6874ff814ab7f2530b2a
Basically, I wrote it since after you setenv("GST_DEBUG_DUMP_DOT_DIR", [NSTemporaryDirectory() cStringUsingEncoding:NSUTF8StringEncoding], 1), you still need to retrieve the container each time from your iOS device, and that's really annoying. Instead, you just do gst_dot some_gst_object in your lldb prompt.
The Python code just retrieves it from the running process, saves the .dot locally and opens it in Graphviz. (As a bonus, for every GstObject it goes to the top of the hierarchy, i.e. to the bin.)
@ijsf - I'd like to see where your improvements take us. If you create a simple "self-view" pipeline, as done in the "Selfie" sample, make sure you're getting:
For reference, that's what you get on iPhone 5s with a simple iOS app (~20 lines of code, basically following the AVFoundation guide), capturing 1280x720 video and drawing each 32BGRA frame with Core Image onto an OpenGL layer.
Also for reference, an H264-loopback iOS app which captures, encodes, decodes (via VideoToolbox) and renders takes up ~24% CPU on iPhone 5s. Understandably my app has no networking, while an OWR loopback (i.e. the "test_send_receive" sample) has the additional overhead of networking (and also crypto and audio, if you choose to keep them enabled). Yet, the CPU usage I've been seeing is closer to 70%.
@ikonst I just tested your script and it seems to work well. I'm having a hard time, though, finding a place to break into the debugger and get a GstElement that is part of the pipeline (for NativeDemo). Do you have any suggestions on when to break and which GstElement to pick?
As far as performance goes: since a call at CIF resolution is also accumulating latency (note that this is even without the whole audio pipeline), I assume there's a serious issue in the gstreamer pipeline somewhere.
The top consumers regardless of thread, with CIF resolution on iOS according to Time Profiler during a two-way video-only call, are:
9.2% mach_msg_trap / mach_msg / io_connect_method in IOKit. <- this is iOS' OpenGL pipeline overhead
5.5% vp8_pack_tokens
3.3% _platform_memmove / gst_buffer_copy_into / gst_mini_object_make_writable / gst_inter_video_src_create <-- seems very wasteful
2.9% GetCoeffs / vp8_decode* functions
2.8% tokenize1st_order_b / vp8_tokenize_mb
2.3% vp8_sub_pixel_variance16x16_neon_func
2.1% vp8_fast_quantize_b_neon
2.0% vp8_pick_inter_mode
1.8% poll / g_poll / g_main_context_poll
1.8% vp8_sixtap_predict16x16_neon
1.6% aes_encrypt
And so forth, with many more VP8 functions accumulating to a significant amount of CPU time.
UPDATE:
Just analyzed a little bit more. Looking at thread-specific consumers: a large part is consumed by VP8 (~44% encoding + ~10% decoding), OpenGL overhead (~13%), crypto and polling (~9%), misc gstreamer stuff (>9%), orc (~4%), adding up to around 100% saturation by the entire pipeline.
I forced the Chrome end to 320x240 to make sure the decoding wasn't a bottleneck here. It still seems the encoding is taking an awful lot of time. I double-checked the frame resolution in gst_vp8_enc_handle_frame, which appears to be 352x288 @ 25 Hz as expected.
I just noticed that CONFIG_REALTIME_ONLY in libvpx is set to 0 instead of 1; this can't be good, can it? With it at 0, several functions that show up among the top time consumers get invoked.
See: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-reviews/GGUdn7giyvs
UPDATE:
Reduced encoder CPU time percentage from ~44% to ~32% by setting CONFIG_REALTIME_ONLY and CONFIG_ONTHEFLY_BITPACKING to 1 in the libvpx config.
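For reference, and as an assumption about the build setup rather than a verified cerbero change: these symbols correspond to libvpx's --enable-realtime-only and --enable-onthefly-bitpacking configure options, so after reconfiguring, the generated vpx_config.h should end up with:
/* Excerpt of the expected generated vpx_config.h */
#define CONFIG_REALTIME_ONLY 1
#define CONFIG_ONTHEFLY_BITPACKING 1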
@ikonst I just tested your script and it seems to work well. I'm having a hard time, though, finding a place to break into the debugger and get a GstElement that is part of the pipeline (for NativeDemo). Do you have any suggestions on when to break and which GstElement to pick?
Seeing as NativeDemo uses openwebrtc-ios-sdk, I'd inspect e.g. OwrTransportAgent *transport_agent and OwrVideoRenderer *renderer in OpenWebRTCNativeHandler.m.
Here's my cheatsheet:
// transport_agent is some OwrTransportAgent*
gst_dot transport_agent->priv->transport_bin
// source is some OwrMediaSource*
gst_dot source->priv->source_bin
// renderer is some OwrMediaRenderer*
gst_dot renderer->priv->pipeline
Possibly OpenWebRTC could be changed to make such introspection more convenient :)
Reduced encoder CPU time percentage from ~44% to ~32% by setting CONFIG_REALTIME_ONLY and CONFIG_ONTHEFLY_BITPACKING to 1 in the libvpx config.
!!!
Some optimization remarks:
1. gst_inter_video_src_create: non-writeable buffers. These must be writeable, as this is causing a significant and unnecessary performance drag on the pipeline (see the sketch after this list).
2. glFlush appears to be called more often than necessary.
3. There is a lot of copying going on, e.g. via gst_video_frame_copy_plane.
4. The large amount of time spent in mach_msg_trap may be unnecessary locking/thread-safety overhead.
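To spell out why point 1 hurts (generic GStreamer behaviour, not OpenWebRTC-specific code; the ensure_writable helper is hypothetical): making a buffer writable only avoids a copy when nothing else holds a reference to it, otherwise it degenerates into the gst_buffer_copy_into / memmove seen in the profile above:
#include <gst/gst.h>
/* Illustrative only: if "buf" is not writable (e.g. something else still holds a
 * reference), gst_buffer_make_writable() has to copy it via gst_buffer_copy_into(),
 * which is where the memmove in the profile comes from; if the buffer is already
 * writable it is returned unchanged and nothing is copied. */
static GstBuffer *
ensure_writable (GstBuffer *buf)
{
  return gst_buffer_make_writable (buf);
}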
Nice improvements!
I agree with point 1.
Point 2 may need glFlush to trigger rendering at close to the right time. I guess it ideally shouldn't be called more than necessary.
Point 3 needs more details - where are the copies happening? Is it just as a consequence to 1?
I think point 4 is wrong. mach_msg_trap seems to suggest that threads are waiting on each other. That is in line with what I said about at least one thread not being able to run in real time and latency increasing.
This is normally what you would see in a pipeline split into a few threads. If a thread fills up the queue at the end of its chain faster than the downstream thread consumes from it, it has to sit and wait for free space, i.e. it waits for the following thread to take a buffer from the queue.
Can you easily see which thread(s) are maxed out? Or the backtrace of that mach_msg_trap? That could provide clues before I can get hold of your debug log to analyze.
I think https://github.com/EricssonResearch/openwebrtc/issues/240 is the key for eradicating memcpies.
Some of the points ijsf is raising were actually brought up by me on IRC yesterday:
Point 2 may need glFlush to trigger rendering at close to the right time. I guess it ideally shouldn't be called more than necessary.
Actually, it's no more than a hunch. From Apple's [OpenGL ES Design Guidelines](https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/OpenGLESApplicationDesign/OpenGLESApplicationDesign.html):
On some desktop OpenGL implementations, it can be useful to periodically call the glFlush function to efficiently balance CPU and GPU work, but this is not the case in iOS.
I'm naturally cautious of the current GStreamer code being too desktop-oriented, e.g. where energy was never a concern, but I haven't yet proven for a fact that any glFlushes are redundant.
Generally we should strive for our "selfie" demo to have the same (or better) performance as a simple "selfie" iOS app that gets captureOutput:didOutputSampleBuffer:fromConnection: callbacks and draws using Core Image on the GL context.
Point 3 needs more details - where are the copies happening? Is it just as a consequence to 1?
I need to look into it. All I see is that gst_vtenc_encode_frame leads to gst_video_frame_copy. Probably need to build without -Os to see the missing functions.
I think point 4 is wrong. mach_msg_trap seems to suggest that threads are waiting on each other. That is in line with what I said about at least one thread not being able to run in real time and latency increasing.
This, too, is just a hunch for now. I was just wondering how much of GStreamer's thread-safety addresses a scenario that's not interesting to us. It's akin to using a thread-safe malloc in a single-threaded app, but incurring all the costs of the locking.
I need to look into it. All I see is that gst_vtenc_encode_frame leads to gst_video_frame_copy. Probably need to build without -Os to see the missing functions.
LOL, there were no missing functions. gst_vtenc_encode_frame calls gst_video_frame_copy directly.
Comment says:
/* FIXME: iOS has special stride requirements that we don't know yet.
* Copy into a newly allocated pixelbuffer for now. Probably makes
* sense to create a buffer pool around these at some point.
*/
So, before we give our raw image data to the H.264 encoder:
1. We could use CVPixelBufferCreateWithPlanarBytes to create a CVPixelBuffer out of an existing buffer. (Does not alloc or copy; see the sketch below.)
2. Instead, we call CVPixelBufferCreate (which mallocs -- rather cheap but not free) and then copy (expensive!) our image data into it.
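A hedged sketch of option 1 for NV12, wrapping existing planes in a CVPixelBuffer without copying (the WrapNV12Planes helper and its plane pointers/strides are placeholders for whatever the mapped frame provides; real code also needs a release callback to unmap/unref the source buffer):
#import <CoreVideo/CoreVideo.h>
// Illustrative only: no pixel data is allocated or copied here.
static CVPixelBufferRef WrapNV12Planes(size_t width, size_t height,
                                       void *yPlane, size_t yStride,
                                       void *uvPlane, size_t uvStride)
{
    void *bases[2]    = { yPlane, uvPlane };
    size_t widths[2]  = { width, width / 2 };
    size_t heights[2] = { height, height / 2 };
    size_t strides[2] = { yStride, uvStride };
    CVPixelBufferRef pixelBuffer = NULL;
    CVPixelBufferCreateWithPlanarBytes(kCFAllocatorDefault, width, height,
        kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange,
        NULL, 0,            // no single contiguous backing block
        2, bases, widths, heights, strides,
        NULL, NULL,         // release callback + refcon (omitted in this sketch)
        NULL, &pixelBuffer);
    return pixelBuffer;
}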
That comment sums it up then. Buffer pool for the win. :smile:
@superdump - the malloc is not what's expensive, it's the memcpy. The FIXME should be fixed.
@alessandrod - are you aware of the above?
@ikonst - it can be that a buffer pool avoids the need to copy, but the comment makes it sound like it's necessary. It sounds a bit odd though, and in our case of a two-way video chat, the incoming video is captured by the camera and one would hope it would be in the same format as can be consumed by the encoder.
@superdump - yes, currently GstAVFVideoSrc produces NV12. Since GstAVFVideoSrc doesn't seem to do anything funny (seeing avfvideosrc.m -- grep for gst_core_media_buffer_new), I would tend to agree with you.
That FIXME was authored by @sdroege , btw :)
I guess the use of glFlush could be explained by gstreamer dealing with multiple GL contexts and having to sync between them. I'm also seeing excessive calls to things such as glDeleteTextures though.
And indeed, so far we've had two FIXMEs with detrimental effects on (iOS) performance or stability.
For vtenc, a buffer pool that directly provides the buffers from vtenc upstream would solve the copy, yes. I just didn't do that back then because I just had to get it working and had too many other things on my list to worry about it for now :)
For libgstgl... it's not really "desktop oriented", and was never intended to be :) That said, there might be lots of places where performance can be improved. We just have to find them and see if it's actually safe to do in all circumstances.
@sdroege In vtenc, you claim that you copy because of a different stride requirement. I don't get how a buffer pool has anything to do with that.
Also, I wrote a basic iOS app shoving samples from AV capture into the VT encoder and it works just fine. It's clear that we're doing work here that we could refrain from doing.
(My last comment might've come off too aggressive. I didn't mean it to be. Anyway, I hope to look into that vtenc code in the next days.)
The comment might be a bit confusing. What I meant there is:
Can upstream software (and hardware) elements write into CVPixelBuffers without issue? If so and if I understood correctly, it would seem that a buffer pool of CVPixelBuffers would be most effective. Is that correct?
Depends on the elements. basetransform elements all will at least... vtdec and avfvideosrc probably can't, but for those we can in theory directly use the CVPixelBuffers we get from them inside vtenc.
... wraps around a CVPixelBufferPool and provides those to upstream elements.
The following code works great on iOS:
-(void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
CVPixelBufferRef pixelBuffer = (CVPixelBufferRef)CMSampleBufferGetImageBuffer(sampleBuffer);
CMTime pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
CMTime dur = CMTimeMake(1, 30);
VTCompressionSessionEncodeFrame(_compressionSession, pixelBuffer, pts, dur, NULL, NULL, NULL);
}
The capture session produces sample buffers that VT is happy to accept. If we do something to them in the process, it's our fault.
If we do something to them in the process that makes them no longer a CVPixelBuffer. :smile: I expect we can do efficient operations on CVPixelBuffers.
in theory directly use the CVPixelBuffers we get from them inside vtenc
This can be done by taking out the "meta" from the GstBuffer, right?
Supposedly doing this would also be more efficient because it'll spare a CVPixelBuffer allocation and construction.
So, as I understand, we're doing something wrong when reconstructing the CVPixelBuffer in vtenc. To fix it:
Summarizing multiple related problems into this ticket for now.
I have been testing the local video capturing in the iOS build of OpenWebRTC on an iPad Mini Gen 1 (armv7) using demo.openwebrtc.io:38080.
A-side: iOS 8.1.3, iPad Mini Gen 1
B-side: Chrome 41.0.2272.76 (64-bit) on OS X 10.10
Currently, there are some serious quality issues with the video that have to be solved in order for OpenWebRTC to be useful for video conferencing: