ijsf opened this issue 9 years ago
Is the OpenWebRTC app Bowser or a native app?
Bowser. Will definitely test this with the native app.
I would be very interested in any performance comparison that you do. Thanks!
NativeDemo (with stefanalund's framework):
NativeDemo (with my current framework):
@ijsf sorry but the framework I sent you before is not the one I got working. Here is the one that I have used to get audio and video working in both directions (owr<->Chrome, owr<->owr): http://static.verkstad.net/OpenWebRTC.framework_old.zip
This framework build is from Feb 28 and still requires a lot of optimizations. I was testing on iPhone 6+ and iPhone 5s (both on iOS 8.2).
Let me know if you also get video in both directions using this older framework.
I'm seeing 1 frame of data, then the iOS app crashes with an EXC_BAD_ACCESS in one of the VP8 functions (variance16x16_neon_loop). This may be an old libvpx issue, e.g. https://github.com/EricssonResearch/cerbero/commit/174661c6ebd8fe796d604280fd3d4ddcdba371a2 or https://code.google.com/p/webm/issues/detail?id=737
Just tested with a clean checkout of OpenWebRTC, some partially good news:
(This indicates that the green video issue I posted before may be caused by my orc changes.)
Wouldn't I have hit the same problem if there was a problem in libvpx? Unless it happens under higher CPU load.
Trying to find out more information.
In the meantime, I'm also getting GL related errors and no video output on iOS. I was also having issues with this in pererikb's GL Bowser patches:
** Message: Got remote source: Remote video stream
0:00:32.689136000 779 0x15ac4068 WARN gldisplay gstgldisplay.c:188:gst_gl_display_new: Could not create display. user specified (NULL) (platform: (NULL)), creating dummy
@stefanalund
First of all, it seems I didn't test the green video bug correctly with/without my orc changes. I found a critical bug in orc's current NEON implementation causing all sorts of nasty issues, which I fixed in my fork at https://github.com/ijsf/OpenWebRTC-orc/commit/5e8ad19de2445d9d3611e241f317927ba54e8e56. The green video seems to have been caused by this bug. With the fix (and thus working orc NEON assembly implementation) the performance (even without any compiler optimization) is starting to become very good.
Secondly, the local iOS video display is still not working due to the above GL issue. Not sure what's going on there. Any idea?
UPDATE:
Actually, could this be caused by the EAGL code in gstreamer using OpenGL ES 3.x instead of 2.0? This would explain why it is not working on my device (not OpenGL ES 3.x compatible) but is on newer iPhones (OpenGL ES 3.x compatible).
I just tested with OpenGL ES Analyzer, and found out various OpenGL calls are failing in the gstreamer GL plugin:
180 0x1774acd0 0x75d37 0x0 [0x1774acd0 renderbufferStorage:GL_RENDERBUFFER fromDrawable:NULL]
181 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_WIDTH, {86278144})
182 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_HEIGHT, {86278144})
183 0x1774acd0 0x75d37 0x0 glBindRenderbuffer(GL_RENDERBUFFER, 0u)
184 0x1774acd0 0x75d37 0x0 GL_INVALID_OPERATION <- glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16, 86278144, 86278144)
It seems that the EAGL layers passed into glBindRenderbuffer and fromDrawable: are NULL, somehow.
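For comparison, a minimal sketch (the AttachColorRenderbuffer helper and its names are hypothetical, not the gstreamer GL code) of what a working EAGL renderbuffer setup looks like; a nil drawable at this point leaves the renderbuffer without storage, which matches the zero sizes and GL_INVALID_OPERATION errors in the trace above:
#import <OpenGLES/EAGL.h>
#import <OpenGLES/EAGLDrawable.h>
#import <OpenGLES/ES2/gl.h>
#import <QuartzCore/QuartzCore.h>
// Hypothetical helper: backs a color renderbuffer with a CAEAGLLayer.
static GLuint AttachColorRenderbuffer(EAGLContext *context, CAEAGLLayer *layer)
{
    GLuint rbo = 0;
    glGenRenderbuffers(1, &rbo);
    glBindRenderbuffer(GL_RENDERBUFFER, rbo);
    // Allocates storage from the layer; with a nil layer this returns NO and the
    // renderbuffer is left without storage, as seen in the trace above.
    [context renderbufferStorage:GL_RENDERBUFFER fromDrawable:layer];
    GLint width = 0, height = 0;
    glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_WIDTH, &width);
    glGetRenderbufferParameteriv(GL_RENDERBUFFER, GL_RENDERBUFFER_HEIGHT, &height);
    return rbo;
}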
UPDATE:
Tracked down the OpenGL issue, see: https://github.com/EricssonResearch/openwebrtc/issues/283 - with the appropriate fix (+ my orc changes) the received video on iOS seems to be very smooth.
\o/ Great! So, what's next? Adding armv8-a support or whatever it is called in Xcode?
Maybe you want to close this issue and use a specific one for tracking the orc work. Maybe it's an idea to submit that work to upstream orc before continuing with arm64 work? Then you can get it off your plate so to speak. :)
AWESOME work! Can't wait to have this land in GStreamer and eventually OpenWebRTC.
Regarding:
WARN gldisplay gstgldisplay.c:188:gst_gl_display_new: Could not create display. user specified (NULL) (platform: (NULL)), creating dummy
This is a bogus warning. The EAGL implementation just does not use gst_gl_display. @sdroege - this is a warning that always happens in the normal course of operation. It should probably be eliminated, no?
Update: Filed Bug 746933 - gldisplay: EAGL "Could not create display"
Well, there still seems to be one significant open issue that keeps video on iOS from being useful for video conferencing:
In short, the latency is still very bad, and this is not due to real-time or performance issues: CPU utilization is actually <100% (video only, no audio) and I am testing with @stefanalund's NativeDemo. I am not seeing any gst WARNs during calling, except for two right after initializing the call as we've seen earlier in https://github.com/EricssonResearch/openwebrtc/issues/236:
0:00:17.694641000 1004 0x15253638 WARN dtlssrtpdemux gstdtlssrtpdemux.c:137:sink_chain:<dtls-srtp-demux> received invalid buffer: 1
0:00:21.654848000 1004 0x15253638 WARN rtpjitterbuffer rtpjitterbuffer.c:449:calculate_skew: resync to time 0:00:07.190769000, rtptime 3:36:37.835966666
What's going on here? Maybe this is causing the latency? Maybe @sdroege knows more about this particular issue?
@superdump: Don't really want to go through the administrative hoops of submitting stuff upstream outside of Github right now, if you don't mind ;) Will get to that at a later point. As far as orc is concerned, 64-bit NEON support would be a high priority issue. Don't really have the arm64/AArch64 hardware, so I am hoping somebody else could look at this asap.
The DTLS invalid buffer is due to packets received before handshaking is complete I think.
The clock skew recalculation is probably just an initial skew calculation I would guess but not sure.
Can you check what resolution, framerate and bitrate you're using? Of course 5 second latency is crazy, hopefully it's something pretty obviously wrong. I wonder if it's just something like queuing up video until the DTLS handshake is complete and then sending data or something like that. Except that it never catches up.
@sdroege - is there a good way to get GST_DEBUG=6 logs from iOS devices with minimal performance impact? I.e. Writing to a file and copying it over somehow?
@superdump I got the following information from NativeDemo:
2015-03-28 17:46:08.188 NativeDemo[1004:509913] Video settings: {
bitrate = 1024000;
framerate = 25;
height = 480;
mode = VGA;
width = 640;
}
Video is encoded with VP8, clockrate 90000.
I've made those parameters configurable; go to Settings -> NativeDemo to modify them.
Would be interesting to see if a really low resolution (or frame rate) helps with the latency issue.
QCIF and QVGA are unusable and result in an oddly shaped black frame on the receiving end in Chrome.
CIF works and reduces the latency to 0.5 s or so initially, however latency gradually increases to several seconds and longer.
(Latency on the receiving end in iOS is initially great but also gradually increases to several seconds and more.)
As there is a dependence on resolution, it sounds like some thread is not able to complete the processing of all its elements in real time. For real-time processing, each thread must complete all processing for one frame within at most one frame duration. We can split the processing pipelines into more threads using queue elements. What we need to know now is how long each processing element spends on each buffer.
Can you make a log with the following environment variables and paste it into a gist then link it here?
GST_DEBUG=GST_PADS:6,GST_ELEMENT_PADS:6,GST_SCHEDULING:7 GST_DEBUG_NO_COLOR=1
The GST_*PADS stuff is to figure out how the elements in the pipelines are connected together. The GST_SCHEDULING part prints a message every time a buffer is passed to the next element's chain function or so, which allows you to see when a buffer enters and exits an element. With this one can figure out where the bottlenecks are, what needs optimizing and where queues may be needed.
Insertion of queues causes the chain before the queue and after it to run in separate threads. This does not in itself decrease latency but it does allow more throughput by parallelising portions of the processing.
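To illustrate, a minimal sketch (the insert_queue helper and element names are hypothetical, not OpenWebRTC code) of splitting two already-linked elements into separate streaming threads with a queue:
#include <gst/gst.h>
/* Illustrative only: "producer" and "consumer" are assumed to already be in the
 * bin and linked. After this, the chain upstream of the queue and the chain
 * downstream of it each run in their own streaming thread. */
static void
insert_queue (GstBin *bin, GstElement *producer, GstElement *consumer)
{
  GstElement *queue = gst_element_factory_make ("queue", NULL);
  gst_bin_add (bin, queue);
  gst_element_unlink (producer, consumer);
  gst_element_link_many (producer, queue, consumer, NULL);
  gst_element_sync_state_with_parent (queue);
}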
@superdump Unfortunately this makes the app unusable due to the sheer volume of log output.
Right, the rendering of the log output in Xcode is really slow. What normally works is to be able to write the log output to a file. Is there some writable path on iOS? I think you used to be able to do that. If so, you can also set GST_DEBUG_FILE to the path of the file you want the gst debug log to be written and then copy it over afterwards.
Use something like:
setenv("GST_DEBUG_FILE", [[NSTemporaryDirectory() stringByAppendingPathComponent:@"foo.txt"] cStringUsingEncoding:NSUTF8StringEncoding], 1);
Then retrieve (the whole app container) like this: http://stackoverflow.com/a/28161494/130066
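Putting both suggestions together, a hedged sketch of what could be set before GStreamer/OpenWebRTC is initialized (the log filename is arbitrary; the categories are the ones suggested above):
#import <Foundation/Foundation.h>
#include <stdlib.h>
// Illustrative only: must run before gst_init() so the variables are picked up.
setenv("GST_DEBUG", "GST_PADS:6,GST_ELEMENT_PADS:6,GST_SCHEDULING:7", 1);
setenv("GST_DEBUG_NO_COLOR", "1", 1);
setenv("GST_DEBUG_FILE",
       [[NSTemporaryDirectory() stringByAppendingPathComponent:@"gst.log"]
        cStringUsingEncoding:NSUTF8StringEncoding], 1);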
Good to know, thanks! @sdroege ^
@ijsf - maybe it's easier to dump pipeline graphs then to see how stuff is linked together. GST_DEBUG_DUMP_DOT_DIR should be set to the NSTemporaryDirectory
I'll use the opportunity to plug my Python thing for lldb: https://gist.github.com/ikonst/6874ff814ab7f2530b2a
Basically, I wrote it since after you setenv("GST_DEBUG_DUMP_DOT_DIR", [NSTemporaryDirectory() cStringUsingEncoding:NSUTF8StringEncoding], 1), you still need to retrieve the container each time from your iOS device, and that's really annoying. Instead, you just do gst_dot some_gst_object in your lldb prompt.
The Python code just retrieves it from the running process, saves the .dot locally and opens it in Graphviz. (As a bonus, for every GstObject it goes to the top of the hierarchy, i.e. to the bin.)
@ijsf - I'd like to see where your improvements take us. If you create a simple "self-view" pipeline, as done in the "Selfie" sample, make sure you're getting:
For reference, that's what you get on iPhone 5s with a simple iOS app (~20 lines of code, basically following the AVFoundation guide), capturing 1280x720 video and drawing each 32BGRA frame with Core Image onto an OpenGL layer.
Also for reference, an H264-loopback iOS app which captures, encodes, decodes (via VideoToolbox) and renders takes up ~24% CPU on iPhone 5s. Understandably my app has no networking, while an OWR loopback (i.e. the "test_send_receive" sample) has the additional overhead of networking (and also crypto and audio, if you choose to keep them enabled). Yet, the CPU usage I've been seeing is closer to 70%.
@ikonst I just tested your script and it seems to work well. I'm having a hard time, though, finding a place to break into the debugger and get a GstElement that is part of the pipeline (for NativeDemo). Do you have any suggestions on when to break and which GstElement to pick?
As far as performance goes: since a call at CIF resolution is also accumulating latency (note that this is even without the whole audio pipeline), I assume there's a serious issue in the gstreamer pipeline somewhere.
The top consumers regardless of thread, with CIF resolution on iOS according to Time Profiler during a two-way video-only call, are:
9.2% mach_msg_trap / mach_msg / io_connect_method in IOKit. <- this is iOS' OpenGL pipeline overhead
5.5% vp8_pack_tokens
3.3% _platform_memmove / gst_buffer_copy_into / gst_mini_object_make_writable / gst_inter_video_src_create <-- seems very wasteful
2.9% GetCoeffs / vp8_decode* functions
2.8% tokenize1st_order_b / vp8_tokenize_mb
2.3% vp8_sub_pixel_variance16x16_neon_func
2.1% vp8_fast_quantize_b_neon
2.0% vp8_pick_inter_mode
1.8% poll / g_poll / g_main_context_poll
1.8% vp8_sixtap_predict16x16_neon
1.6% aes_encrypt
And so forth, with many more VP8 functions accumulating to a significant amount of CPU time.
UPDATE:
Just analyzed a little bit more. Looking at thread-specific consumers: a large part is consumed by VP8 (~44% encoding + ~10% decoding), OpenGL overhead (~13%), crypto and polling (~9%), misc gstreamer stuff (>9%), orc (~4%), adding up to around 100% saturation by the entire pipeline.
I forced the Chrome end to 320x240 to make sure the decoding wasn't a bottleneck here. It still seems the encoding is taking an awful lot of time. I double-checked the frame resolution in gst_vp8_enc_handle_frame, which appears to be 352x288 @ 25 Hz as expected.
I just noticed that CONFIG_REALTIME_ONLY in libvpx is set to 0 instead of 1; this can't be good, can it? With it at 0, several functions that show up among the top time consumers get invoked.
See: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-reviews/GGUdn7giyvs
UPDATE:
Reduced encoder CPU time percentage from ~44% to ~32% by setting CONFIG_REALTIME_ONLY and CONFIG_ONTHEFLY_BITPACKING to 1 in the libvpx config.
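For reference, and as an assumption about the build setup rather than a verified cerbero change: these symbols correspond to libvpx's --enable-realtime-only and --enable-onthefly-bitpacking configure options, so after reconfiguring, the generated vpx_config.h should end up with:
/* Excerpt of the expected generated vpx_config.h */
#define CONFIG_REALTIME_ONLY 1
#define CONFIG_ONTHEFLY_BITPACKING 1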
@ikonst I just tested your script and it seems to work well. I'm having a hard time, though, finding a place to break into the debugger and get a GstElement that is part of the pipeline (for NativeDemo). Do you have any suggestions on when to break and which GstElement to pick?
Seeing as NativeDemo uses openwebrtc-ios-sdk, I'd inspect e.g. OwrTransportAgent *transport_agent and OwrVideoRenderer *renderer in OpenWebRTCNativeHandler.m.
Here's my cheatsheet:
// transport_agent is some OwrTransportAgent*
gst_dot transport_agent->priv->transport_bin
// source is some OwrMediaSource*
gst_dot source->priv->source_bin
// renderer is some OwrMediaRenderer*
gst_dot renderer->priv->pipeline
Possibly OpenWebRTC could be changed to make such introspection more convenient :)
Reduced encoder CPU time percentage from ~44% to ~32% by setting CONFIG_REALTIME_ONLY and CONFIG_ONTHEFLY_BITPACKING to 1 in the libvpx config.
!!!
Some optimization remarks:
1. gst_inter_video_src_create: non-writeable buffers. These must be writeable, as this is causing a significant and unnecessary performance drag on the pipeline (see the sketch after this list).
2. glFlush appears to be called more often than necessary.
3. There is a lot of copying going on, e.g. via gst_video_frame_copy_plane.
4. The large amount of time spent in mach_msg_trap may be unnecessary locking/thread-safety overhead.
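To spell out why point 1 hurts (generic GStreamer behaviour, not OpenWebRTC-specific code; the ensure_writable helper is hypothetical): making a buffer writable only avoids a copy when nothing else holds a reference to it, otherwise it degenerates into the gst_buffer_copy_into / memmove seen in the profile above:
#include <gst/gst.h>
/* Illustrative only: if "buf" is not writable (e.g. something else still holds a
 * reference), gst_buffer_make_writable() has to copy it via gst_buffer_copy_into(),
 * which is where the memmove in the profile comes from; if the buffer is already
 * writable it is returned unchanged and nothing is copied. */
static GstBuffer *
ensure_writable (GstBuffer *buf)
{
  return gst_buffer_make_writable (buf);
}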
Nice improvements!
I agree with point 1.
Point 2 may need glFlush to trigger rendering at close to the right time. I guess it ideally shouldn't be called more than necessary.
Point 3 needs more details - where are the copies happening? Is it just as a consequence to 1?
I think point 4 is wrong. mach_msg_trap seems to suggest that threads are waiting on each other. That is in line with what I said about at least one thread not being able to run in real time and latency increasing.
This is normally what you would see in a pipeline split into a few threads. If a thread fills up the queue at the end of its chain faster than the downstream thread consumes from it, it has to sit and wait for free space, i.e. it waits for the following thread to take a buffer from the queue.
Can you easily see which thread(s) are maxed out? Or the backtrace of that mach_msg_trap? That could provide clues before I can get hold of your debug log to analyze.
I think https://github.com/EricssonResearch/openwebrtc/issues/240 is the key for eradicating memcpies.
Some of the points ijsf is raising were actually brought up by me on IRC yesterday:
Point 2 may need glFlush to trigger rendering at close to the right time. I guess it ideally shouldn't be called more than necessary.
Actually, it's no more than a hunch. From Apple's [OpenGL ES Design Guidelines](https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/OpenGLESApplicationDesign/OpenGLESApplicationDesign.html):
On some desktop OpenGL implementations, it can be useful to periodically call the glFlush function to efficiently balance CPU and GPU work, but this is not the case in iOS.
I'm naturally cautious of the current GStreamer code being too desktop-oriented, e.g. where energy was never a concern, but I haven't yet proven for a fact that any glFlushes are redundant.
Generally we should strive for our "selfie" demo to have the same (or better) performance as a simple "selfie" iOS app that gets captureOutput:didOutputSampleBuffer:fromConnection: callbacks and draws using Core Image on the GL context.
Point 3 needs more details - where are the copies happening? Is it just as a consequence to 1?
I need to look into it. All I see is that gst_vtenc_encode_frame leads to gst_video_frame_copy. Probably need to build without -Os to see the missing functions.
I think point 4 is wrong. mach_msg_trap seems to suggest that threads are waiting on each other. That is in line with what I said about at least one thread not being able to run in real time and latency increasing.
This, too, is just a hunch for now. I was just wondering how much of GStreamer's thread-safety addresses a scenario that's not interesting to us. It's akin to using a thread-safe malloc in a single-threaded app, but incurring all the costs of the locking.
I need to look into it. All I see is that gst_vtenc_encode_frame leads to gst_video_frame_copy. Probably need to build without -Os to see the missing functions.
LOL, there were no missing functions. gst_vtenc_encode_frame calls gst_video_frame_copy directly.
Comment says:
/* FIXME: iOS has special stride requirements that we don't know yet.
* Copy into a newly allocated pixelbuffer for now. Probably makes
* sense to create a buffer pool around these at some point.
*/
So, before we give our raw image data to the H.264 encoder:
1. We could use CVPixelBufferCreateWithPlanarBytes to create a CVPixelBuffer out of an existing buffer. (Does not alloc or copy; see the sketch below.)
2. Instead, we call CVPixelBufferCreate (which mallocs -- rather cheap but not free) and then copy (expensive!) our image data into it.
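A hedged sketch of option 1 for NV12, wrapping existing planes in a CVPixelBuffer without copying (the WrapNV12Planes helper and its plane pointers/strides are placeholders for whatever the mapped frame provides; real code also needs a release callback to unmap/unref the source buffer):
#import <CoreVideo/CoreVideo.h>
// Illustrative only: no pixel data is allocated or copied here.
static CVPixelBufferRef WrapNV12Planes(size_t width, size_t height,
                                       void *yPlane, size_t yStride,
                                       void *uvPlane, size_t uvStride)
{
    void *bases[2]    = { yPlane, uvPlane };
    size_t widths[2]  = { width, width / 2 };
    size_t heights[2] = { height, height / 2 };
    size_t strides[2] = { yStride, uvStride };
    CVPixelBufferRef pixelBuffer = NULL;
    CVPixelBufferCreateWithPlanarBytes(kCFAllocatorDefault, width, height,
        kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange,
        NULL, 0,            // no single contiguous backing block
        2, bases, widths, heights, strides,
        NULL, NULL,         // release callback + refcon (omitted in this sketch)
        NULL, &pixelBuffer);
    return pixelBuffer;
}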
That comment sums it up then. Buffer pool for the win. :smile:
@superdump - the malloc is not what's expensive, it's the memcpy. The FIXME should be fixed.
@alessandrod - are you aware of the above?
@ikonst - it can be that a buffer pool avoids the need to copy, but the comment makes it sound like it's necessary. It sounds a bit odd though, and in our case of a two-way video chat, the incoming video is captured by the camera and one would hope it would be in the same format as can be consumed by the encoder.
@superdump - yes, currently GstAVFVideoSrc produces NV12. Since GstAVFVideoSrc doesn't seem to do anything funny (seeing avfvideosrc.m -- grep for gst_core_media_buffer_new), I would tend to agree with you.
That FIXME was authored by @sdroege , btw :)
I guess the use of glFlush could be explained by gstreamer dealing with multiple GL contexts and having to sync between them. I'm also seeing excessive calls to things such as glDeleteTextures though.
And indeed, so far we've had two FIXMEs with detrimental effects on (iOS) performance or stability.
For vtenc, a buffer pool that directly provides the buffers from vtenc upstream would solve the copy, yes. I just didn't do that back then because I just had to get it working and had too many other things on my list to worry about it for now :)
For libgstgl... it's not really "desktop oriented", and was never intended to be :) That said, there might be lots of places where performance can be improved. We just have to find them and see if it's actually safe to do in all circumstances.
@sdroege In vtenc, you claim that you copy because of a different stride requirement. I don't get how a buffer pool has anything to do with that.
Also, I wrote a basic iOS app shoving samples from AV capture into the VT encoder and it works just fine. It's clear that we're doing work here that we could refrain from doing.
(My last comment might've come off too aggressive. I didn't mean it to be. Anyway, I hope to look into that vtenc code in the next days.)
The comment might be a bit confusing. What I meant there is:
Can upstream software (and hardware) elements write into CVPixelBuffers without issue? If so and if I understood correctly, it would seem that a buffer pool of CVPixelBuffers would be most effective. Is that correct?
Depends on the elements. basetransform elements all will at least... vtdec and avfvideosrc probably can't, but for those we can in theory directly use the CVPixelBuffers we get from them inside vtenc.
... wraps around a CVPixelBufferPool and provides those to upstream elements.
The following code works great on iOS:
-(void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
CVPixelBufferRef pixelBuffer = (CVPixelBufferRef)CMSampleBufferGetImageBuffer(sampleBuffer);
CMTime pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
CMTime dur = CMTimeMake(1, 30);
VTCompressionSessionEncodeFrame(_compressionSession, pixelBuffer, pts, dur, NULL, NULL, NULL);
}
The capture session produces sample buffers that VT is happy to accept. If we do something to them in the process, it's our fault.
If we do something to them in the process that makes them no longer a CVPixelBuffer. :smile: I expect we can do efficient operations on CVPixelBuffers.
in theory directly use the CVPixelBuffers we get from them inside vtenc
This can be done by taking out the "meta" from the GstBuffer, right?
Supposedly doing this would also be more efficient because it'll spare a CVPixelBuffer allocation and construction.
So, as I understand, we're doing something wrong when reconstructing the CVPixelBuffer in vtenc. To fix it:
Summarizing multiple related problems into this ticket for now.
I have been testing the local video capturing in the iOS build of OpenWebRTC on an iPad Mini Gen 1 (armv7) using demo.openwebrtc.io:38080.
A-side: iOS 8.1.3, iPad Mini Gen 1
B-side: Chrome 41.0.2272.76 (64-bit) on OS X 10.10
Currently, there are some serious quality issues with the video that have to be solved in order for OpenWebRTC to be useful for video conferencing: