Ensuring proper iOS audio/video quality for video conferencing

ijsf commented 9 years ago

Summarizing multiple related problems into this ticket for now.

I have been testing the local video capturing in the iOS build of OpenWebRTC on an iPad Mini Gen 1 (armv7) using demo.openwebrtc.io:38080.

A-side: iOS 8.1.3, iPad Mini Gen 1 B-side: Chrome 41.0.2272.76 (64-bit) on OS X 10.10

Currently, there are some serious quality issues with the video that have to be solved in order for OpenWebRTC to be useful for video conferencing:

A-side to B-side (video received on B-side): permanently green. Cause unknown.
B-side to A-side (video received on A-side): occasionally drops out to black/green, possibly to do with poor (real-time) performance in combination with RTP (RTP attempting to resync?). See https://github.com/EricssonResearch/openwebrtc/issues/244

sdroege commented 9 years ago

Well, it will also currently not work properly because we use intervideo* and didn't implement my plan yet :) But other than that, what you describe there would and should work.

ikonst commented 9 years ago

Oh, I've just noticed that we in fact first try to get a CVPixelBufferRef from frame->input_buffer, and only failing that we resort to reconstructing it.

Are we losing the meta due to the intervideo*?

  meta = gst_buffer_get_core_media_meta (frame->input_buffer);
  if (meta != NULL) {
    pbuf = gst_core_media_buffer_get_pixel_buffer (frame->input_buffer);
  }
#ifdef HAVE_IOS
  if (pbuf == NULL) {
    ...
    /* FIXME: iOS has special stride requirements that we don't know yet.
     * Copy into a newly allocated pixelbuffer for now. Probably makes
     * sense to create a buffer pool around these at some point.
     */

ijsf commented 9 years ago

Will probably submit my orc patch upstream tomorrow.

superdump commented 9 years ago

Cool!

ikonst commented 9 years ago

Summing up my "iOS intolerable overhead" research so far:

== avfvideosrc ==

7% saving by increasing BUFFER_QUEUE_SIZE (https://bugzilla.gnome.org/show_bug.cgi?id=747270)
2% saving more by eliminating the buffer queue completely and pushing buffers from the AV capture dispatch queue. The dispatch queue is a publisher-consumer queue already, and adding our own publisher-consumer to bring the buffers back to the "pipeline thread" certainly doesn't help. Maybe g_async_queue would be more efficient than NSMutableArray? Though I doubt it, I think it's the wrong question. Especially on mobile platforms we cannot afford ourselves anything in the critical path that wastes energy.

I've done this in a hacky way; as stormer said on #gstreamer, doing it right requires re-doing GBaseSrc, but maybe it's not that hard. (Only part of what GBaseSrc is important to live sources.)
about 1% saving by setting enable-last-sample=0 on the last element — iOS really likes getting its CVPixelBuffers back in a timely fashion, and the "last sample" feature was retaining the CVPixelBuffer past the return from the AV callback
Proposal – applemedia currently has optimization for downstream element capable of GL textures, where it refrains from mapping the CVPixelBuffer. I propose another optimization: awareness of downstream element capable of working with a CVPixelBuffer. Even with GstCoreVideoMemory, this will spare all Gst*Memory allocations.

Summing up:

iOS pure AVFoundation test project at 12% CPU
avfvideosrc ! fakesink down from 24% to 16% CPU

== vtenc_h264 ==

Proposal – eliminate the GAsyncQueue, push the buffers right from the VT callback. We're having way too many implicit queues in the pipeline. (Same difficulties as with the avfvideosrc change discussed above.)

stefanalund commented 9 years ago

Out of curiosity: are these % improvements for this specific component or the CPU usage reduction in the full app? In any case, this is much appreciated!

ikonst commented 9 years ago

@stefanalund - my "iOS intolerable overhead" project is not running OWR, but rather trying to single out specific pipeline subsets one at a time, and test:

the GStreamer pipeline
a "textbook" iOS project (e.g. using purely AVFoundation) doing the same thing

ijsf commented 9 years ago

@superdump orc patch has been submitted at https://bugzilla.gnome.org/show_bug.cgi?id=742843#c19

ikonst commented 9 years ago

In https://bugzilla.gnome.org/show_bug.cgi?id=747352 – I've restored the iOS Core Video drawing as OpenGL textures. Performance is:

15% for front-camera (720p)
35% for back-camera (720p as well -- why such a difference?)

Both measurements done with my setjmp/longjmp avfvideosrc :-) If anyone is curious about it: https://gist.github.com/ikonst/750b67ae984944971cb0

superdump commented 9 years ago

@ijsf - excellent work!

@ikonst - and how do those compare to your iOS SDK test apps? Maybe the platform does more processing for one camera than the other?

ikonst commented 9 years ago

@superdump - just tested -- on my iOS tesp app, the front and back camera render both at 15% CPU with the following code:

-(void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
    [connection setVideoOrientation:AVCaptureVideoOrientationPortrait];
    CVPixelBufferRef pixelBuffer = (CVPixelBufferRef)CMSampleBufferGetImageBuffer(sampleBuffer);
    CIImage *image = [CIImage imageWithCVPixelBuffer:pixelBuffer];
    [self->coreImageContext drawImage:image inRect:[image extent] fromRect:[image extent]];
}

My thoughts:

1) It's amazing how much easier this is in the pure no-extra-SDK approach. 2) We probably do something crappy in glimagesink. 3) Our glimagesink is still broken for video rotation and there's no end in sight. Moreover, notice how easy it was in iOS — [connection setVideoOrientation:AVCaptureVideoOrientationPortrait]; 4) Even that's not something you'd typically do, since iOS has AVCaptureVideoPreviewLayer to render camera preview through the composition engine at zero CPU cost to you.

superdump commented 9 years ago

That's an odd case then. One more for the todo list. Thanks for the info.

GStreamer is a cross-platform, very flexible framework. I don't think it is at all surprising that it takes a bit of work to be able to work both generically and effectively within GStreamer. It is really worth it though. Your efforts are helping a lot to get there. Thank you.

1) It's not much code in a simple GStreamer app either when the elements are optimised like AVFoundation is.

2) Probably.

3) It's not broken, there are other GL elements that can do that. It's more modular. You can plug in effective GL rotation even if you're not displaying.

4) See 3), I think.

ikonst commented 9 years ago

3) There's gltransformation but it requires Graphene so we don't build it (yet?). On iOS, though, I've just realized orientation at-capture-time is hardware-accelerated. Just submitted - https://bugzilla.gnome.org/show_bug.cgi?id=747378

4) Oh, AVCaptureVideoPreviewLayer is a different beast. It keeps working even when your pause with a debugger, since it's done on the OS level.

superdump commented 9 years ago

3) cool.

4) Do Apple expose that for others to use?

ikonst commented 9 years ago

4) I guess it's only as customizable as the docs say. Maybe you can apply transforms on the layer -- I haven't tested.

It's the way to go when you're writing a photo app, or a video preview. Probably a more energy-efficient path than having callbacks in your app. If you need to instagram-process your video before displaying, Apple is kind enough to expose Core Image (and GL textures).

superdump commented 9 years ago

If you can keep everything on the GPU, you can use a GL shader with the glshader element.

ikonst commented 9 years ago

Ah, nice. Wish I knew more about writing shaders, though I guess something as simple as rotation is as easy as exchanging coordinates in a texture-mapping shader.

Anyway, Apple has setVideoOrientation: and if Android has its own thing, I guess we'll be good?

Whoever's an expert on GL, though, have a look at the Xcode OpenGL ES instrument -- it shows some inefficiencies with the glimagesink.

ijsf commented 9 years ago

Yeah, like I mentioned in https://github.com/EricssonResearch/openwebrtc/issues/260#issuecomment-87888912 there are some inefficiencies there, but I don't expect these to be significant.

ikonst commented 9 years ago

Testing latency on iOS:

Latency	Pipeline
33ms	`avfvideosrc \| fakesink`
33ms	`avfvideosrc ! video/x-raw(memory:GLMemory),format=RGBA,width=1280,height=720,framerate=30/1 ! glimagesink enable-last-sample=0`
66ms to 100ms	`avfvideosrc ! vtenc_h264`
100ms to 130ms	... `! h264parse disable-passthrough=TRUE`
100ms to 130ms	... `! rtph264pay`
100ms to 130ms	... `! rtpbin.send_rtp_sink_0`

(And upon trying to insert " queue ! " in there, latency jumprockets.)

Anyway, I think all of this is dubious and GStreamer MUST get a base source class for elements that can work in a callback-driven fashion from another thread – without queues and definitely without setjmp/longjmps :) It's perhaps a different take on a "pull" element, since the dispatch thread is "pulling" from the element -- but then pushing it onwards. Anyway, maybe a model like this might work.

About the queues, this is not my whim nor an iOS weirdness. I think it makes a lot of sense when you take a mobile CPU (2 cores) with a mobile OS, probably oriented towards energy efficiency. We clearly have the OS already doing queueing for us (libdispatch), and we come in and add another queue (merely to migrate to another thread, not for parallelization or anything). I don't think we should optimize, uh, things like "getting properties from elements", but as for the data pipeline, it should give every drop of performance — especially since GStreamer is a framework. Wasting CPU time frivolously is an app developer's prerogative.

The queue-elimination will apply to:

avfvideosrc
vtenc_h264
vtdec
osxaudio (I'm sure OSX and iOS already have their ring buffers, and we could use the zero-copy too...)

Actually in vtenc_h264, we have an async queue that does something even stranger. It will wait for the next frame most of the time, since it'll check the queue right after submitting the frame for compression.

Also, we're losing on hardware acceleration:

avfvideosrc would not GLMemory to glimagesink (and Matthew will not accept my patches) \ ... and even then, glimagesink CPU usage is higher and more erratic than a simple Core Image draw (in a boilerplate iOS app)
vtenc_h264 would only accept GLMemory with NV12 (why?) ... and even that if GLMemory is not broken, which it was on avfvideosrc, probably for months now...

Anyway, that's just my 2c.

superdump commented 9 years ago

@sdroege @alessandrod

superdump commented 9 years ago

How did you measure the latencies?

stefanalund commented 9 years ago

This bug surely affects overall performance https://github.com/EricssonResearch/openwebrtc/issues/96. Quite noticeable when running NativeDemo between two iPhones. In good lighting conditions the video runs pretty smooth, but as soon as the lighting is worse the video takes a big hit, including significant CPU increase.

ikonst commented 9 years ago

By periodically sending query_latency to the pipeline.

superdump commented 9 years ago

@ikonst - so that means that avfvideosrc reports 33ms latency (1 frame at 30fps I guess), vtenc_h264 reports 1-2 frames of latency (33-66ms), h264parse reports 1 frame of latency. The latencies are likely reported based on constraints of how the elements work rather than the actual amount of time they spend processing a buffer.

Also, which latency value is that from the query? They have specific meanings - http://cgit.freedesktop.org/gstreamer/gstreamer/tree/docs/design/part-latency.txt#n223 . The min-latency dictates the amount of latency used during synchronisation - http://cgit.freedesktop.org/gstreamer/gstreamer/tree/docs/design/part-latency.txt#n308 .

Now, note that sync is not enabled for nicesink as you want to send the data out to the network as soon as it is ready. Therefore the reported latencies do not really matter in the encoding chain as there is no latency compensation when sync=FALSE on the sink.

ijsf commented 9 years ago

@stefanalund I seriously doubt #96 has anything to do with performance. This bug has been fixed for a while in my fork as you can see in #243. The problem there is the min/max video capture framerate.

@ikonst As far as patches not being accepted upstream.. do you at least have the patches in your fork?

ikonst commented 9 years ago

@superdump Oh, I see now. So my methodology was flawed. I was actually wondering how come the latency numbers are so perfectly rounded :) Is there a way to ask how much time, on average, a buffer spends between source and sink?

Now, note that sync is not enabled for nicesink as you want to send the data out to the network as soon as it is ready.

I tried udpsink sync=0, and now video is nearly like Facetime, but audio is out of sync (assuming glimagesink and osxaudiosink are also sync=0 on the Mac side).

@ijsf I have my on-disk repository, and I upload git am-applicable patches to Bugzilla when I feel they're ready. But tbh, my incentive is to improve GStreamer.

superdump commented 9 years ago

You can see in a GST_DEBUG log of GST_SCHEDULING:7 when a buffer is passed from one element to the next. As far as I know, that is the only way to follow buffers through a pipeline. Note that messages start from the beginning of a thread with calling the next element's chain function and then when they get to the end of the thread they return saying that the chain function was called. The thread context pointer (right?) is also printed as part of each log message. You can look at the buffer pointers and timestamps and take differences to see how long a buffer spent in a specific element. There's some improved tracing work going on in GStreamer that will help this kind of end to end performance measurement. One can always write a script to parse the log, measure the differences and create graphs though.

With regard to sync, that's why RTP packets have timestamps and RTCP sender reports have a reference time for an RTP packet timestamp. These are used to keep RTP streams in sync. The evaluation of the behaviours of the remote clock according to the RTP stream and the local clock for a GStreamer receiver is done in rtp jitterbuffer.

About the patches that are not acceptable upstream - did you get any suggestions for how to fix the issues you're observing in The Right Way, given that your patches were rejected?

ijsf commented 9 years ago

@ikonst If you'd use a local fork as "playground" as I am doing we could turn this into a collaborative effort before things get pushed upstream where needed.

ijsf commented 9 years ago

Just noticed the use of libvpx deadlines in gstvp8enc.c and gstvp8dec.c, and the availability of the VPX_DL_REALTIME deadline setting in libvpx, which I believe is not used in OpenWebRTC.

VPX_DL_REALTIME sets the encoding + decoding deadlines to the minimum possible (1), this might help with performance issues. Will test now by hardcoding the deadlines to this value.

ijsf commented 9 years ago

EDIT:

I can see "deadline" is actually being set at https://github.com/EricssonResearch/openwebrtc/blob/master/transport/owr_payload.c#L397 for the encoder. This is correct.

The "deadline" at the decoder side is actually set at 0 and should be fixed to 1 instead.

ijsf commented 9 years ago

Appropriate patches: https://github.com/ijsf/OpenWebRTC-gst-plugins-good/commit/8c054bc7f5f8344ef171d58eca27d46dab824150, https://github.com/ijsf/openwebrtc/commit/534a370771619345e43f273bc755bbfac130a0f5 and https://bugzilla.gnome.org/show_bug.cgi?id=747534

Also found vp8 encoder discrepancies between WebRTC reference and those used in OpenWebRTC/gstreamer. I have corrected these here: https://github.com/ijsf/openwebrtc/commit/4976952f24a4ac41dd4a4f0728b06be3d9990140

Observations:

Performance (received video on iOS and received video on Chrome) seems very good.
Delay received video on iOS is small and no longer an issue.
Delay received video on Chrome is about 5 seconds (from the beginning on) and doesn't seem to be increasing.

As far as (3) goes, I'm still seeing warnings once at the beginning of the video call:

0:00:20.999593000   650  0x2036638 WARN           dtlssrtpdemux gstdtlssrtpdemux.c:137:sink_chain:<dtls-srtp-demux> received invalid buffer: 1
0:00:22.010325000   650  0x2036638 WARN           dtlssrtpdemux gstdtlssrtpdemux.c:137:sink_chain:<dtls-srtp-demux> received invalid buffer: 1
0:00:24.476797000   650  0x2036638 WARN         rtpjitterbuffer rtpjitterbuffer.c:449:calculate_skew: resync to time 0:00:07.192839000, rtptime 10:25:22.323788888

ijsf commented 9 years ago

Reduced OPUS load down even further by enabling the following options on opusenc:

audio-type: voice
complexity: 5 (webrtc.org default for mobile, opusenc defaults to 10)

See https://github.com/ijsf/openwebrtc/commit/b92556ab979ad80a70e9cd484a2a509e9a7d9e19 and https://github.com/ijsf/openwebrtc/commit/798151ec722e4b1072453f364aa2b9748cb91224.

Following improvements are still priority:

intervideo copy inefficiency: https://github.com/EricssonResearch/openwebrtc/issues/240
GL overhead

superdump commented 9 years ago

@ijsf how much difference does voice make? I'm wondering if it significantly reduces quality because it only enables the speech coding portion of Opus instead of also the transform portion. That should be very visible according to the bitrate I think. If so, we shouldn't use that unless it's for a voice only application. We can change the complexity though - is that a property?

superdump commented 9 years ago

Yes, it is a property. You can make a PR for the complexity part and I'll land it.

ijsf commented 9 years ago

@superdump Still waiting for the currently open pull request for VP8 to fall through.

I think the voice type is especially beneficial for cases where the audio quality is cranked down (e.g. for performance reasons). I've also seen celt/fft parts taking up a portion of the performance in the profiler, so assumed that this was a Good Thing to do. I agree though that this may be limiting for some WebRTC applications.

stefanalund commented 9 years ago

I would go for the "lower CPU load / lower quality" option at this point.

EricssonResearch / openwebrtc

Ensuring proper iOS audio/video quality for video conferencing #260