Open sdroege opened 9 years ago
Just a comment on current performance: I just tested the latest master of everything on iPad Mini 1, and performance is crawling Bowser to a near halt. Using Time Analyzer, the main bottlenecks seem to be the orc_* related functions involved with video (notably orc_executor_emulate). Not really surprising.
P.S. Another bottleneck seems to be Opus and all the audio analysis it's performing, but I believe that was mentioned in another issue already, as well as aes_encrypt taking some time.
I ran Bowser on an iPod Touch, and the top threads for CPU consumption were:
There are a number of issues to tackle here:
do you know if we're sending h264 in that case? I expect so... In which case maybe that isn't weird at all.
The memory copy in intervideosrc was surprising. I think zero copy and queue like behavior are really needed for those to have low latency and high performance/low power drain applications.
videoconvert is more prominent in Bowser because of the rendering mechanism which needs to convert both the self and remote views to BGRA to send as bitmaps to the WebView for rendering. In a real app using glimagesink or similar, those would not be visible. As such and in general, we should perhaps try to optimize the colour space of the video for encoders and rely on having fast colour space conversion in renderers.
As for audio, absolutely we should try to avoid conversions that are unnecessary. Noops for the win.
Correction: we're already on Opus 1.1, but it doesn't seem like we're actually using the appropriate optimisations. Will take a look at this.
Update: the recipe's already supposed to be setting things up to use intrinsics, so bumping investigating this down.
We're using vp8enc, which has 1.2% CPU usage. It's possible the frames just aren't reaching the encoder.
Or that they're perfect black frames?
intervideosrc and gst_buffer_make_writable() will only copy the underlying memory if it wasn't shareable, which is probably the problem here. We can't push non-writable buffers out there because we need to update the timestamps
So here's the full analysis.
Scenario: testing with demo.openwebrtc.io:38080, A-side: Bowser (master), B-side: Chrome (OS X 10.9.5), local LAN. Hardware: iPad Mini 1 (A5, armv7)
Remarks:
Time Analysis (running ~10 minutes, calling ~9 minutes):
(Excludes minor unavoidable functions involved with sockets and iOS libraries such as QuartzCore.)
Quite a few interesting bottlenecks apart from orc. What's also interesting is the fact that the video encoder itself doesn't really show up here, while audio and video conversion does seem to be a significant hit. Seems a little backwards.
http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/?id=08b1d27e14e28ed028d73de1e5f7ec2e9c0a0d71 this might fix the make_writable() copying in intervideosrc
- @ikonst is working on some stuff to correctly configure the source that should either eliminate or reduce (depending on what Core Audio decides to do) the audioresample stuff
- All of the audioconvert, videoconvert, orc stuff should be improved by fixing this issue
- Opus has improved optimisations for ARM and others on the way in their 1.1.1 release for which there is a beta. We updated to 1.1 recently I think.
- The two memmoves are due to the rendering method used in Bowser where we create a bitmap from the raw BGRA frame (_owr_image_renderer_pull_bmp_image) and then send it over to UIWebView for drawing (WebCore::SharedBuffer::createPurgeableBuffer)
I don't think there's so much left after that. :smile: Crypto pops up but maybe addressing https://github.com/EricssonResearch/openwebrtc/issues/199 will fix that
@ijsf you mentioned earlier that you were interested in helping out. How about looking at this issue (ORC)?
Sure, I'll try and have a look at this once my rebuild is done. I've forked the orc repository at https://github.com/ijsf/OpenWebRTC-orc
Currently working on adding a --static-implementation option in the above fork.
This fork is now capable of compiling static functions with inline asm code (and backup C functions for those that don't have an asm implementation). It's almost compilable.
The drawback is that I also had to clone gstreamer-common, the gstreamer and all gst* repositories, as common/orc.mak (which contains the arguments for orcc) is a submodule (gstreamer-common) in the gst* and gstreamer git repositories.
Furthermore, I found out there's also a "--target
So in short, the command-line needs to change from "... --implementation ..." to "... -target
I think there should be a configure parameter to select this: --enable-static-orc=targets where targets can be a list of targets.
Then orc.mak would do its thing depending on that.
This is done as far as I'm concerned.
@ijsf I'll keep it open until it lands and the above bugzilla ticket is closed. Should we create a separate issue about 64-bit ARM support? As far as I recall, that's a requirement for new App Store submissions. Well, we can support 64-but ARM with fallback C code but the performance is horrible.
I think 64-bit ARM support is also important, but I won't be able to do this as I do not have any arm64 devices available.
The move to 64-bit ARM probably won't require a huge amount of changes. The most important changes are described here: http://community.arm.com/docs/DOC-8453
We do and we can test and report back:
One of the most important changes is the reordering of the s/d/q registers between 32-bit and 64-bit. It's not really an option for me as I believe this will require a lot of back and forth testing/verification on arm64.
I was under the impression that ARM 64 bit instruction set is backwards compatible with 32 bit. Maybe that is not relevant here?
Correct, but as far as I know only in 32-bit mode. Check the above docs for more information.
This would be needed for using videoconvert and videoscale properly on iOS, as it's impossible to allocate executable memory on iOS nowadays for Apple reasons.
https://bugzilla.gnome.org/show_bug.cgi?id=742843 is about this issue too. Basically orcc can output NEON assembly at build time already, but there are some problems with getting that to run and also with outputting something for orc programs that contain unsupported opcodes for the specific target.
Ideally we would change orcc to output backup C code plus inline assembly (or in a separate file is ok too), and then at runtime it selects which one to use. As that is still possible on iOS.