EricssonResearch / openwebrtc-gst-plugins

OpenWebRTC specific GStreamer plugins
BSD 2-Clause "Simplified" License
51 stars 46 forks source link

Enable orc code generation at build-time for iOS #26

Open sdroege opened 9 years ago

sdroege commented 9 years ago

This would be needed for using videoconvert and videoscale properly on iOS, as it's impossible to allocate executable memory on iOS nowadays for Apple reasons.

https://bugzilla.gnome.org/show_bug.cgi?id=742843 is about this issue too. Basically orcc can output NEON assembly at build time already, but there are some problems with getting that to run and also with outputting something for orc programs that contain unsupported opcodes for the specific target.

Ideally we would change orcc to output backup C code plus inline assembly (or in a separate file is ok too), and then at runtime it selects which one to use. As that is still possible on iOS.

ijsf commented 9 years ago

Just a comment on current performance: I just tested the latest master of everything on iPad Mini 1, and performance is crawling Bowser to a near halt. Using Time Analyzer, the main bottlenecks seem to be the orc_* related functions involved with video (notably orc_executor_emulate). Not really surprising.

P.S. Another bottleneck seems to be Opus and all the audio analysis it's performing, but I believe that was mentioned in another issue already, as well as aes_encrypt taking some time.

ford-prefect commented 9 years ago

I ran Bowser on an iPod Touch, and the top threads for CPU consumption were:

  1. intervideosrc, 8.5%, most time spent in gst_buffer_make_writable performing a copy
  2. audioconvert, 8.3%
  3. opusenc, 8.2%
  4. videoconvert, 8.1%
  5. audioconvert, 6.1%
  6. nicesrc, 6.1%, 20% in debug code, reported in https://github.com/EricssonResearch/openwebrtc/issues/196)
  7. vp8dec, 5.8%
  8. videoconvert, 5.3%
  9. audioresample, 4.0%

There are a number of issues to tackle here:

  1. Merging some of @ikonst's work to decrease the amount of audio conversion, and possibly extending that to move the work to the RemoteIO AudioUnit might help
  2. The fact that the video encoder doesn't even turn up in the top 10 might need investigation
  3. If we get Orc to work, that should help perf on 4 of those *convert threads
  4. Opus git suggests that some NEON optimisations were added since 1.0.3 which we are using -- this might help as well
  5. I wonder if we can/should look at the possibility of pushing out non-writable buffers from intervideosrc (since we only modify metadata)
superdump commented 9 years ago

do you know if we're sending h264 in that case? I expect so... In which case maybe that isn't weird at all.

The memory copy in intervideosrc was surprising. I think zero copy and queue like behavior are really needed for those to have low latency and high performance/low power drain applications.

videoconvert is more prominent in Bowser because of the rendering mechanism which needs to convert both the self and remote views to BGRA to send as bitmaps to the WebView for rendering. In a real app using glimagesink or similar, those would not be visible. As such and in general, we should perhaps try to optimize the colour space of the video for encoders and rely on having fast colour space conversion in renderers.

As for audio, absolutely we should try to avoid conversions that are unnecessary. Noops for the win.

ford-prefect commented 9 years ago

Correction: we're already on Opus 1.1, but it doesn't seem like we're actually using the appropriate optimisations. Will take a look at this.

Update: the recipe's already supposed to be setting things up to use intrinsics, so bumping investigating this down.

ford-prefect commented 9 years ago

We're using vp8enc, which has 1.2% CPU usage. It's possible the frames just aren't reaching the encoder.

superdump commented 9 years ago

Or that they're perfect black frames?

sdroege commented 9 years ago

intervideosrc and gst_buffer_make_writable() will only copy the underlying memory if it wasn't shareable, which is probably the problem here. We can't push non-writable buffers out there because we need to update the timestamps

ijsf commented 9 years ago

So here's the full analysis.

Scenario: testing with demo.openwebrtc.io:38080, A-side: Bowser (master), B-side: Chrome (OS X 10.9.5), local LAN. Hardware: iPad Mini 1 (A5, armv7)

Remarks:

Time Analysis (running ~10 minutes, calling ~9 minutes):

(Excludes minor unavoidable functions involved with sockets and iOS libraries such as QuartzCore.)

Quite a few interesting bottlenecks apart from orc. What's also interesting is the fact that the video encoder itself doesn't really show up here, while audio and video conversion does seem to be a significant hit. Seems a little backwards.

superdump commented 9 years ago

http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/?id=08b1d27e14e28ed028d73de1e5f7ec2e9c0a0d71 this might fix the make_writable() copying in intervideosrc

  • @ikonst is working on some stuff to correctly configure the source that should either eliminate or reduce (depending on what Core Audio decides to do) the audioresample stuff
  • All of the audioconvert, videoconvert, orc stuff should be improved by fixing this issue
  • Opus has improved optimisations for ARM and others on the way in their 1.1.1 release for which there is a beta. We updated to 1.1 recently I think.
  • The two memmoves are due to the rendering method used in Bowser where we create a bitmap from the raw BGRA frame (_owr_image_renderer_pull_bmp_image) and then send it over to UIWebView for drawing (WebCore::SharedBuffer::createPurgeableBuffer)

I don't think there's so much left after that. :smile: Crypto pops up but maybe addressing https://github.com/EricssonResearch/openwebrtc/issues/199 will fix that

stefanalund commented 9 years ago

@ijsf you mentioned earlier that you were interested in helping out. How about looking at this issue (ORC)?

ijsf commented 9 years ago

Sure, I'll try and have a look at this once my rebuild is done. I've forked the orc repository at https://github.com/ijsf/OpenWebRTC-orc

ijsf commented 9 years ago

Currently working on adding a --static-implementation option in the above fork.

ijsf commented 9 years ago

This fork is now capable of compiling static functions with inline asm code (and backup C functions for those that don't have an asm implementation). It's almost compilable.

The drawback is that I also had to clone gstreamer-common, the gstreamer and all gst* repositories, as common/orc.mak (which contains the arguments for orcc) is a submodule (gstreamer-common) in the gst* and gstreamer git repositories.

Furthermore, I found out there's also a "--target " (neon for iOS) command-line argument missing there (just hardcoded to neon for now, needs to be fixed).

So in short, the command-line needs to change from "... --implementation ..." to "... -target --static-implementation". Could somebody give me some pointers on how to do this as easy as possible?

sdroege commented 9 years ago

I think there should be a configure parameter to select this: --enable-static-orc=targets where targets can be a list of targets.

Then orc.mak would do its thing depending on that.

ijsf commented 9 years ago

This is done as far as I'm concerned.

superdump commented 9 years ago

@ijsf I'll keep it open until it lands and the above bugzilla ticket is closed. Should we create a separate issue about 64-bit ARM support? As far as I recall, that's a requirement for new App Store submissions. Well, we can support 64-but ARM with fallback C code but the performance is horrible.

ijsf commented 9 years ago

I think 64-bit ARM support is also important, but I won't be able to do this as I do not have any arm64 devices available.

The move to 64-bit ARM probably won't require a huge amount of changes. The most important changes are described here: http://community.arm.com/docs/DOC-8453

superdump commented 9 years ago

We do and we can test and report back:

ijsf commented 9 years ago

One of the most important changes is the reordering of the s/d/q registers between 32-bit and 64-bit. It's not really an option for me as I believe this will require a lot of back and forth testing/verification on arm64.

stefanalund commented 9 years ago

I was under the impression that ARM 64 bit instruction set is backwards compatible with 32 bit. Maybe that is not relevant here?

ijsf commented 9 years ago

Correct, but as far as I know only in 32-bit mode. Check the above docs for more information.