codeforscience / webdata

Discussion on improving numeric computing/data science on the web (JavaScript, HTML5)

Native code (Bindings, WebAssembly) #6

Open max-mapper opened 9 years ago

max-mapper commented 9 years ago

An important part of the success of R and Python as data science tools is the ability to wrap native code (C/C++), so that memory- and CPU-critical algorithms can be written in C/C++ and avoid interpreter overhead.

For a great introduction to the Python data science use cases, check out this talk by Rob Story. It's a whirlwind tour of different tools designed to address different problems. Most of them rely on a C/C++ component under the hood, but all of them expose an easy-to-use Python API.

To use native code from a dynamic language you have to write bindings, which usually means writing C/C++ glue code that connects your language's native extension API to the library you want to use. These bindings have to be compiled before users can use them. Some maintainers build binaries when they publish new releases, so users don't have to compile anything; others rely on users to compile the bindings themselves before they can use them. I'm going to focus on the prebuilt binary case here, as it is the most user-friendly option.

In Python there is conda, which is designed specifically for distributing and installing prebuilt packages, including Python native bindings.

In Node there are node-pre-gyp and prebuild, both of which hook into npm install and try to download prebuilt binaries from a server the maintainer specifies, falling back to compiling from source if prebuilts aren't available.

To actually write bindings in Node there are a couple of third-party modules you can use to make the process easier: bindings and nan. One big advantage of using nan is that it gives you a compatibility layer in C++ that lives between Node and your code. nan focuses on backwards compatibility as much as possible, so when the Node.js C++ API makes breaking changes, nan will hopefully be able to absorb them without having to make breaking changes of its own. This means that when new versions of Node are released, if you used nan, you hopefully shouldn't have to rewrite any C++ code.

In practice, writing native bindings for Node is still quite low level, even with these helper utilities. For example, the module mknod, which exists to wrap the mknod syscall, is ~8 files and a couple hundred lines just to make this one line callable from JS: https://github.com/mafintosh/mknod/blob/master/mknod.cc#L16. Maintaining this module means recompiling it when new versions of nan are released, making any necessary code changes, and uploading fresh prebuilt binaries.
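For a sense of what the JS-facing half of such a binding usually looks like, here is a rough sketch (the addon name native_stats and the sumDoubles export are made up for illustration; the require('bindings')(...) call is the usual pattern):

```js
// Hypothetical JS entry point for a native addon (sketch only).
// `bindings` searches the usual build output locations (build/Release,
// prebuilt downloads, etc.) and loads the compiled .node file.
var binding = require('bindings')('native_stats')

// Thin wrapper so callers get argument checking and a plain JS API;
// `sumDoubles` would be implemented in C++ behind a nan-based glue layer.
module.exports = function sum (data) {
  if (!(data instanceof Float64Array)) {
    throw new TypeError('expected a Float64Array')
  }
  return binding.sumDoubles(data)
}
```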

WebAssembly

For browsers there is a proposal called WebAssembly (.wasm) that is trying to standardize a way to safely run compiled native code in a browser. This will hopefully bring browser JS apps the same advantage that native bindings bring to Node and Python -- the ability to drop down to a lower-level language for performance-critical use cases.
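To make that concrete, calling into a compiled .wasm module from browser JS would look roughly like this (a sketch only -- the stats.wasm file and its sum export are hypothetical, and the API details may still shift while the proposal is being standardized):

```js
// Sketch: fetch a compiled WebAssembly module and call one of its exports.
// Everything here is illustrative; the module and its exports are made up.
fetch('stats.wasm')
  .then(function (res) { return res.arrayBuffer() })
  .then(function (bytes) { return WebAssembly.instantiate(bytes, {}) })
  .then(function (result) {
    var sum = result.instance.exports.sum
    // The compiled code operates on numbers and linear memory, so from JS
    // calling it looks like calling any other function.
    console.log(sum(1, 2))
  })
```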

For an introduction to WebAssembly read the following:

For data scientists I think .wasm will be:

I should note that I am very much not an expert on the current state of WebAssembly, and haven't been following it too closely. If any of this is inaccurate, please clarify in the comments below.

Open questions

If you have comments, questions, or clarifications, or if I missed something important, please leave a comment below.

Planeshifter commented 8 years ago

In contrast to Python and R though, JavaScript itself can be blazingly fast, so from a performance standpoint I don't think the situations are comparable. Not being very familiar with Python, I can only share my experience with R in this regard: since R itself is horribly slow, for any performance-critical operations you want to move to C/C++, which nowadays is quite smooth thanks to the excellent work by Dirk Eddelbuettel on the Rcpp package.

The developers of the Julia language have published a table of performance benchmarks which shows how bleak the situation for R actually is (see here: http://julialang.org/). I recall there was quite a bit of backlash after it was released, as there are indeed faster ways to accomplish some of the benchmark tasks in R. But those criticisms were a bit misleading, as most of the faster approaches involve offloading work to C/C++. So the point stands: natively written R code can be pretty slow.

In contrast, JavaScript's performance is very, very good for a dynamically and weakly typed language, thanks to the investments browser vendors have made in their runtimes, starting with V8 in 2008.

So while there will be performance gains from moving code to C/C++, the main benefit I see is the ability to reuse legacy code bases. In fact, last summer I applied for a Google Summer of Code project with the goal of transpiling C/C++ libraries to JavaScript via emscripten, which would have been advised by Mikola Lysenko. It did not work out in the end because our umbrella organization, the jQuery Foundation, did not get enough slots to support such a side project, so I am not sure what could have been accomplished.

However, after writing more and more JavaScript modules in the domain of numerical computing, I am not so sure whether this road is desirable at all. If JavaScript is fast enough, why not rewrite the algorithms in JavaScript itself? There are at least two advantages that come to mind instantly:

All in all, I would say that while better facilities for interacting with C/C++ and Fortran code bases would be a big plus, they won't change the game overnight. This is also because, even with such functionality, it is a lot of work to write the bindings and a user-facing API, and there do not seem to be many people willing to make that effort. For example, on npm there are quite a few semi-abandoned projects trying to port C or C++ linear algebra libraries to JavaScript using native addons. None of them seems to be feature-complete, as far as I can tell.
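For what it's worth, the pure-JavaScript route I mentioned above is nothing exotic -- a numeric kernel written directly against typed arrays, something like this (a hypothetical sketch, not code from any existing package):

```js
// Sketch of a pure-JS numeric kernel: mean of a Float64Array.
// Modern JIT compilers handle this kind of tight, monomorphic loop well.
function mean (x) {
  var sum = 0
  for (var i = 0; i < x.length; i++) {
    sum += x[i]
  }
  return sum / x.length
}

// usage
var data = new Float64Array([1, 2, 3, 4])
console.log(mean(data)) // 2.5
```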

max-mapper commented 8 years ago

@Planeshifter I agree that the ideal outcome would be for JavaScript to add the missing features that enable people to write their algorithms in JavaScript. However, I would argue this depends on quite a few of the other issues in this repository being standardized and implemented.

Without increasing JS memory bandwidth + memory limits there will always be an advantage to dropping down to C to cram more data into memory etc. In fact, there will always be some advantage; the question I wonder about is whether JS can get close enough to C to make it not worth the time to invest in a C implementation (taking into account things like the cost of context switching between JS and C).

Another thing I'm curious about is the likelihood of these different approaches actually landing in JS runtimes, e.g. the relative complexity of the respective standards + implementation challenges. For example, the main reason we have Typed Arrays in JS is WebGL, i.e. a side effect of the games industry. A lot of WebAssembly work is being pushed forward by demos like the Unreal Engine running in a browser, so perhaps the WebAssembly route has a higher likelihood of getting implemented than all of the other individual JS language enhancements.

I don't believe that the JS standards process is a zero-sum game either -- I think we can have all of these things (or none of them); it really depends on whether there is a force driving each specific proposal forward (for example, the WebGL spec had good momentum and a champion, whereas currently nobody is championing the 64-bit Integers proposal, so it's impossible to estimate if or when it will ever get standardized or implemented).