WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Compiler Package Manager for Web Assembly? #761

Closed mofosyne closed 7 years ago

mofosyne commented 8 years ago

I am sure this has been thought of, but I haven't found this idea in Google searches.

It would be interesting if the WebAssembly infrastructure or standard provided some mechanism for automatically dealing with compilers and interpreters for languages other than JavaScript.

Much like a package manager that automatically downloads dependencies on Linux, it may be useful to have an online repository of wasm binaries of interpreters and compilers for all the major languages.

This would allow users to insert source code other than JavaScript into their websites, for instance those who want to give the Dart programming language a shot. The major issue hampering experimentation with languages other than JavaScript is the lack of browser support for them.

WASM helps break JavaScript's hold over client-side scripting, but at a potential cost to the openness of the code. Hosting a common, popular compiler infrastructure manager would encourage users to insert normal source code other than JavaScript into their websites.

What would this mean in practice for the average user if the idea were implemented? It may mean that at the top of any HTML page you declare which language dependencies the page uses (C, Python, Ruby, JavaScript, Dart, etc.). This would prompt the browser to check a package manager for the latest compiler or interpreter that understands these languages.

I'm sure this is not the only way to make wasm more open for those who choose to be open.

Bablakeluke commented 8 years ago

Hey there! Just to make sure I'm following you correctly here, my guess is the usage would be like this:

<script type='text/python'>
.. python here ..
</script>

Where the browser then internally looks up the compiler for python, stored as a wasm file, from a package repo.

Whilst I think this is an interesting concept, the major problem is the centralization of that code and the security problems that it brings up. Who would host it, who would moderate it, what the submission criteria are, and who is allowed to mirror it would all affect the security aspects. Essentially, people won't want code from third parties being uncontrollably included in their site, so something like this may be more suitable:

<script src='https://python.org/webRuntime.wasm'></script>

<script type='text/python'>
..python here..
</script>

The developer is in control of the origin (I can host it myself if I wanted) and it only involves one extra line; there's no centralized involvement which is overall better for openness anyway, IMO.

However, I think a major thing we would want to avoid is having compilers on most web pages. Although it makes for a tight iteration cycle for a developer writing the code (just hit F5 to see changes), it results in a slower experience for site visitors, as they'll need to wait for the compiler to parse the code every time: the same problem that's hitting JS at the moment. So overall, using WASM in its current intended pattern seems like a better bet. That just leaves resolving the original problem: getting e.g. some readable Python out of a WASM file.

WASM does include a textual format which is intended to be human readable although it will never be able to directly match the readability of the original source. This had been a source of some really hot discussion earlier on; the original source could be included but the general consensus was that most developers would simply not do that. So there could be an optional WASM section which states the compiler that created it (and then as a result making it easier to reverse the WASM all the way to the source language) - currently WASM doesn't have a field like that (to the best of my knowledge) but it would be an easy extension. I think this would be better still than including compilers in the browser as it could still essentially result in semi-readable source.

For example:

myCompiledPython.wasm includes a WASM section like this:

Meta section
Origin: Python / 3.5.2

Other meta fields may be useful, like Company, Copyright, License, etc. (possibly as a URL, e.g. mysite.com/license.txt).

If you know what produced the wasm then you can also know how to best reverse it to make it readable again.

kripken commented 8 years ago

I think standard web technologies can already do this.

We have runtimes like lua.vm.js which can execute non-JS code in a script tag. The website could include a tiny amount of JS code that fetches the runtime if it isn't already in the cache (the cache could be IndexedDB or eventually use service workers, etc.). The result would be that the non-JS code in the script tag would just work.

The only thing WebAssembly can help with here is that startup performance might be better, compared to JS and asm.js. And that might in fact be one of the factors that have held back adoption of lua.vm.js and other compiled VMs. But my bet is that the bigger issues are JS and DOM integration, for example, the lack of weak refs, finalizers, etc. means you need to manually free certain links, so only JS is a first-class scripting language on the web currently.
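As a rough illustration of the "tiny amount of JS code" described above, the sketch below collects every script tag of a given non-JS type and hands its text to a runtime that is loaded only when needed. The names `collectSources`, `runForeignScripts`, `loadRuntime`, and `runtime.execute` are hypothetical and are not the lua.vm.js API; the caching policy is left to whatever `loadRuntime` does.

```javascript
// Hypothetical helper: return the source text of every <script> element
// whose type attribute matches a non-JS MIME type such as "text/lua".
function collectSources(scriptElements, mimeType) {
  return Array.from(scriptElements)
    .filter((el) => el.type === mimeType)
    .map((el) => el.textContent);
}

// Load the runtime (loadRuntime is assumed to check a cache before hitting
// the network), then run each matching script through it in document order.
async function runForeignScripts(scriptElements, mimeType, loadRuntime) {
  const sources = collectSources(scriptElements, mimeType);
  if (sources.length === 0) return []; // no foreign scripts: skip the download
  const runtime = await loadRuntime();
  return sources.map((source) => runtime.execute(source));
}
```

In a page this might be called as `runForeignScripts(document.querySelectorAll("script"), "text/lua", loadLuaRuntime)`, where `loadLuaRuntime` is whatever loader the site provides.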

mofosyne commented 8 years ago

You got the idea, @Bablakeluke. Yes, it's to allow non-JS code in script tags.

However, you could argue that the issue with not preloading your own choice of runtime, like Python, is that you have to download the runtime on the fly when using it.

For crappy mobile broadband like mine, that would suck. So maybe runtime inclusion could be the primary option, but a recommendation for browser developers to let users precache popular runtimes should perhaps be made as well. At least that won't require as many resources.

mofosyne commented 8 years ago

Hmmm... Is there anything we should consider about privileges for wasm alternative-script runtimes separately from normal code? E.g. access to the DOM, etc.

lukewagner commented 8 years ago

With a structured-cloneable Module, one should be able to amortize the language runtime download cost by caching the runtime's Module in one origin's IDB storage and using postMessage() to share it between origins. Future GC and WebIDL integration will also help.
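The amortization idea can be sketched as "compile once, instantiate many times". The snippet below is only a stand-in under stated assumptions: a `Map` plays the role of the IndexedDB store, the cross-origin `postMessage()` step is omitted entirely, and `getRuntimeModule`, `instantiateRuntime`, and `fetchBytes` are hypothetical names. `WebAssembly.compile` and `WebAssembly.instantiate` are the standard JavaScript API calls.

```javascript
// In-memory stand-in for one origin's IndexedDB store of compiled runtimes.
const moduleCache = new Map();

// Return a promise for a compiled WebAssembly.Module for `url`, compiling at
// most once. `fetchBytes` is a hypothetical callback that resolves to the
// .wasm bytes as a Uint8Array.
function getRuntimeModule(url, fetchBytes) {
  if (!moduleCache.has(url)) {
    // Cache the promise so concurrent callers share a single compilation.
    const compiled = fetchBytes(url).then((bytes) => WebAssembly.compile(bytes));
    moduleCache.set(url, compiled);
  }
  return moduleCache.get(url);
}

// Each page that needs the runtime gets its own Instance of the shared Module.
async function instantiateRuntime(url, fetchBytes, imports = {}) {
  const runtimeModule = await getRuntimeModule(url, fetchBytes);
  return WebAssembly.instantiate(runtimeModule, imports);
}
```

The download and compile cost is paid once per cached Module, while every caller still receives a separate Instance with its own state.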

creationix commented 8 years ago

I've had an idea similar to this for a while.

Yes it can be done today by manually including the runtime as a library. You can even get clever and cache it somewhere if you don't trust the browser's cache. The problem is this type of solution available today doesn't cross domains. If I visit 3 domains all using the same python runtime, I'll need 3 copies of it on my computer.

We don't want some centralized authority who controls what exactly "python" is. There are security and freedom concerns there.

My idea is to borrow some of the properties from content-addressable-storage systems like git and ipfs. My app can declare that it depends on a runtime by its immutable hash rather than some symbolic name. I can also include a link to at least one mirror I know of that hosts said resource if it's not in the cache already. Since the data is immutable (assuming a strong hash) I can be certain that the same hash downloaded by any other domain from any other host is the same to me since I know the content itself is identical.

This solves the security problems since the data is immutable. Think of the hash as a very high form of compression that happens to need outside help to decompress the data.

It solves the redundancy problems for data that is actually duplicate. Also this doesn't need to be limited to language runtimes for web assembly, it could be for popular JS libraries as well or any assets really that tend to be repeated across multiple websites.

It solves the freedom concerns because anybody can host anything and it's completely distributed. There is no central gatekeeper and the site author has 100% control over exactly what code is included on their page.
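A minimal sketch of the hash-addressed lookup described above, assuming SHA-256 as the strong hash (the comment does not name one). The cache is a plain `Map` standing in for a browser-wide store, and `sha256Hex`, `fetchByHash`, `sharedCache`, and the `download` callback are hypothetical names, not an existing browser feature.

```javascript
// Web Crypto is global in browsers and recent Node.js; older Node.js needs the module.
const { subtle } = globalThis.crypto ?? require("node:crypto").webcrypto;

// Hex-encoded SHA-256 digest of a Uint8Array.
async function sha256Hex(bytes) {
  const digest = await subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Stand-in for a cache keyed by content hash rather than by URL, so every
// site that names the same hash shares one copy.
const sharedCache = new Map();

// Resolve a resource by hash: use the shared cache if possible, otherwise try
// each mirror and accept only bytes whose digest matches the declared hash.
// `download` is a hypothetical callback: (mirrorUrl) => Promise<Uint8Array>.
async function fetchByHash(expectedHash, mirrors, download) {
  if (sharedCache.has(expectedHash)) return sharedCache.get(expectedHash);
  for (const mirror of mirrors) {
    const bytes = await download(mirror);
    if ((await sha256Hex(bytes)) === expectedHash) {
      sharedCache.set(expectedHash, bytes);
      return bytes;
    }
    // Digest mismatch: the mirror served something else, so try the next one.
  }
  throw new Error("no mirror served content matching " + expectedHash);
}
```

Because the key is the digest of the content, it does not matter which mirror or which site populated the cache first; a mirror that serves altered bytes is simply skipped.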

Bablakeluke commented 8 years ago

Hmm I don't see how that solves the cross domain problem - I would guess that the chances of any two sites that you've got open sharing exactly the same version would be quite unlikely, keeping in mind that every tiny bug fix would change the hash (or even simply rebuilding the WASM file - there's no guarantee of order), so the hashes they're using would very often be different. E.g. site 1:

<!-- Python version 3.5.2a4 -->
<engine tag="text/python" hash="a87234bcef532.." src="https://amirror.com/"/>

Site 2:

<!-- Python version 3.5.2a1 -->
<engine tag="text/python" hash="a8323ebca293.." src="https://amirror.com/"/>

Luke Wagner's suggestion above works much better for that purpose. The hash concept is interesting but digital signatures are much better at authenticity and they're built in to HTTPS so we're better off using that as much as possible.

For some clarity, here's roughly how Luke's suggestion would work:

Essentially the end result is multiple pages share the same runtime code with zero recompile cost (when possible). It's also authentic because it came from python.org over HTTPS. Site developers retain freedom because they can host any of that themselves; runtime developers retain their freedom because they don't need to upload their runtime to some specific group of servers. No extra effort is required (aside from structured-cloneable modules) because it largely just uses existing functionality :)

One way of improving on this is to change IndexedDB so that it allows read-only cross-domain access for particular entries; that would avoid instancing the iframe in the majority of cases. Alternatively, runtimeLoader.js caches the Module in IndexedDB on your page and hits that one first.

creationix commented 8 years ago

I'm not sure this is an either-or problem. The hash-based system would be simply a new primitive. You tell the browser you need resource X with an optional list of possible mirrors that host it.

A hash system alone isn't very useful; you need some way to resolve symbolic names to hashes. I was planning on doing this at build time so that, as the author, I have complete control. But it can also be done at run-time over some trusted channel (such as https://python.org/...) so that your site gets automatic bug-fix releases published from the upstream.

I just think the hash primitive would give us better options and can be used standalone if the lookup is done offline outside the browser.

> I would guess that the chances of any two sites that you've got open sharing exactly the same version would be quite unlikely.

I wish there was data for this. I'm curious, for example, how many websites have the exact same minified version of jquery or the other popular frameworks embedded in them. Maybe some of the CDN services can tell us?

Bablakeluke commented 8 years ago

> I wish there was data for this.

I agree. Out of interest I've just been searching for what little information there is; however, the stats aren't broken down as much as they'd need to be. Google hosts over 50 different versions of jQuery on its CDN, but publishes no stats on which of those is most common.

So, roughly, the chance that two jQuery-using sites both use the most popular subversion (1.11.3, ~13.8%) is about 1 in 60. The chance of them both also being on the same CDN (Google's, the most popular at ~25%) is around 1 in 500. Yep, unlikely!
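For reference, the estimate can be reproduced with a few lines of arithmetic. This is a sketch that assumes the two sites choose their jQuery version and their CDN independently and uses the shares quoted above (13.8% and 25%); the function name `bothPick` is illustrative.

```javascript
// Probability that two independently chosen sites both pick an option
// that has the given market share (a fraction between 0 and 1).
function bothPick(share) {
  return share * share;
}

const versionShare = 0.138; // most popular jQuery version, as quoted above
const cdnShare = 0.25;      // most popular CDN, as quoted above

const sameVersion = bothPick(versionShare);
const sameVersionAndCdn = sameVersion * bothPick(cdnShare);

console.log("same version: about 1 in " + Math.round(1 / sameVersion));
console.log("same version and CDN: about 1 in " + Math.round(1 / sameVersionAndCdn));
```

With these inputs the calculation gives about 1 in 53 and about 1 in 840, the same order of magnitude as the rounded figures above, so the conclusion that a shared cache hit is unlikely still holds.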

> But it can also be done at run-time over some trusted channel (such as https://python.org/...) so that your site gets automatic bug-fix releases published from the upstream.

At which point you're better off just pulling the runtime straight from that trusted channel. Both involve a request to that server and the overhead is basically no different.

The web also already uses a hashing system (ETag) to check if a cached file has changed. It just seems like a bit of fiddling around for both the runtime creator and the site devs, with no particular gains (and potentially lots of losses, i.e. if it almost never gets a cache hit).

However, an authenticity property is an interesting one. For example, with jQuery, everybody simply trusts that Google is actually serving up the real jQuery; there have been a bunch of instances where such trust has been misplaced. So WASM could include code signing: Python signs their WASM runtime, then CDNs host it. Any attempt to change the runtime (by the CDN or hackers) would require creating a new signature. Usage of that would be, e.g.:

<script src="https://code.googleapis.com/libs/pythonRT/1.0.0.wasm" signee="python.org"></script>

Where the browser only runs the wasm if it includes a valid signature from python.org
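The check a browser would perform could look roughly like the sketch below. It assumes ECDSA over P-256 with SHA-256 purely for illustration (the thread does not name a signature scheme), and `signRuntime`, `isRunnable`, and `trustedKeys` are hypothetical; how a browser would obtain the public key for a signee such as python.org is left open here.

```javascript
// Web Crypto is global in browsers and recent Node.js; older Node.js needs the module.
const { subtle } = globalThis.crypto ?? require("node:crypto").webcrypto;
const ECDSA = { name: "ECDSA", hash: "SHA-256" };

// Publisher side (e.g. the runtime author): sign the exact bytes of the .wasm file.
function signRuntime(privateKey, wasmBytes) {
  return subtle.sign(ECDSA, privateKey, wasmBytes);
}

// Browser side: allow the bytes to run only if the signature verifies against
// the public key expected for the declared signee.
// `trustedKeys` is a hypothetical Map from signee name to a public CryptoKey.
async function isRunnable(wasmBytes, signature, signee, trustedKeys) {
  const publicKey = trustedKeys.get(signee);
  if (!publicKey) return false; // unknown signee: refuse to run
  return subtle.verify(ECDSA, publicKey, signature, wasmBytes);
}
```

A CDN can serve the file and its signature unchanged, but any modification to the bytes makes verification fail unless the CDN also holds the publisher's private key.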

mofosyne commented 8 years ago

Should we separate the version code as well, with some way of specifying version flexibility (min/max version), e.g. version="V1.xx.xx"? For example, Python V2 and V3 were both supported, with security fixes applied to both.

<script src="https://code.googleapis.com/libs/pythonRT/1.0.0.wasm" signee="python.org" version="V1.xx.xx"></script>

But anyhow, I think I like the idea of including runtime code at the top; it's not a bad idea, but preventing security issues when loading external runtime code is pretty important too.
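One possible reading of the version="V1.xx.xx" attribute suggested above is a simple wildcard match, sketched below. The matching rules are an assumption made for illustration: an optional leading "V", dot-separated numeric parts, and "xx" matching any number in that position. The function name `matchesVersion` is hypothetical.

```javascript
// Return true if a concrete version such as "1.4.2" satisfies a pattern such
// as "V1.xx.xx", where "xx" matches any number in that position.
function matchesVersion(pattern, version) {
  // Drop an optional leading "V" or "v" and split into dot-separated parts.
  const parts = (s) => s.replace(/^[vV]/, "").split(".");
  const want = parts(pattern);
  const have = parts(version);
  if (want.length !== have.length) return false;
  return want.every((part, i) => part === "xx" || Number(part) === Number(have[i]));
}
```

Under these rules `matchesVersion("V1.xx.xx", "1.0.0")` accepts any 1.x.y release, while a 2.x.y runtime would need its own declaration.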

jfbastien commented 7 years ago

This hasn't seen any activity in a while. It seems people are interested; I'd suggest creating a new repo to prototype this work.