asm-js / validator

A reference validator for asm.js.
Apache License 2.0

String Type #46

Open espadrine opened 11 years ago

espadrine commented 11 years ago

Could we consider the inclusion of string types? It may be an immutable subtype of extern.

String type marker:

arg = ""+arg;

It would have the following expressions:

This is motivated by the idea that converting an external string into a typed array before passing it in an asm.js exported function is probably not the most efficient approach. Yet, in my experience, string parsing in JS can be boosted a lot.
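The proposed marker follows the pattern of the coercions asm.js already uses to annotate parameter types. A minimal sketch in plain JavaScript; the function name `annotate` is made up for illustration, and the string line is only the marker proposed in this issue, not valid asm.js:

```javascript
// Plain JavaScript sketch. The `"" + s` line is the marker proposed in this
// issue and is NOT accepted by the asm.js validator.
function annotate(i, d, s) {
  i = i | 0;    // existing asm.js marker: parameter is an int
  d = +d;       // existing asm.js marker: parameter is a double
  s = "" + s;   // proposed marker: parameter is a string (hypothetical)
  return { i: i, d: d, s: s };
}

var annotated = annotate(7.9, "1.5", 42);
console.log(annotated.i, annotated.d, annotated.s);
```

Each coercion forces the value to a known primitive type on entry, which is what lets a validator assign a static type to the parameter.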

michaelficarra commented 11 years ago

  1. Why is the length member signed? How can we create a negative-length string?
  2. How do we guarantee charCodeAt resolves to the original String.prototype.charCodeAt?

espadrine commented 11 years ago

Why is the length member signed? How can we create a negative-length string?

Thanks! Fixed.

How do we guarantee charCodeAt resolves to the original String.prototype.charCodeAt?

Similarly to the check for a bogus global, if String.prototype.charCodeAt is altered, it would default to interpreting the asm.js module, instead of using the compiled version.

michaelficarra commented 11 years ago

@espadrine: How do you statically check that String.prototype.charCodeAt is altered?

espadrine commented 11 years ago

@michaelficarra How do you statically check that window.Math.sqrt is altered?

Those are part of the runtime checks at linking time.

michaelficarra commented 11 years ago

So you'd like to add it to this list?

espadrine commented 11 years ago

So you'd like to add it to this list?

No. It isn't meant to be a function call in the standard library. However, it would add a field in that list.

kripken commented 11 years ago

The problem is that the prototype can change after linking. We avoid that with standard library stuff by saving them in the asm closure. But if we call "string".charCodeAt later on, the String prototype might have been changed in the meantime.
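The difference between a captured standard library function and a method looked up at call time can be shown in plain JavaScript. This is only a sketch: `Module`, `root` and `firstCode` are hypothetical names, and the code is not a validated asm.js module.

```javascript
function Module(stdlib) {
  // Captured once when the module is linked; later changes to Math.sqrt
  // are not visible through this local binding.
  var sqrt = stdlib.Math.sqrt;

  function root(x) {
    x = +x;
    return +sqrt(x);
  }

  function firstCode(s) {
    // Looked up on String.prototype at every call, so later changes ARE seen.
    return s.charCodeAt(0);
  }

  return { root: root, firstCode: firstCode };
}

var linked = Module({ Math: Math }); // stand-in for passing the global object

var originalSqrt = Math.sqrt;
var originalCharCodeAt = String.prototype.charCodeAt;
var rootAfterPatch, codeAfterPatch;
try {
  // Simulate code that tampers with the builtins after linking.
  Math.sqrt = function () { return -1; };
  String.prototype.charCodeAt = function () { return -1; };
  rootAfterPatch = linked.root(9);         // still the original sqrt
  codeAfterPatch = linked.firstCode("A");  // sees the patched method
} finally {
  Math.sqrt = originalSqrt;
  String.prototype.charCodeAt = originalCharCodeAt;
}

console.log(rootAfterPatch, codeAfterPatch);
```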

This isn't the only challenge here - adding this means support in the + operator, presumably. And also it means we can have GC'd objects in asm.js.

None of which is impossible, but the question is the motivation. If it's just efficient string processing, we should measure that first.

Regarding string efficiency, there is an idea to do a StringView for typed arrays. Basically a typed array is a view into an ArrayBuffer, and a StringView would view the same buffer but present it as string data (C-style null-terminated). If string performance is a concern, this might be worth investigating too.
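To make the "C-style null-terminated" idea concrete, here is a rough sketch of reading such a string out of a typed array. The helper `readCString` is hypothetical and is not part of any StringView API; it also assumes one byte per character (ASCII).

```javascript
// Hypothetical helper: reads a NUL-terminated ASCII string that starts at
// byte offset `ptr` in a Uint8Array view of the heap.
function readCString(heap, ptr) {
  var out = "";
  for (var i = ptr; i < heap.length && heap[i] !== 0; i++) {
    out += String.fromCharCode(heap[i]);
  }
  return out;
}

var heapBuffer = new ArrayBuffer(16);       // zero-filled by default
var heap8 = new Uint8Array(heapBuffer);
var greeting = "hi!";
for (var k = 0; k < greeting.length; k++) {
  heap8[4 + k] = greeting.charCodeAt(k);    // write at offset 4
}
// heap8[7] is still 0 and acts as the terminator.

console.log(readCString(heap8, 4));
```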

espadrine commented 11 years ago

The problem is that the prototype can change after linking. We avoid that with standard library stuff by saving them in the asm closure.

Hmm, I see. Can we add it to the standard library, then, like @michaelficarra suggested?

The call can look like stdlib.String.prototype.charCodeAt.call(str, index).
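A sketch of what that could look like, in plain JavaScript. This is an assumption about a possible extension: `String.prototype.charCodeAt` is not part of the asm.js standard library, and `StringModule` and `sumCodes` are hypothetical names.

```javascript
function StringModule(stdlib) {
  // Hypothetical: capture the method at link time, like Math functions are.
  var charCodeAt = stdlib.String.prototype.charCodeAt;

  function sumCodes(str) {
    str = "" + str;                 // proposed string marker from this issue
    var i = 0, total = 0;
    for (i = 0; (i | 0) < (str.length | 0); i = (i + 1) | 0) {
      // Call the captured function, not whatever the prototype holds now.
      total = (total + (charCodeAt.call(str, i) | 0)) | 0;
    }
    return total | 0;
  }

  return { sumCodes: sumCodes };
}

var stringModule = StringModule({ String: String });
console.log(stringModule.sumCodes("AB"));
```

Note that `.call` itself is resolved through `Function.prototype` at call time, so a real design would still need some guarantee about that lookup.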

This isn't the only challenge here - adding this means support in the + operator, presumably.

I would view such a construct as immutable. The length of the string doesn't change, its content doesn't either. Heavy-lifting parsing operations on huge strings usually don't involve string concatenation.

I am not sure how I feel about StringView for two reasons:

  1. It doesn't exist yet,
  2. We can already easily convert a string into a Uint16Array(str.length) and pass it to asm.js code. However, this conversion cannot be optimized. Flattening a normal JS string into an efficient form intuitively sounds like it can be optimized to be faster.
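For reference, the conversion mentioned in the second point can be sketched as follows. The helper names `toCodeUnits` and `countUnit` are made up for this example, and `countUnit` is only written in an asm.js-like style; it is not a validated module.

```javascript
// Copy the UTF-16 code units of a string into a Uint16Array.
// This is the explicit O(n) copy the comment above refers to.
function toCodeUnits(str) {
  var units = new Uint16Array(str.length);
  for (var i = 0; i < str.length; i++) {
    units[i] = str.charCodeAt(i);
  }
  return units;
}

// asm.js-style consumer: counts how often `target` occurs in units[0..length).
function countUnit(units, length, target) {
  length = length | 0;
  target = target | 0;
  var i = 0, n = 0;
  for (i = 0; (i | 0) < (length | 0); i = (i + 1) | 0) {
    if ((units[i] | 0) == (target | 0)) n = (n + 1) | 0;
  }
  return n | 0;
}

var csv = "a,b,c";
var csvUnits = toCodeUnits(csv);
console.log(countUnit(csvUnits, csvUnits.length, ",".charCodeAt(0)));
```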

That said, built-in support for a wider collection of string operations than simply reading a character at a given index could be nice. I'm just really not sure we can be faster than normal JS there.

ScatteredRay commented 11 years ago

We can already easily convert a string into a Uint16Array(str.length) and pass it to asm.js code. However, this conversion cannot be optimized. Flattening a normal JS string into an efficient form intuitively sounds like it can be optimized to be faster.

Actually, let's think about this for a bit. Is it possible to optimize specific instances of Uint16Array conversion (and back) so that they work efficiently, without an actual conversion? I mean, intArray[i] looks awfully similar to str.charCodeAt(i).
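The similarity can be checked directly: after a copy, every index of the array holds the same value that charCodeAt returns, and the array converts back to the original string. A small sketch with hypothetical helper names:

```javascript
// Build the Uint16Array copy with an index loop. (Iterating the string with
// Array.from would walk code points, not code units, and break surrogates.)
function copyUnits(str) {
  var units = new Uint16Array(str.length);
  for (var i = 0; i < str.length; i++) units[i] = str.charCodeAt(i);
  return units;
}

// True when intArray[i] === str.charCodeAt(i) for every index.
function sameCodeUnits(str, intArray) {
  if (str.length !== intArray.length) return false;
  for (var i = 0; i < str.length; i++) {
    if (intArray[i] !== str.charCodeAt(i)) return false;
  }
  return true;
}

// Convert the code units back into a JS string.
function fromCodeUnits(intArray) {
  var out = "";
  for (var i = 0; i < intArray.length; i++) {
    out += String.fromCharCode(intArray[i]);
  }
  return out;
}

var roundTripInput = "h\u00E9llo \uD83D\uDE00";
var roundTripUnits = copyUnits(roundTripInput);
console.log(sameCodeUnits(roundTripInput, roundTripUnits));
console.log(fromCodeUnits(roundTripUnits) === roundTripInput);
```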

timmutton commented 11 years ago

I believe introducing string support would be beneficial, at least for lljs. As it stands, even a basic "hello, world" fails to validate using James Long's lljs fork. Furthermore, any WebGL code written in lljs would also fail to validate, because the shaders require strings.

If there were support for fixed-length strings, and possibly support in stdlib for string generics (which could be done as a shim in other browsers, like Math.imul), that would cover a lot of basic use cases.
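For context, this is roughly how the Math.imul shim approach works: detect the native function and fall back to a plain JavaScript implementation. The names `imulShim` and `imul` are chosen for this sketch; the arithmetic follows the widely used polyfill that splits each operand into 16-bit halves.

```javascript
// 32-bit integer multiplication without relying on Math.imul.
function imulShim(a, b) {
  var ah = (a >>> 16) & 0xffff;
  var al = a & 0xffff;
  var bh = (b >>> 16) & 0xffff;
  var bl = b & 0xffff;
  // The high*high product only affects bits above 32, so it is dropped.
  // Shifting by 16 and `>>> 0` keep the cross terms in unsigned 32-bit range.
  return (al * bl + (((ah * bl + al * bh) << 16) >>> 0)) | 0;
}

// Feature detection: use the native function when the engine provides it.
var imul = typeof Math.imul === "function" ? Math.imul : imulShim;

console.log(imulShim(2, 4), imulShim(-1, 8));
```

A string generic could be shimmed with the same detect-then-fall-back pattern.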

jlongster commented 11 years ago

Tim, while string support would be nice, you don't need it to use WebGL. You can load your shaders in js land and only do the computationally expensive stuff in asm.js. My cloth demo uses WebGL (http://jlongster.com/s/lljs-cloth/), you can see the whole program here: https://github.com/jlongster/lljs-cloth/blob/master/verlet.ljs

timmutton commented 11 years ago

You're completely right. I would like to be able to do a whole app in LLJS, but that would require lljs/asm supporting strings, or possibly being able to mark whether an lljs function/struct should use asm or not (which would likely introduce a whole host of other complications). For the time being your solution works very well, though.

cscott commented 11 years ago

Note that strings are not usually implemented as a flat array of u16 under the hood. In order to support efficient string append, a linked data structure (such as "ropes") is usually used. So it's not necessarily straightforward to provide a view of the UTF16 data backing a string.
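A toy rope shows why indexing such a string is not a plain array access. This is purely illustrative: the node layout and the function names are invented here, and real engines use their own internal representations.

```javascript
// Toy rope: a leaf holds text, a concat node points at two children.
function leaf(text) {
  return { text: text, length: text.length };
}
function concat(left, right) {
  // Appending is O(1): no characters are copied.
  return { left: left, right: right, length: left.length + right.length };
}

// Reading one code unit requires walking down the tree.
function ropeCharCodeAt(node, index) {
  while (node.text === undefined) {
    if (index < node.left.length) {
      node = node.left;
    } else {
      index -= node.left.length;
      node = node.right;
    }
  }
  return node.text.charCodeAt(index);
}

// Producing a flat buffer means copying every leaf.
function flatten(node) {
  return node.text !== undefined
    ? node.text
    : flatten(node.left) + flatten(node.right);
}

var rope = concat(concat(leaf("asm"), leaf(".")), leaf("js"));
console.log(flatten(rope), ropeCharCodeAt(rope, 4));
```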

martingala commented 11 years ago

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays/StringView

timmutton commented 11 years ago

Awesome, that looks really good

martingala commented 11 years ago

@timmutton

Awesome, that looks really good

Thank you ;) (I'm User:fusionchess, the author of that library... StringView is in alpha test for now!!)

martingala commented 11 years ago

you can help me to find bugs ;)

timmutton commented 11 years ago

haha yeah I'd be more than happy to do so, does that mean that it's in nightly now?

martingala commented 11 years ago

does that mean that it's in nightly now?

Of course! I completed it on June 6... ;)

EDIT: I fixed a bug just now!

timmutton commented 11 years ago

Fantastic! I'll give it a crack after work

timmutton commented 11 years ago

Sorry, when you say it's in nightly, do you mean I need to copy stringview.js from the page you linked and then use it, or do you mean I can just call a StringView from my code? The reason I ask is that I've updated nightly and I'm getting a reference error.

martingala commented 11 years ago

You need to copy stringview.js from the page I linked...! (sorry)

P.S. Look at the revision... when I change something I update the revision number...:

StringView - Mozilla Developer Network - revision #3

Bye :)

timmutton commented 11 years ago

Fantastic. Been playing with it for a little, looks like it has potential. Will have to wait until it works with asm to know for sure (or if it does, an example would be great because I can't get it to validate)

martingala commented 11 years ago

Great :) I think I will change the method StringView.prototype.makeIndex() soon ( https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays/StringView#StringView.prototype.makeIndex%28%29 ): 1) renaming it to "StringView.prototype.getLength()" 2) or changing its return value to an index from zero rather than a raw length from skipOffsetIndex... Send your suggestions if you have any ;)

Good hacks ;)

P.S. I also don't much like the name of the property "stringView.bufferView". Do you have any suggestions for an alternative to "bufferView"? :)

cscott commented 11 years ago

Quick review:

  a) This discussion is veering off-topic for asm.js; you should open a bugzilla for StringView first, and then once consensus on that has been reached, you can raise the issue of StringView in asm.js. That said, there's no apparent issue with invoking StringView methods as foreign functions and/or using asm.js to access the backing storage directly.
  b) You use the word 'characters' often in your documentation, which is rather misleading. You should try to make it very clear when you are referring to elements of the backing array (whether Uint8, Uint16, Uint32, etc) and when you mean codepoints (a collection of 1-6 Uint8 elements for UTF8, 1-2 Uint16 elements for UTF16, 1 Uint32 element for UCS4, 1 Uint8 for ASCII, or something else).
  c) Similarly, the methods called (eg) "toBase64" make the conversion unclear. Do you mean to return the base64-encoded string corresponding to the UTF8 encoding of the codepoints stored in the stringview? Or the base64 encoding of the UTF16 encoding of the codepoints? Or the base64 encoding of the "natural" contents of the backing array, in which case you need to specify whether little-endian or big-endian encoding of the backing array is expected.
  d) In the introduction you claim that the library is "highly scalable". I think you mean "extensible".
  e) I think you'd be better off creating a family of StringView subclasses, in the same way that Uint8, Uint16, etc are subclasses of ArrayBufferView. You'd then have UTF8StringView, UTF16StringView, UCS4StringView, etc. This would allow better optimization of the string view methods, instead of having to select one of a number of different implementations based on the underlying encoding.
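The distinction in point b) between backing-array elements and codepoints can be seen with a short string. A sketch using standard APIs (`Array.from` iterates a string by codepoint, and `TextEncoder` encodes to UTF-8); the variable names are arbitrary. Note that current UTF-8 (RFC 3629) uses at most 4 bytes per codepoint.

```javascript
// Four codepoints: "a" (U+0061), "é" (U+00E9), "€" (U+20AC), "😀" (U+1F600).
var sample = "a\u00E9\u20AC\uD83D\uDE00";

var utf16Units = sample.length;                          // Uint16 elements
var codePoints = Array.from(sample).length;              // codepoints
var utf8Bytes = new TextEncoder().encode(sample).length; // Uint8 elements

// 1 + 1 + 1 + 2 UTF-16 units, 4 codepoints, 1 + 2 + 3 + 4 UTF-8 bytes.
console.log(utf16Units, codePoints, utf8Bytes);
```

The same text therefore has three different "lengths", which is why documentation that just says "characters" is ambiguous.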

martingala commented 11 years ago

@cscott Thank you for your review ;) briefly...

a) this discussion is veering offtopic for asm.js; you should open a bugzilla for StringView first, and then once consensus on that has been reached, you can raise the issue of StringView in asm.js. That said, there's no apparent issue with invoking StringView methods as foreign functions and/or using asm.js to access the backing storage directly.

I made StringView as a generic API... asm.js is only one of its possible usage, I think...

b) you use the word 'characters' often in your documentation, which is rather misleading. You should try to make it very clear when you are referring to elements of the backing array (whether Uint8, Uint16, Uint32, etc) and when you mean codepoints (a collection of 1-6 Uint8 elements for UTF8, 1-2 Uint16 elements for UTF16, 1 Uint32 element for UCS4, 1 Uint8 for ASCII, or something else).

Yes, sorry for my poor English, I'm Italian ;) When I use the word "character" I mean "codepoint".

c) Similarly, the methods called (eg) "toBase64" make the conversion unclear. Do you mean to return the base64-encoded string corresponding to the UTF8 encoding of the codepoints stored in the stringview? Or the base64 encoding of the UTF16 encoding of the codepoints? Or the base64 encoding of the "natural" contents of the backing array, in which case you need to specify whether little-endian or big-endian encoding of the backing array is expected.

The return of toBase64() corresponds to the bytes of the stringView encoded into a base64 string, even when the stringView is UTF-16/UTF-32 encoded.

d) In the introduction you claim that the library is "highly scalable". I think you mean "extensible".

Yes ;) I'll try to improve the English of that page...

e) I think you'd be better off creating a family of StringView subclasses, in the same way that Uint8, Uint16, etc are subclasses of ArrayBufferView. You'd then have UTF8StringView, UTF16StringView, UCS4StringView, etc. This would allow better optimization of the string view methods, instead of having to select one of a number of different implementations based on the underlying encoding.

It is an idea. But it would be only an "aesthetic" idea, I think, because the only thing which would change would be some "if" statements...: instead of 'if (stringView.encoding === "UTF-8")' there will be something like 'if (stringView.constructor === UTF8StringView)'... etc... during conversions. And in some cases the encoding chosen is not important, so I don't know if it would be a good idea to split the StringView constructor...

cscott commented 11 years ago

I made StringView as a generic API... asm.js is only one of its possible usage, I think...

That's why I'm surprised that this discussion is taking place in the asm.js bugtracker.

The return of toBase64() corresponds to the bytes of the stringView

You need to specify endianness, then; the underlying ArrayBufferView leaves this undefined.
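The byte order issue is easy to demonstrate with standard typed array APIs. In this sketch, `platformIsLittleEndian` and `bytesOf` are hypothetical helper names; `DataView.prototype.setUint16` takes an explicit little-endian flag, while a plain `Uint16Array` uses the platform's native order.

```javascript
// Detect the native byte order used by multi-byte typed arrays.
function platformIsLittleEndian() {
  var probe = new Uint16Array([0x0102]);
  return new Uint8Array(probe.buffer)[0] === 0x02;
}

// Serialize one 16-bit code unit with an explicit byte order.
function bytesOf(codeUnit, littleEndian) {
  var view = new DataView(new ArrayBuffer(2));
  view.setUint16(0, codeUnit, littleEndian);
  return [view.getUint8(0), view.getUint8(1)];
}

console.log(bytesOf(0x0041, true));   // little-endian: low byte first
console.log(bytesOf(0x0041, false));  // big-endian: high byte first
console.log(platformIsLittleEndian());
```

Any base64 encoding of the raw bytes of a UTF-16 view inherits this ambiguity unless the byte order is stated.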

the only thing which would change would be some "if" statements

Virtual method dispatch is your friend.
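A sketch of what the subclass approach could look like. The constructor names follow the suggestion earlier in this thread, but everything here is hypothetical and is not the MDN StringView API; `unitsFor` and `describe` are invented for the example.

```javascript
// Shared behaviour lives on the base prototype and never inspects the encoding.
function BaseStringView() {}
BaseStringView.prototype.describe = function (codePoint) {
  // `this.unitsFor` resolves to the subclass implementation: no `if` needed.
  return this.encoding + " needs " + this.unitsFor(codePoint) + " unit(s)";
};

function UTF8StringView() {}
UTF8StringView.prototype = Object.create(BaseStringView.prototype);
UTF8StringView.prototype.constructor = UTF8StringView;
UTF8StringView.prototype.encoding = "UTF-8";
UTF8StringView.prototype.unitsFor = function (cp) {
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
};

function UTF16StringView() {}
UTF16StringView.prototype = Object.create(BaseStringView.prototype);
UTF16StringView.prototype.constructor = UTF16StringView;
UTF16StringView.prototype.encoding = "UTF-16";
UTF16StringView.prototype.unitsFor = function (cp) {
  return cp < 0x10000 ? 1 : 2;
};

var views = [new UTF8StringView(), new UTF16StringView()];
for (var v = 0; v < views.length; v++) {
  console.log(views[v].describe(0x20AC)); // U+20AC is the euro sign
}
```

Because each call site sees a single prototype chain per view type, the encoding-specific code is selected by method lookup rather than by comparing encoding strings.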

martingala commented 11 years ago

@cscott

That's why I'm surprised that this discussion is taking place in the asm.js bugtracker.

I haven't published that library elsewhere, so I haven't a bugtracker. I think that a good idea would be to move this discussion to my MDN discussion page...: https://developer.mozilla.org/en-US/docs/User_talk:fusionchess

You need to specify endianness, then; the underlying ArrayBufferView leaves this undefined.

Only one endianness is supported: the one chosen by the JavaScript engine!

Virtual method dispatch is your friend.

My syntax comes from tradition... like in Java...: new OutputStreamWriter(System.out, "UTF-16"); I like it :P

martingala commented 11 years ago

@cscott P.S. I have updated the StringView page on MDN with your suggestions. I have also updated the makeIndex() method ( https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays/StringView#StringView.prototype.makeIndex%28%29 ) Bye :)

trusktr commented 8 years ago

Out of curiosity, what are you guys planning to make (or have made) with asm.js + StringView?