coder-mike / microvium

A compact, embeddable scripting engine for applications and microcontrollers for executing programs written in a subset of the JavaScript language.
MIT License
569 stars, 25 forks

Plans for string prototype methods? #65

Open boogie opened 1 year ago

boogie commented 1 year ago

As far as I can see, even basics like [num] to access a character, or the length property, are not supported (there's a reference in the string test, but only as a comment). What is the plan for adding these, and basics like substr, indexOf, split, charCodeAt and fromCharCode? If there's no plan, what do you think is the best way to implement them?

boogie commented 1 year ago

I think the answer is similar to #64; however, the index format (str[0]) does not seem implementable without parser/engine support.

coder-mike commented 1 year ago

Yeah, it's a similar story. But unlike arrays, Microvium does not have a mechanism to self-reflect on the content of its own strings. The reason is that strings in Microvium are defined as UTF-8-encoded, since that is potentially more compact and more aligned with what people commonly use in C, whereas the ECMAScript standard defines strings as UTF-16. So a spec-conformant implementation of string.length cannot just return the byte size; it must iterate over the string data, translating UTF-8 to UTF-16, to get the equivalent UTF-16 length. The same goes for str[5], which needs to find the 6th UTF-16 code unit. See here on MDN for a short explanation of how strings normally work.
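To make the cost described above concrete, here is a minimal sketch in plain JavaScript of how a UTF-16 length could be derived from UTF-8 bytes. The function name utf16LengthOfUtf8 is hypothetical and this is not Microvium's implementation; it also assumes the input is already valid UTF-8 and does no validation.

```javascript
// Count the UTF-16 code units that a valid UTF-8 byte sequence would occupy.
// Assumption: `bytes` is an array (or Uint8Array) holding well-formed UTF-8.
function utf16LengthOfUtf8(bytes) {
  let units = 0;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    // Continuation bytes (10xxxxxx) belong to a character already counted.
    if ((b & 0xC0) === 0x80) continue;
    // A 4-byte sequence (lead byte 11110xxx) encodes a code point above
    // U+FFFF, which needs a surrogate pair, i.e. two UTF-16 code units.
    units += (b & 0xF8) === 0xF0 ? 2 : 1;
  }
  return units;
}

// 'a' (1 byte), U+20AC euro sign (3 bytes), U+1F600 emoji (4 bytes)
const sampleBytes = [0x61, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80];
const sampleLength = utf16LengthOfUtf8(sampleBytes); // 4
```

The whole string has to be walked to answer a length query, and an index lookup such as str[5] needs the same kind of walk up to the requested position.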

Roughly speaking, your options are:

boogie commented 1 year ago

What I'm trying to implement is sending messages over Bluetooth to the device and on to the Microvium engine. These are string messages with different content, and the device should be able to process them with JavaScript code. There are also buttons on the device, and I would like to process button presses as well. Different sources send different messages over a generic string interface. The device also has an RGB LED and a vibration motor.

When I send "%rgb", I would like to set the RGB LED to red for 0.5s, then to green for 0.5s, and finally to blue for 0.5s. When I send "~...---...", the vibration motor should play an SOS pattern.
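A rough sketch of what such a message handler might look like in plain JavaScript, assuming str.length and str.charCodeAt are available (which is what this issue asks for). The function planActions and the shape of the action objects are hypothetical; a real host would map them to its own LED and motor calls.

```javascript
// Translate one incoming message into a list of actions for the host.
// '%' starts an LED colour sequence, '~' starts a vibration pattern.
function planActions(message) {
  const actions = [];
  const kind = message.charCodeAt(0); // NaN for an empty message
  for (let i = 1; i < message.length; i++) {
    const c = message.charCodeAt(i);
    if (kind === 0x25) {            // '%'
      if (c === 0x72) actions.push({ led: 'red', ms: 500 });        // 'r'
      else if (c === 0x67) actions.push({ led: 'green', ms: 500 }); // 'g'
      else if (c === 0x62) actions.push({ led: 'blue', ms: 500 });  // 'b'
    } else if (kind === 0x7E) {     // '~'
      if (c === 0x2E) actions.push({ vibrate: 'short' });           // '.'
      else if (c === 0x2D) actions.push({ vibrate: 'long' });       // '-'
    }
  }
  return actions;
}

const ledPlan = planActions('%rgb');       // three LED steps of 500 ms each
const sosPlan = planActions('~...---...'); // nine vibration steps
```

Comparing character codes keeps the sketch within the small set of string primitives discussed in this thread.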

I would like to enter numbers with the numeric buttons, like 1234. My idea is to append every digit and check the string length. When it reaches 4 digits, I would like to do a calculation with it and then send a Bluetooth message to another device.
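The digit-entry idea could look roughly like this in plain JavaScript. It is only a sketch: createDigitEntry and its callback are hypothetical names, and it relies on str.length and str.charCodeAt, the features under discussion here.

```javascript
// Collect digit key presses and report each complete 4-digit entry as a number.
function createDigitEntry(onComplete) {
  let entered = '';
  return function onDigit(digit) {
    entered += digit;               // digit is a one-character string such as '7'
    if (entered.length === 4) {
      let value = 0;
      for (let i = 0; i < 4; i++) {
        // 48 is the character code of '0'
        value = value * 10 + (entered.charCodeAt(i) - 48);
      }
      entered = '';                 // start over for the next entry
      onComplete(value);
    }
  };
}

const received = [];
const onDigit = createDigitEntry((value) => received.push(value));
onDigit('1');
onDigit('2');
onDigit('3');
onDigit('4');
// received now holds the single number 1234
```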

This would be a generic tool that can process, transform events and data, etc. I would like to allow users to write code and create their solutions. I also would like to write some code in JavaScript to accelerate development.

So in general, I would like to do basic text processing. I see there are workarounds, but it would be great to keep it simple and easy to understand. It's hard to explain Uint8Array to users.

I think iterating over UTF-8 strings to find the 6th character is not a problem; it will be quick in most cases. The same goes for length. As the strings are mostly short, I would be happy to go with that as an MVP solution. I'm not working with non-ASCII strings at the moment, but I know it may become necessary later.

coder-mike commented 1 year ago

Ok, I'll take a look at it. I propose the following solution:

  1. I will add in a builtin string prototype object and global String object. The global String object will not be a constructor.
  2. Property access on strings using non-number keys (e.g. str.foo) will be delegated to the string prototype, with two exceptions: .length, which will return the number of equivalent UTF-16 code units or throw if the string is not valid UTF-8, and .__proto__, which will return the prototype itself.
  3. The prototype will have just one method, str.charCodeAt, which will return the equivalent UTF-16 code unit. It will throw a type error if the underlying string is not valid UTF-8. The returned code unit is allowed to be part of a surrogate pair. Users can add more methods to the string prototype if they choose.
  4. The global String object will have one method, which is fromCharCode. This will take only one or two arguments, not an arbitrary number of arguments. The two arguments must be two valid UTF-16 code units. The function will return the equivalent string (internally encoded as UTF-8). It will throw if the arguments together do not form a valid Unicode code point.
  5. Integer property access on strings, such as str[5], will return a new string of length 1 which is equivalent to calling String.fromCharCode(str.charCodeAt(5)). This will throw if the code unit 5 is part of a surrogate pair.
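For readers less familiar with surrogate pairs, the snippet below shows how the three primitives in the list above behave in a standard engine such as Node.js for a string containing a character outside the Basic Multilingual Plane. It illustrates standard ECMAScript semantics only; the proposal above differs in that str[1] and a single-argument String.fromCharCode(0xD83D) would throw instead of producing a lone surrogate.

```javascript
// 'a' followed by U+1F600, which UTF-16 stores as the pair 0xD83D 0xDE00.
const text = 'a\u{1F600}';

const unitCount = text.length;        // 3 code units: 'a' plus the two surrogates
const highUnit = text.charCodeAt(1);  // 0xD83D, the high (leading) surrogate
const lowUnit = text.charCodeAt(2);   // 0xDE00, the low (trailing) surrogate

// Passing both halves of the pair rebuilds the original character.
const rebuilt = String.fromCharCode(highUnit, lowUnit);
const roundTrips = rebuilt === '\u{1F600}'; // true
```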

I believe that this is the minimum that would be required for a user to write all the other string methods, e.g. in the form of a library. charCodeAt, fromCharCode and length give a user access to the equivalent UTF-16 code units, and the ability to extend the prototype makes the solution extensible.
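As a sketch of what such a user-level library might look like, the two helpers below use only length, charCodeAt, fromCharCode (with one or two arguments) and string concatenation. The names sliceUnits and indexOfUnits are hypothetical, the code is written for and checked against standard JavaScript rather than the Microvium branch, and indices are UTF-16 code unit positions.

```javascript
function isHighSurrogate(u) { return u >= 0xD800 && u <= 0xDBFF; }
function isLowSurrogate(u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Copy the code units in [start, end) into a new string.
function sliceUnits(str, start, end) {
  let out = '';
  let i = start;
  while (i < end && i < str.length) {
    const u = str.charCodeAt(i);
    const pairFits = i + 1 < end && i + 1 < str.length;
    if (isHighSurrogate(u) && pairFits && isLowSurrogate(str.charCodeAt(i + 1))) {
      // Emit both halves together so fromCharCode sees a complete pair.
      out += String.fromCharCode(u, str.charCodeAt(i + 1));
      i += 2;
    } else {
      // Under the proposal this call would throw for a lone surrogate.
      out += String.fromCharCode(u);
      i += 1;
    }
  }
  return out;
}

// Return the first code unit index at which needle occurs in str, or -1.
function indexOfUnits(str, needle) {
  for (let i = 0; i + needle.length <= str.length; i++) {
    let j = 0;
    while (j < needle.length && str.charCodeAt(i + j) === needle.charCodeAt(j)) j++;
    if (j === needle.length) return i;
  }
  return -1;
}

const word = sliceUnits('hello world', 6, 11);       // 'world'
const position = indexOfUnits('hello world', 'world'); // 6
```

The same functions could be attached to the string prototype once it is extensible; they are kept as free functions here so the sketch does not depend on how this is bound in method calls.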

boogie commented 1 year ago

Oh, this sounds awesome. Thanks a lot. I agree that having these will allow me to implement all the features I would like to.

boogie commented 1 year ago

Probably I've closed this issue by accident. Your proposed solution is great.

boogie commented 10 months ago

When do you think you can add these features? (No pressure)

coder-mike commented 10 months ago

Hi. Sorry, my 2-year-old started going to childcare earlier in the year and suddenly started catching all these viruses from the other kids and has been sick almost constantly since then and it took all my energy and time. I'm starting to get back on track now, but I want to close off a few half-completed branches/features before I start on this one.

Let me see what I can do. Apologies for the long delay. I'll see if I have time later this week, otherwise we're looking sometime in August probably.

boogie commented 10 months ago

No worries. Mine are 11 and 13, and it is still happening. :D 🤞 Thanks, and looking forward to the updates.

coder-mike commented 10 months ago

WIP - the branch 65-string-methods has support for str.length and str[i] if the strings are only ASCII.

Unicode support is a rabbit hole and I'll need to resume it another time.

My revised proposal is this:

The reason for the MVM_TEXT_SUPPORT macro is that this feature is actually really expensive. The initial draft implementation of Unicode support for just str.length and str[i] took almost 600 bytes of program space, compared to the current 10 kB for the whole engine.

coder-mike commented 7 months ago

Sorry, this isn't likely to happen any time soon. I'm swamped with other things. If anyone else wants to pick up this work and contribute or help, the branch is 65-string-methods.

coder-mike commented 6 months ago

How important are the string prototype methods? I'm encountering multiple problems in the implementation.

The first is that I don't want to force a whole object to be allocated in memory for people who don't use that feature. I think I can get around this by saying that the string prototype object is allocated lazily at compile time. So if you access ''.__proto__ at compile time then it will create the string prototype, which will persist to runtime.

The second problem is more significant: in the spec, a method call on a string passes a this value that is not the string itself but a String object. As in the following:

''.__proto__.myMethod = function () { console.log(typeof this) }
'abc'.myMethod(); // Logs "object" not "string"
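One detail of standard ECMAScript that bears on the snippet above: the wrapping of this into a String object only happens for non-strict functions. A strict-mode function called on a string primitive receives the primitive itself. The sketch below shows both cases as they behave in a standard engine such as Node.js; it says nothing about Microvium's behaviour, and the method names are hypothetical.

```javascript
// Non-strict function: built with the Function constructor so that it stays
// non-strict even if the surrounding file is in strict mode.
const sloppyTypeOfThis = new Function('return typeof this;');

// Strict function: `this` is passed through without wrapping.
function strictTypeOfThis() {
  'use strict';
  return typeof this;
}

// Hypothetical method names, installed only for this demonstration.
String.prototype.sloppyTypeOfThis = sloppyTypeOfThis;
String.prototype.strictTypeOfThis = strictTypeOfThis;

const sloppyResult = 'abc'.sloppyTypeOfThis(); // "object": a String wrapper
const strictResult = 'abc'.strictTypeOfThis(); // "string": the primitive

// Remove the demonstration methods again.
delete String.prototype.sloppyTypeOfThis;
delete String.prototype.strictTypeOfThis;
```

So an engine that passes the primitive as this matches what the spec prescribes for strict functions; whether that is enough for compliance depends on whether non-strict functions need to be supported.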

Microvium doesn't have these wrapper types like the String object. I'm hesitant to add them because I think it would be a lot of overhead for a feature that's not directly useful in itself (I think most code avoids String objects and instead just uses primitive string types), and it could complicate stuff like string coercion.

I could leave it non-compliant and just pass this as the string primitive instead of the object wrapper. But I prefer to avoid adding non-compliant behavior to the engine.

So the question is, how important is it? And, can you (or anyone reading this thread) think of any other clever ways of supporting it in a way that is compliant and doesn't add much cost to those who don't use the feature?

boogie commented 6 months ago

Hi, from a JavaScript developer's point of view, I think they are a very important feature of the language. Even basic string manipulation is hard without them. Please note, that in JavaScript, strings are most of the time NOT objects, ONLY behaving like objects. Probably this is the solution you are looking for. https://developer.mozilla.org/en-US/docs/Glossary/Primitive

"Primitives have no methods but still behave as if they do. When properties are accessed on primitives, JavaScript auto-boxes the value into a wrapper object and accesses the property on that object instead. For example, "foo".includes("f") implicitly creates a String wrapper object and calls String.prototype.includes() on that object. This auto-boxing behavior is not observable in JavaScript code but is a good mental model of various behaviors — for example, why "mutating" primitives does not work (because str.foo = 1 is not assigning to the property foo of str itself, but to an ephemeral wrapper object)."

Of course, you have to consider which use cases of Microvium you would like to support. My view is that if you have a device with a display, you will most probably work with strings and will likely have to manipulate them.

boogie commented 6 months ago

I think passing the primitive value as this is a good enough solution. I have no real use cases in mind where you would need more; I think it is 99% compliant. As a JavaScript developer, I have never used strings as real objects, other than accessing their properties and methods.