UnitedLexCorp / SimpleTalk

An implementation of HyperTalk in Ohm-js
Apache License 2.0
7 stars 1 forks source link

On Chunking #42

Open darth-cheney opened 4 years ago

darth-cheney commented 4 years ago

Chunking: Proposals and Discussion

What is Chunking?

(For more detail, see PDF page 143 of The HyperTalk manual)

The original Hypercard/HyperTalk system made use of a kind of querying and manipulation pattern it called 'chunk expressions.' These expressions are essentially queries on types of objects that can be iterated upon, by the referring to indexes of named 'chunks'.

For example, consider the following HyperTalk expression:

first character of word 7 of third line of card field 1

This expression tells the system to get the value of the first field on the card (more on values later), which is text. It then says to 'chunk' that text into types of chunks called 'line', then get the third line in a sequence of such chunks. It then says to further chunk that smaller line/chunk of text into a type of chunk called 'word' -- and so forth.

Chunking is an important part of the HC system, and it was a powerful feature for users. Though it results in somewhat verbose scripting if wielded too often, there is never any doubt about "what is going on"

When implementing a generalized chunking system, we want to consider use cases that HC didn't even have the opportunity (or need) to reckon with, including:

I argue that because we will probably want at least a couple of these speculative abilities for chunking, we really need to think hard about how the chunking implementation should be structured.

Implementation Questions

I have pushed work to a couple of branches that does two different types of basic implementation for chunking. They have important features in common, which I will explain below. There are probably other methods that I haven't even thought of, and I hope that others will have suggestions.

The Most Primitive Chunkables: Strings and Arrays

In our HyperTalk example above, you'll notice that the expression is really just a kind of recursive 'chunking' of ever-smaller strings using different kinds of chunks (line, word, then character).

Perhaps the most common chunkable object type in the system is a String. Another type of thing that can be chunked is an Array of values. This tell us something important about the kinds of objects that should be chunkable: they are sequences of some kind.

With that in mind -- and given that we are implementing this system in Javascript -- I assumed for my implementations here that it makes the most sense to add to the base Array and String constructors in JS by giving them new 'chunking' abilities. This should handle 99% of our earliest and most common uses for chunking.

Approach 1

In my first implementation, which you can find at eric-chunking, only implements retrieval of chunks for a given object (more on this later). It does this by directly adding properties to String.prototype and Array.prototype, so that all Arrays and Strings in the system will have chunking capabilities. The implementation is small enough to include inline here:

/* JS String Chunking */
String.prototype.chunkers = {
    line: function(){
        return this.split("\n");
    },

    character: function(){
        let chars = [];
        for(let i = 0; i < this.length; i++){
            chars.push(
                this.charAt(i)
            );
        }
        return chars;
    }
};

String.prototype.chunkBy = function(aChunkType){
    if(!this.chunkers){
        return null;
    }
    if(!Object.keys(this.chunkers).includes(aChunkType)){
        return null;
    }

    return this.chunkers[aChunkType].bind(this)();
};

/* JS Array Chunking */
Array.prototype.chunkers = {
    item: function(){
        return this;
    }
};

Array.prototype.chunkBy = function(aChunkType){
    if(!this.chunkers){
        return null;
    }
    if(!Object.keys(this.chunkers).includes(aChunkType)){
        return null;
    }

    return this.chunkers[aChunkType].bind(this)();
};

The short explanation: each prototype now has an object called 'chunkers' whose keys are the names of kinds of chunks it can make of itself (think line, word, or character) mapped to functions that will produce those chunks. In order to call these functions with the correct context binding (ie, the string instance itself as the this keyword), a consumer will call someString.chunkBy('line') or whatever chunk name they are looking for. The return value, as it should be with anything that chunks, is an Array of the chunks.

This approach is fairly small, simple, and easy to grasp. However it leaves out a crucial capability of HC/HT that we have not discussed yet: scripters should be allowed to not only get the values of chunks, but also to set them.

Consider this HyperTalk script:

put "hello" into the second word of the third line of card field 2

Our simple implementation above will not be able to handle this setting. This is because it requires that we somehow keep track of the original path of the nested chunkers, and also implement a more universal way of "setting" something at that index.

For Array the pattern is simple, because Arrays only have one chunk type (item) and setting a value at an index of an array. But Strings in JS are pretty much immutable -- you can't modify them in place. Furthermore a word can be a subchunk of a line, but once we've pulled out a given line to chunk into words, we no longer (in the chunker function) have any concept of where the original lines came from, and we therefore cannot update the object of origin.

This is currently my biggest mental block to implementing chunking. My second pass attempts to deal with this complexity.

Attempt 2

Unfortunately, my next implementation is more complicated. You can find it on the branch eric-chunking-alt1.

Like the previous implementation, we are still adding to base String and Array prototype objects. We also still have a chunkers object, but rather than having its keys be chunknames referring to functions for making chunks, these instead now refer to a generic object that has three required methods of its own: Method Name Description
getAll() Returns an array of all chunks of the given kind
getAt(anIndex) Gets the single chunk of the given kind at the specified index
setAt(anIndex, aValue) This is key: this method returns a new copy of the original object, with the chunk at the provided index modified to be the passed in value

You can find the implementation here. I won't post it into this issue since it's a big longer.

How Else?

Given the summary above, how else could we implement generalized chunking? How can we deal with this "setting" issue in a clean, simple way?

Quick Note on 'Values'

In the original HyperTalk/Hypercard system, certain of the available Part objects also had 'values'. In practice this meant mapping one or more of the Part's properties to be its 'value', where a scripter could get or set. For example, the 'value' of a Field is simply a mapping to its textContent property (another way of saying a Field's value is its current text). This manifests itself in script as something like put "hello" into card field 1 (where put is setting the 'value' of card field, which is a mapping to its text property).

The 'values' of parts are designed for the scripter/user, and therefore have mapping that we as programmers might find unintuitive. For example, the 'value' of a Button part is the title that appears inside of the Button. put "click me" into button 3 will set the value of the Button's title to the given string. first character of button 3 will likewise get the 'value' of the button (its title text), chunk by characters, and return the first one -- in this case 'c'.

Both of my implementation branches above have accounted for the 'values' of Parts by introducing a new superclass for Part called ValueHolder, which expects a getter/setter for .value. The default implementation is to do nothing.

Open Questions

  1. What is a reasonable implementation that gives us the universality we are hoping for?
  2. What use cases can we imagine for chunking?
  3. Do we even want to implement chunking at all, or do something completely different (or not at all)?
ApproximateIdentity commented 4 years ago

I'm not totally sure I understand why these two are so different:

first character of word 7 of third line of card field 1

put "hello" into the second word of the third line of card field 2

Why not do it all at the "pointer" level? I.e. you implicitly change first character of word 7 of third line of card field 1 to get the first character of word 7 of third line of card field 1 internally. Then you basically view first character of word 7 of third line of card field 1 as returning a pointer. This way the get the is sort of like a dereferencing/access operator while the put ... into is basically a dereferencing/assignment operator. I mean could get a little tricky to get right, but it doesn't seem that bad to me.

That said I really feel like the second approach. In this case the generality seems to be a bit of a feature to me. I.e. by making it general you restrict the possible solutions. This seems good to me.

But I guess I have a final question. Is actually that important if you do the more general version first? I.e. is the more general version backwards-compatible with the less general version? In that case you'd just be saying that you can use chunking with some restrictions and then later you might remove those restrictions. In that case, the less general implementation doesn't really seem to lose anything to me. In other words, doing the (seemingly) simpler more specific caseisn't any long-term loss and might clarify thinking anyway. In that case, why not do the simpler more restricted case first?

dkrasner commented 4 years ago

Essentially "to chunk" is an operation that transforms an object into an array (or array-like object, i.e. something that has a lookup based on passed key, see more below), which can then be indexed (looked up), the items in the array manipulated (gotten, set, etc), and potentially "de-chunked" i.e. put back together

Various chunk references, like "word" "line" "5th" etc, are really just incidental. All they say is <> where the appropriate thing is known via context.

Another way to think about this is that a statement like put "hello" into the 5th word of the 10th line of card field 1 is equivalent to something like set("hello", chunk(5, chunk(10, chunk(1, card)) i.e. a nested set of chunk operations where each object understands what it is to be chunked (and then subsequently returns the corresponding index value).

I am imagining a setup where any object can accept a message of type: chunkMe, element: N to which it will either return DoesNotUnderstand or the Nth element of the "array"

I put array in quotes b/c in theory we can return whatever the thing is referenced by element: N - this could be a value in a json/dict where N is the key, or a subpart (put "hello" into button "hello button" of card 5 of stack1)

In addition, every object knows how to deal with a change to one of its chunks (i.e. a set-like operation) which then updates the value, or whatever "value" means for said object; or alternatively every object out there is made up of "atomic" parts, i.e. those irreducible by chunking (characters, basic objects - buttons, cards...), and a get operation recursively puts any object back together from its chunks (character -> word -> line -> paragraph -> field -> card ..)

darth-cheney commented 4 years ago

@ApproximateIdentity

Why not do it all at the "pointer" level? I.e. you implicitly change first character of word 7 of third line of card field 1 to get the first character of word 7 of third line of card field 1 internally. Then you basically view first character of word 7 of third line of card field 1 as returning a pointer. This way the get the is sort of like a dereferencing/access operator while the put ... into is basically a dereferencing/assignment operator. I mean could get a little tricky to get right, but it doesn't seem that bad to me.

Totally makes sense. The only issue currently is that we cannot easily set values based in these pointers when we are chunking a String, because the Strings cannot be modified in place (though basic Arrays can).

@dkrasner

Another way to think about this is that a statement like put "hello" into the 5th word of the 10th line of card field 1 is equivalent to something like set("hello", chunk(5, chunk(10, chunk(1, card)) i.e. a nested set of chunk operations where each object understands what it is to be chunked (and then subsequently returns the corresponding index value).

This is mostly right, though the recursive chunking pseudo-code you've provided does not take into account the kind of chunk being produced. In the case of plain arrays there is only the itemchunk. For Strings we have (for now) line, word, and character. It just so happens that these are hierarchical and will reduce, but can we imagine kinds of objects that provide chunks that won't necessarily reduce hierarchically (for example, an array of more complex objects that themselves are chunkable somehow)?

In addition, every object knows how to deal with a change to one of its chunks

This would be the ideal scenario for sure. Perhaps for the moment we can say that since the "ultimate source" of the initial chunking will be from a Part, we attach this chunking behavior only to Parts. For example, the Field part will know how to chunk its own value (which is the text content) into lines, words, and characters and likewise how to set based on indices of these chunks. That gets rid of some of my problems with modifying String.prototype and also gets us closer to Thomas' pointer notion.

What we would not be able to do in this case is something like character 2 of second word of "hello world", since the "ultimate source" would be a literal string and not a Part. I don't know a way to make a consistent interface for both actual Strings and Parts whose values are strings.

What I was trying to leave space for was future remote data returns et cetera -- information coming in that could also be 'chunkable.' But as long as we make the chunking interface on objects consistent then we can just use some kind of polymorphism when we need to. Double true if we implement this in messaging.

ApproximateIdentity commented 4 years ago

Why not do it all at the "pointer" level? I.e. you implicitly change first character of word 7 of third line of card field 1 to get the first character of word 7 of third line of card field 1 internally. Then you basically view first character of word 7 of third line of card field 1 as returning a pointer. This way the get the is sort of like a dereferencing/access operator while the put ... into is basically a dereferencing/assignment operator. I mean could get a little tricky to get right, but it doesn't seem that bad to me.

Totally makes sense. The only issue currently is that we cannot easily set values based in these pointers when we are chunking a String, because the Strings cannot be modified in place (though basic Arrays can).

Yeah I think that it's totally fine to change the string class to some sort of STString class and just add whatever methods you need. The fact that you've been able to represent your strings by js strings until now is just sort of a lucky accident. But if it no longer fits, then can replace the implementation.

dkrasner commented 4 years ago

This is mostly right, though the recursive chunking pseudo-code you've provided does not take into account the kind of chunk being produced. In the case of plain arrays there is only the itemchunk. For Strings we have (for now) line, word, and character. It just so happens that these are hierarchical and will reduce, but can we imagine kinds of objects that provide chunks that won't necessarily reduce hierarchically (for example, an array of more complex objects that themselves are chunkable somehow)?

All I meant by that pseudo code is that any one object that we interact knows how to "chunk" itself (understands the message "chunk" etc), or is an atomic object that does not chunk and responds accordingly. The output, whatever output here means, of chunk<- get(N) is another object. IE both chunk and get are polymorphisms and it's up to the objects to deal with what needs to happen.

So yes you can have something like

MyStack -> chunk() -> get(5) -> chunk() -> get(7)  -> chunk() -> get(4) -> chunk() -> set(1, "NewVar")
//stack    ->       //5th card        ->// 7th field (a math eqn)  -> // 4th eqn part     // 1st variable in eqn is now set to "NewVar"

all of the above is trying to show is that chunking here can produce Parts, fields, weird things we haven't defined like a math expression/equation which knows how to chunk itself (which might not mean transforming itself into an array at all), etc

This would force us to add a chunk message command handler to every object out there, with the default being DnU, i.e. atomic or unchunkable (there could be a distinction between these two states)

dkrasner commented 4 years ago

In addition, every object knows how to deal with a change to one of its chunks

This would be the ideal scenario for sure. Perhaps for the moment we can say that since the "ultimate source" of the initial chunking will be from a Part, we attach this chunking behavior only to Parts. For example, the Field part will know how to chunk its own value (which is the text content) into lines, words, and characters and likewise how to set based on indices of these chunks. That gets rid of some of my problems with modifying String.prototype and also gets us closer to Thomas' pointer notion.

I think the distinction on sending "chunk" to a Part or to a Part's value (note I imagine that Part.value is also an object which understand what to do when it gets the "chunk" message), is going to be mostly a problem in grammar/semantics and handled by the compiler. We would need to make sure that this distinction is clear in the grammar itself.

For example, field 2 of card is chunk operation on a part Card while paragraph 2 of field is a chunk operation on Field.value.

What we would not be able to do in this case is something like character 2 of second word of "hello world", since the "ultimate source" would be a literal string and not a Part. I don't know a way to make a consistent interface for both actual Strings and Parts whose values are strings.

If Part.value is an object that understands both "chunk" and "represent" then character 2 of second word of "hello world" would "return" a Part.value object :

Part.value.representation()
> "world"
Part.value.chunk(2)
> O // Value type Object
O.representation()
> "o"

What I was trying to leave space for was future remote data returns et cetera -- information coming in that could also be 'chunkable.' But as long as we make the chunking interface on objects consistent then we can just use some kind of polymorphism when we need to. Double true if we implement this in messaging.

darth-cheney commented 4 years ago

After reading suggestions by @ApproximateIdentity and discussing it on the phone with @dkrasner this morning, we decided to take the approach outlined below. I also read up on the lens pattern, per the helpful suggestion of @mdeland. However, there are some things that I simply cannot fully wrap my head around, and the nested function creation of lenses (at least in the JS examples I saw) unfortunately falls into that category. It did give me some insight regarding the new tack, however.

Here's the idea: whenever we 'chunk' something, we do not return an actual array of sub-values. Instead we return an array of instances of a new kind of object called Chunk. The Chunk object has a value property that wraps the concrete "thing" that is chunked from while also providing convenience methods for additional chunking.

You can see this in action on the branch eric-chunking-alt2. There are only two files that have changed: the chunking implementation and a new test file.

Chunk Properties

Property Name Type Description
value Object The underlying item that is the value of the chunk. For example, if this Chunk was created by chunking a String into words, it is a word string
canChunkInto(chunkName) Function For a given chunk name, this function looks up the available chunkers on the wrapped object referenced by value and determines if the object has a chunker by that name
chunkInto(chunkName) Function For a given chunk name, lookup the concrete function in the value's chunkers, bind the value as the context, and execute it. Return the resulting collection of Chunks. Under the hood, this is calling the chunkers[chunkName].chunk function
combineFromChunks(chunkName, chunkArray) Function For a given chunk name, look up the value's chunkers and find the matching name, grabbing the combine function. Bind the value as the context and execute, returning the result, which should be a single chunk. This is how you "reverse the chunking" of a given chunkName
get(chunkName, index) Function Attempts to chunk the value into chunks of the given name, returning the Chunk at the provided index. Returns null if the chunkName is not available for the value or if it's outside the bounds of the resulting chunk array
set(chunkName, index, newValue) Function Chunks the value into an array of Chunk objects of the specified name, finds the Chunk at the given index, and sets its value property to the passed-in newValue. This function then returns the result of combineFromChunks on the array of chunks, resulting in a single Chunk composed with the new change included

This pattern also means that anything we want to "chunk" should first be wrapped into a Chunk object. For example, if we want to get the first word of the second line of a string, we would do something like:

let exampleString = `line 1 is here\nhello this is line 2`;
let exampleChunk = new Chunk(exampleString); // exampleChunk.value == exampleString
let wordChunk = exampleChunk.get('line', 1).get('word', 0); // word is a Chunk whose value == 'hello'

New Modifications to 'chunkers'

In the earlier examples, I modified String.prototype to have an object called chunkers which itself had keys named after the various available chunk names mapped to the functions that actually do the chunking. As you'll see here for example, a given chunker object now provides two functions:

Tests and Problems

If you run or just look at the test file, you can hopefully get an idea of what is going on.

From my perspective, the problem with this current implementation is that 'setting' nested chunks is still super unclear and, with the current API, extremely verbose. This becomes obvious if you look at the very last test in the test file, which, while passing, is really ugly to look at.

I'm very much open to critiques or suggestions here. Or, if you all are better able to understand it than I am, perhaps we should attempt a version that works like lenses?

ApproximateIdentity commented 4 years ago

I must be missing something here because I can't understand why we don't just chunk objects into arrays of other objects. I.e. if we were looking at things at the type-level the function would be something like this:

    Chunk: A -> List[B]

Then there could be an inverse function:

    Join: List[B] -> A

In the simple cases we might have Chunk splitting a "paragraph" into an array of "sentences". In that case we would just have our other operations being regular indexing. I.e. if you want to grab the third chunk, then you just grab the third element of the array. If you want to set the third chunk then you do this:

  1. Chunk: A -> List[B]
  2. Replace the third value with the new one.
  3. Join: List[B] -> A

That would take a paragraph and then return a new paragragh where the third sentence is replaced.

There are of course details here, but I feel like the idea is totally straight-forward. Am I missing something basic here? I mean I'm guessing I'm just re-hashing discussions you've had on calls and I don't want to second-guess those conclusions, but I just don't understand what is so fundamentally wrong here.

I think what might be helpful for me would be something like the following:

  1. Given an explicit example paragraph, what should "get the third sentence of" return? In other words, what is the desired functionality entirely ignoring andy questions of implementation.
  2. Given an explicit example paragraph, what should "put blah into paragraph" return? Once again ignoring implementation.

We have the option to define "paragraph" and "sentence" however we want. Should it be a string? Should it be an abstract object? What methods should it have? Given that I just don't see the issue here.

darth-cheney commented 4 years ago

@ApproximateIdentity

I can't understand why we don't just chunk objects into arrays of other objects.

This is indeed the strategy I used in my previous comment. The rule in that example is that all chunking results in an array of Chunk objects.

the function would be something like this: Chunk: A -> List[B] Then there could be an inverse function: Join: List[B] -> A

This is also what I had in the chunkers in the example above, except that I called it combine instead of join.

There are of course details here, but I feel like the idea is totally straight-forward. Am I missing something basic here? I mean I'm guessing I'm just re-hashing discussions you've had on calls and I don't want to second-guess those conclusions, but I just don't understand what is so fundamentally wrong here.

For me, it's precisely those details that are confusing. Consider these two examples:

put "b" into  the last character of the third word of line 10 of card field 1
put "b" into the last character of line 10 of card field 1

In the first example we are chunking down 3 times and in the second twice -- how does each kind of chunk know how to "join" itself with respect to the chunk it was chunked from? Should it have some knowledge about the "parent" chunk? Between the two examples, as far as I can tell, joined character chunks need to know to either join themselves as words or as lines. And since we don't want to set specific hierarchies, the objects that do this need to retain some information about the chunking abilities.

Additionally, in the first of those examples, we are not actually "setting" on any of the line or word chunks, but merely "getting" them to find a single word, then "setting" the appropriate character. This makes for awkward function calling.

This is the task I took up in my previous comment. I've already done what you suggested, I think. The problem is that the API is awful (see the very last test in the test file).

Given an explicit example paragraph, what should "get the third sentence of" return? In other words, what is the desired functionality entirely ignoring andy questions of implementation.

Since this question is asking about String chunking specifically, we probably want to return a String. In my latest implementation -- the comment above yours -- I decided it was easier to have it return as Chunk instance whose value is a String, but that might not be the way to go (ie, wrap everything in a Chunk before chunking). "Getting" has not really been a problem so far though.

Given an explicit example paragraph, what should "put blah into paragraph" return? Once again ignoring implementation.

This is a bit of an open question, and one of the reasons for this github issue. Strings are immutable in JS so we cannot change them in place. For Strings, setting any subchunk will ultimately have to return a copy of the whole modified String. Do we want to say now that we will always return copies of the original chunked item, modified?

We have the option to define "paragraph" and "sentence" however we want. Should it be a string? Should it be an abstract object? What methods should it have? Given that I just don't see the issue here.

Yes, but one of the points in opening this issue was to try and address those exact questions. For me this whole thing has been "easy in theory" but not in implementation. Now I could be making a complete mess of it by overcomplicating, and that's also kind of what I'd like to know. Strings will not be the only things that can be chunked, so I've been trying to find the most general pattern I can. We can always go the verbose route of making new object types for each chunk (ie StringWordChunk, StringLineChunk, FooObjectChunk, whatever) and attaching capabilities to those. But I hope it's clear why I was trying to avoid that.

ApproximateIdentity commented 4 years ago

I think I understand what you're saying. I'm going to change the terminology a little to ignore any past implementation. Say you have the following pseudo code and concepts of producing sentences and words:

paragraph = "Sentence one. Sentence two."

paragraph.split_sentences() = ["Sentence one.", "Sentence two."]
paragraph.split_words() = ["Sentence", "one", "Sentence", "two"]

Then you have the following main question: Does joining words back up produce sentences or paragraphs or does it depend on what you produce?

This is actually just equivalent to us defining what semantics we want. This really makes me think it's even more important to write out explicit examples to see what we want to return. Without that I don't see how we can even think about implementations anyway.

By the way, the fact that JS strings is immutable doesn't really matter. We can write whatever implementation we want. Do we want "strings" (e.g. the class of string we use in SimpleTalk) to be immutable? Then we can do that. If we want it to be mutable we can have that as well. We'll just have to write the class/methods to match what we want. Internally somewhere we'll be storing using JS strings, but the fact that they're immutable there doesn't really have any effect on this because we pick our implementation no matter what we do.

ApproximateIdentity commented 4 years ago

The problem is that the API is awful (see the very last test in the test file).

Meant to address this. I think we should write up our idea of what we want in the simpletalk grammar not in a JS api. The JS api can be whatever we want. What I'd like to do is have a list like the following:

Then once we have that we can start implementing it. But without that I'm not really sure where to take this. I mean I personally can't really say what implementation I like since I don't know the goal of the implementation.