Consider giving molds a `:type-vector` property to support auto composability

Sorry, this got a little long. Writing this up helped me think through things.

TLDR: If we give mold definitions an ordered list of type information, in increasing specificity, and write a few functions, moldable-emacs can synthesize composable molds without having to write them.

From what I can tell, the "moldable" concept is about defining a graph:

The vertices in the graph are states data, (i.e. a buffer) can be in
The edges in the graph are the molds, i.e. functions which transform data from state, i.e. Mold = f(S_n) = S_m
The graph is directed and permits loops, i.e. you can go from list -> csv -> list, or S_n -> S_m -> S_n.

My impression is that the power of the moldable concept comes from the fact that if you can connect novel data in a buffer with a vertex in the graph with high valency (i.e. a high number of edges), you can now leverage the full body of molds to do interesting things. That is to say, if you can make the data in your buffer look like something the moldable graph knows a lot about, you can transform that buffer into interesting things.

Currently, moldable-emacs has the concept of "composable molds". This is a manual definition of a path through the graph. Molds already define their requisite starting state, S_n which includes assertions about runtime requirements as well as data type information. If they separated out the runtime-requirements from statements about the data's state, and then declared the state they will leave a buffer in, S_m, it should be possible to write an algorithm to automatically discover what "end states" are possible for the current buffer.

This would reduce the maintenance burden of manually defining every path through the graph possible, and might surprise the user at what's possible with their current buffer.

Representing the vertices beginning state, S_n

I think there are a few pieces of information:

What libraries must be available
Any preconditions on activation
The most general type of data that can be accepted (I'll expound in the next section)

Representing the vertices end state, S_m

I think it should provide the type information of the data in increasing specificity (e.g. this is an emacs-lisp buffer, which means it's code, which means it's a tree, but it might be a tree with a pre-defined type, like an alist or plist, etc.). This is a vector of type-information, i.e. (using the previous example) <emacs-lisp, code, tree, plist>, or TV=<T₀, T₁, ..., T_n>. Note that the mold is in a unique position to have the highest resolution of what type-vector the data it produces has.

Building the graph, G

When the hypothetical algorithm tries to build the graph, it would need to walk the list of known molds, V={M₀, M₁, ..., M_n}, and for each mold, M_i:

Determine if M_i is activatable, e.g. does the running emacs process contain the necessary libraries to execute the mold. This is a filter for whether the mold enters into the graph.
Compare against each mold, M_j such that f(M_i, M_j) = { if T_j ∈ TV_i, then T_j, otherwise nil.
If T_j is returned, then we add the edge between M_i and M_j.

Using the graph, G

When a user invokes me-mold, walk the graph to find the vertex whose type-vector fits the buffer/data in the most general way, i.e. f(buffer, G) = M_i where:
- f(M_i) = required type T_i
- f(buffer) = TV_b
- T_i ∈ TV_b
- f(T, TV) = index of T, TV_n
- n < any other TV_n, i.e. the match with the lowest index wins
- In the case where there are multiple TV_n which are equal, prefer the TV with the lowest magnitude (length) | TV |.
  - In the case where there are multiple TV of equal magnitude, pick the first one sorted alphanumerically (this makes things deterministic).
Calculate the minimum spanning tree for M_i (for performance, these could be cached when the graph is constructed, and when molds are added at runtime).
Present the terminal vertices as Mold options to run.
Traversing the path through the graph may fail if one of the mold's prerequisites fails. When this happens, a copy of the graph should be made for the buffer, the edge that failed should be removed, and the minimum spanning tree should be recalculated.

The reason to use the mold with the most general matching type-vector as a starting point is so that we get the most paths through the graph as possible. Going back to our example, <emacs-lisp, code, tree, plist>, there might be some molds that can operate on plists, but not trees. If we pick the plist mold as a starting point, we are now disconnected from any other vertices that might know what to do with emacs-lisp, code, or trees.

The reason to use the mold with the lowest magnitude type-vector is because between two molds that have matched the buffer's type in equally general ways, the mold with the higher length type-vector is making the most specific claim about the data in the buffer, and we cannot be sure the assertion holds since a mold did not generate the data. E.g. a buffer is highly likely to be <emacs-lisp, code, tree>, but it is as of yet unknown whether it is <emacs-lisp, code, tree, plist>.

Implications

This approach changes molds a little. Instead of defining molds as BeginStateToEndState, e.g. "CSVToOrgTable", molds should only be named to refer to their end state, e.g. "ToOrgTable" (or I suggest, "to-org-table"). This is because that when the user is presented with options, the options will be terminal vertices, but the span between the user's buffer and the terminal vertex might be long. It would not make sense to see "CSVToOrgTable" if I'm looking at an emacs-lisp buffer. It would make sense (albeit maybe surprising!) to see "ToCSV".

Molds should become as small as possible to promote reuse.

Traversing the path through the graph may fail if one of the mold's prerequisites fails. For this reason, prerequisites should be used sparingly.

Composable molds go away, because moldable-emacs handles that messy bit for us :)

Ah! Wonderful and useful ideas. I am rereading it, but just to say this is inspiring. My thoughts were to use :examples as type definitions. I didn't explore the idea yet, but the point was to use the input and output state (buffers) as the thing on which compute the available compositions. Definitely the current composition is a hack to define a path of the graph that you "bookmark" with a name/mold. Let me read it once more for understanding your idea more in detail. (By the way, do you really write on GitHub or is an export? Maths in Markdown?!)

This would reduce the maintenance burden of manually defining every path through the graph possible, and might surprise the user at what's possible with their current buffer.

I find this pretty interesting. The thing that I like is that even bare-metal as moldable-emacs is now, you can still discover paths one edge/mold at the time. Have you thought about the UI of your vision? ~~When you compute all the possible compositions, as a user how can I find out the one I want? From a generated name?~~ I see, you would give the to-org name of the final mold. Still as a user I may want a node that is not the final: this means that you need to treat all nodes as possible final nodes. I fear a lot of options XD

Currently I list the available molds as a list of keys. I was thinking to move towards letting the user choose a mold by computing it and show the result, but maybe you got some good ideas in this regard?

I think it should provide the type information of the data in increasing specificity (e.g. this is an emacs-lisp buffer, which means it's code, which means it's a tree, but it might be a tree with a pre-defined type, like an alist or plist, etc.). This is a vector of type-information

This seems a mold that needs to be written to me. You get a mold definition and it shows you its type information, and the molds you can compose. moldable-emacs should be able to compute this implicitly, but for debugging and thinking about things, it may be useful a nice visualization :)

it would need to walk the list of known molds, V={M0, M1, ..., Mn}, and for each mold, Mi: ...

I think we could make use of a principle here: the user is always in the lead. The idea is that this is a tool to help you telling your story better.

Calculating all combinations of molds could become expensive pretty quickly, so if we have such a principle we could make use of history and reduce the problem to precomputing only combinations of molds that are used often?

This approach changes molds a little. Instead of defining molds as BeginStateToEndState, e.g. "CSVToOrgTable", molds should only be named to refer to their end state, e.g. "ToOrgTable" (or I suggest, "to-org-table"). This is because that when the user is presented with options, the options will be terminal vertices, but the span between the user's buffer and the terminal vertex might be long. It would not make sense to see "CSVToOrgTable" if I'm looking at an emacs-lisp buffer. It would make sense (albeit maybe surprising!) to see "ToCSV".

This sounds another refactoring that I should do anyway! "to-org-table" is more pleasing than "ToOrgTable"? I am neutral since I just moved from Scala to Clojure XD Very good idea, thanks for sharing it.

Molds should become as small as possible to promote reuse.

Any suggestions about this is welcome! I tend to make private molds first and add to core.el and contrib.el only those that I feel are very reusable or exemplar. It would be useful to get examples of good small molds and of too big molds among those that exist from you.

My thoughts were to use :examples as type definitions. [...]

I think it should provide the type information of the data in increasing specificity (e.g. this is an emacs-lisp buffer, which means it's code, which means it's a tree, but it might be a tree with a pre-defined type, like an alist or plist, etc.). This is a vector of type-information

This seems a mold that needs to be written to me. You get a mold definition and it shows you its type information, and the molds you can compose. moldable-emacs should be able to compute this implicitly, but for debugging and thinking about things, it may be useful a nice visualization :)

That sounds like a neat idea! But I wonder if you run into some kind of base-case recursive problem. I.e., how do you express what type-vector the example is? It might be powerful to abandon the type-vector idea in lieu of working in terms of the shape of the data, but I think that has limitations that type-vectors solve.

E.g. you have an s-expression. There are heuristics you can apply, but not reliably, to determine: is this Common Lisp? Emacs Lisp? Scheme? Which Scheme? Just an alist? (Or even this aside, in English, in parenthesis)?

Molds taking data may not care, which is why I suggested:

Compare against each mold, Mj such that f(Mi, Mj) = { if Tj ∈ TVi, then Tj, otherwise nil.

By specifying their most general acceptable type, they are more connected in the graph.

(By the way, do you really write on GitHub or is an export? Maths in Markdown?!)

Just in markdown! Maybe I haven't been spoiled by anything easier, but it doesn't seem too onerous.

Have you thought about the UI of your vision?

I realized that presenting the user with only the end-state of a mold is not sufficient because the traversal of the edges is significant (e.g. code -> AST -> cyclomatic-complexity/file -> csv is much different than code -> line-count/file -> csv). I think this prevents the use of spanning trees. I think we would have to do a DFS backtracking algorithm.

So maybe the UX is:

Present with a list of end-state options. Quite natural since the user might have an end-state in mind, or be interested in a surprising end-state presented.
Then present paths through the graph to allow them to specify what they meant.
In the event that no path looks interesting, provide some key-binding to get back to the previous end-state selection.

I think we could make use of a principle here: the user is always in the lead.

Totally agreed, and I wish more software adhered to this and was truly a user agent. It's a big reason I use emacs.

The idea is that this is a tool to help you telling your story better.

That's certainly a great use. I suggest that it can also be a tool to allow user's to solve specific problems they might have by solving general problems in molds. E.g. your excellent blog post about moving private functions to the bottom. I hate having to solve the same problem over and over again in different contexts. I would much rather solve the general problem once, and then be able to use it in many contexts.

Calculating all combinations of molds could become expensive pretty quickly, so if we have such a principle we could make use of history and reduce the problem to precomputing only combinations of molds that are used often?

I think we can make some somewhat specific statements about how expensive this might be. Here's a slightly more detailed version of "Building the Graph":

Let's assume we use an adjacency matrix to represent the graph.
For a list of molds, V={M0, M1, ..., Mn}, each mold being M_i
For a mold M_i's type vector TV_i=<T0, T1, ..., Tn>
For a map S of Type -> list of Molds which provide it
O(n): Iterate V
- O(m): Iterate through TV
- O(1): Populate S with {T_i to M_i}
O(n): Iterate V
- O(1): Check S against a mold's input type to see which Molds provide it
- O(q): Add q edges to the adjacency matrix

So, roughly:

O(n*m*1+n*(1+q))
simplified O(nm+n+nq))
simplified O(n(m+1+q))
and since m will be a very small number, let's consider it constant too, so: O(n(2+q))
We could leave it at that, or consider that q is probably also relatively small, so: O(3n).
So some kind of super-linear runtime.

Next we'd need to do the DFS. I'll hand-wave over the implementation and see that Wikipedia says it'll take something like O(n+|S|) time.

So, with this naive implementation, we can build the entire graph in something like O(3n+n+|S|), or O(4n+|S|). So still super-linear. We might be able to go faster still by somehow building all the paths as we build the graph.

And then disk is cheap, we have a homoiconic language handy, so store the paths on disk and regenerate when we add molds.

"to-org-table" is more pleasing than "ToOrgTable"?

Indeed, lispy people favor kebab-case :) We could probably even drop the to- since that is self-evident.

Really I should just code all this up instead of opining in a textbox with math. We'll see if I can find the time!

Very cool you are hanging around, again pretty interesting ideas! I really like you can express in formulas the algorithm complexity, I just had a rough intuition it could pop :D

That sounds like a neat idea! But I wonder if you run into some kind of base-case recursive problem. I.e., how do you express what type-vector the example is? It might be powerful to abandon the type-vector idea in lieu of working in terms of the shape of the data, but I think that has limitations that type-vectors solve.

E.g. you have an s-expression. There are heuristics you can apply, but not reliably, to determine: is this Common Lisp? Emacs Lisp? Scheme? Which Scheme? Just an alist? (Or even this aside, in English, in parenthesis)?

I am not sure about how I would deal with that with the "example"/data shape approach. I was thinking to make the information captured in the example more detailed, and intersect with the :given conditions. How could the mold infer the type vector? Also I guess this would work for pure molds: molds that transform data. (Just a heads up, I have some private molds that generate data by querying an API like GitHub's: the majority of these can run at any point)

I realized that presenting the user with only the end-state of a mold is not sufficient because the traversal of the edges is significant (e.g. code -> AST -> cyclomatic-complexity/file -> csv is much different than code -> line-count/file -> csv). I think this prevents the use of spanning trees. I think we would have to do a DFS backtracking algorithm.

Totally agreed. I also had an eureka moment: assuming we have something like that working efficiently enough and that we can plot the graph, we could present how introducing a new small mold could affect the graph!! And we could even make hypotheses: "user! If you develop this mold (e.g., with this input and this output), you could make the graph twice as big because you could link these two unlinked areas". That would be dreamy!

adjacency matrix

I learn a lot from your messages! I guess we need to exploit the type vector to avoid loops to keep things finite here.

Really I should just code all this up instead of opining in a textbox with math. We'll see if I can find the time!

Hey the ideas are amazing as well, and little by little I may get there. A rough&mini proof of concept would speed up things though: even a mock would be good to just get feedback about the UI. Mock driven theoretical programming :D

Very cool you are hanging around

I am very unlikely to lose interest in this project as I think it has the possibility to lift more of my work out of the mud and help me achieve consistency with some things I'm bad at.

I am very likely to not check in for long periods of time as demands on my time pull me in different directions. I think that is maybe a common understanding with open source software, but I wanted to let you know anyhow.

I still plan on contributing the algorithm discussed above, but I have no idea when (or if someone will beat me to it).

Thank you again for creating this project!

ag91 / moldable-emacs