ezrosent / frawk

an efficient awk-like language
Apache License 2.0
1.24k stars 34 forks source link

Implement First-Class Functions (aka `gawk` Indirect Calls) #82

Open chrisdotcode opened 2 years ago

chrisdotcode commented 2 years ago

Seeing as how awk would be greatly enhanced by the addition of first-class functions (FCFs), it would be nice to add them to frawk.

It turns out that gawk already has an implementation of FCFs in the form of Indirect Calls, introduced in version 4.0. Here's an example:

function call(f, x) { return @f(x) }
function double(x) { return x*2 }
BEGIN { print call("double", 2) }

While an extra sigal @ is introduced, and the function is called in a "stringly-typed" manner, I still think it would be wise to hop onto an already-existing implementation (and hopefully indirect calls make it into POSIX!) instead of trying to create some alternate.

Since I'm new to LLVM/cranelift, if I was willing to volunteer to implement this, @ezrosent, would you support it/help me work through the codebase to implement it?

ezrosent commented 2 years ago

Hi! Thanks for the interest,

I'm not opposed, but this sort of feature will be pretty difficult to add to frawk (moreso than other features that gawk has that frawk currently does not, like multidimensional arrays). The problem is that the dataflow analysis that frawk currently uses to infer types will not work for indirect functions, and I'm not sure how to fix it while maintaining the "static" character of the compilation model (where functions are compiled once, ahead of time). Other paths forward might be to (a) be more dynamic or (b) implement FCFs in a way that supports 'defunctionalization' (e.g. as is done in Furthark). Either of these would be pretty difficult, and the latter would take us pretty far afield from how things work in gawk. Happy to elaborate on this more if you'd like me to go into further detail.

Out of curiosity, do you use this feature in gawk? My guess here is purely anecdotal, but I've always thought that multidimensional arrays are more popular than the indirect call functionality. Either way, I'm definitely curious to hear about what features people use and enjoy. I use first-class functions pretty often in most languages I write on a regular basis, but I've never had occasion to use gawk's implementation.

chrisdotcode commented 2 years ago

My rudimentary intuition for implementation was that indirect calls could use the function table and search for a function by name, and then either:

  1. create a new, combined function for each occurrence, (a la type erasure in Java) (also cf. http://www.btellez.com/posts/fp-awk.html#higher-order-functions).
  2. compose the functions in Rust, and then resume static analysis from there.

Do either of these work with frawk's current implementation? My bet would be the first one is more tenable.

I'd definitely be interested in further detail if you had the time.

As far as usage is concerned, I feel as if it's a chicken-and-egg problem: since most people don't come from a functional background, most people wouldn't even think to use indirect functions as FCFs. And since there's no first-class support, I often end up monomorphizing the functions by hand myself (map => map_add2, filter => filterIsEven, filterIsEmptyString, etc.). filter is really the biggest one I miss, and I think really jells well with awk's philosophy: filtering rows based on a particular criteria.

I feel as if multidimensional arrays can be approximated decently enough (although I would still want to see them implemented at some point!), but there's really no substitute for the polymorphism that language-level FCFs would bring.

chrisdotcode commented 2 years ago

Relatedly, it would be really sweet if there was also lambda support that went along with this, so that you could compose FCFs and anonymous functions, a la: filter(\x -> x % 5 == 2).

ezrosent commented 2 years ago

Sorry for the delayed reply, and thank you again for your interest in contributing to frawk and your suggestions for improving it!

Technical Challenges

frawk compiles a separate function for each sequence of argument types it might get passed. Because the number of distinct invocations is exponential in the number of function arguments, frawk makes sure to only compile instances of the function that might get called.

This leads to the first problem with indirect functions: we cannot in general predict what the value of the string determing the name of a function will be, which means we have to choose between:

  1. Compiling more functions than we need, or
  2. Recompiling new functions dynamically "as they come in"

The first option could potentially lead to major cliffs in compilation time when we cannot suitably reduce the number of functions compiled ahead of time. Recompiling new functions and performing a table lookup at runtime is probably the preferable choice here, but it too has a number of downsides, it:

This is all in addition to the complexity of invoking the compiler at runtime, which is something that frawk currently is not set up to do.

But there's a second problem that I think may be more difficult than the first to resolve satisfactorily, which is how a function call in frawk affects the surrounding program outside of the function. For example, with the function x:

function x(m, k) {
    return m[k]
}
  1. The returned value depends on the type of values stored in the array m.
  2. The inferred type of the array is also determined (again, ahead of time) by the types of keys that frawk thinks might be stored in the map.

(1) means that it wouldn't be sufficient to compile new instances of a function on the fly (or even compile all possible functions ahead of time): to compile the calling function we would need to have some idea of what the return type is, and in general we cannot know. Any solution here needs to have a satisfactory answer for how to "guess" a return type that doesn't result in dramatically worse performance (e.g. making all scalar return values strings).

(2) means that we would have to either treat indirect function invocations specially (e.g. not allow them to participate in type inference; this would lead to some big difference in semantics from gawk), or make extremely conservative assumptions during type inference for programs that use indirect functions, making for a confusing source of performance problems.

Note that a lot of these issues are specific to adding a mechanism that lets you call a function based on a string representing its name. There are certainly better ways out there to implement a feature like this!

Skepticism about higher-order constructs.

As for whether or not wider availability of functional programming constructs is standing in the way of the programming style being adopted in Awk programs, I don't think I agree with your reasoning. Gawk seems to be far and away the most popular distribution of the language, but (again, based on anecdotes --- I'm happy to be proven wrong) the feature does not seem to have garnered widespread use. I agree that other features like lambdas and returning arrays (which frawk does let you do), would improve matters but I do not see them as the biggest problems here.

Awk is a language built around imperative constructs and it lacks the abstraction power to allow programmers to build idiomatic constructs from the world of functional programming. For example Awk lacks good support for immutable data-structures or lazy streams (with the possible, limited exception of standard input, which is operated on by a stateful, imperative program and cannot be composed easily except via a shell pipe to a new awk program).

I'm somewhat open to extending frawk with some form of higher-order functions, but I think we would want to design a new construct that allows for a more static compilation model (perhaps along the lines of the Futhark paper I linked above). I am, however, skeptical that this alone would allow frawk to support more functional idioms that we get in more feature-rich languages.

Lastly, I will note that there's active work on a language in the same vein as Awk called Jacinda written from the ground up with functional programming in mind, based on your interests I think you might find it cool!

chrisdotcode commented 2 years ago

Thanks for the detail and link to Jacinda, @ezrosent!

Technical limitations aside, the reason I suggested Indirect Calls as an implementation for FCF is because awk's biggest strength is its POSIX specification (and subsequently gawk compatibility). With the exception of the few edge cases, I like frawk because it lets me use my already-written POSIX scripts with the frawk runtime for free speed++. Likewise, I can take scripts I've written locally with frawk, and transfer them onto any POSIX-compatible server on the world, and have them run with no modification. Obviously a huge productivity boon.

Since, as you mentioned, gawk is leading the way, I foresee Indirect Calls ending up in POSIX in a few years, and as a result, by implementing Indirect Calls now, frawk would keep POSIX/gawk-compliance(ish, I know that POSIX compliance isn't strictly a goal of frawk), and get FCFs for free. Choosing an alternate implementation of FCFs means that frawk gets better, but then all of a sudden, I now have .awk scripts and .frawk scripts, instead of just the one (and yes, I know that gawk has differences from awk, but practically speaking, gawk is the most popular and hence compatibility is worth shooting for, IMO). And I lose all POSIX-compatibility (instead of just incidental compatibility) and installed-by-default-ness.

Tangentially, for the most part, I see awk as a mostly-functional language, but perhaps that's because I already have a functional background. I'm sure that my usage of {fr,g}awk is possible non-normal, but awk seems to be good for many things people reach for other languages for (mostly because of lack of knowledge, I presume), but I do think awk would hit that last milestone of power with FCFs.