
pfd

purely functional data structures in Elm

This project reimplements in Elm the data structures presented in the book Purely Functional Data Structures by Professor Okasaki (1999). Okasaki first published this material in 1996 in his PhD dissertation at CMU. The book provides Standard ML and Haskell implementations.

Every popular algorithm textbook takes an unapologetically imperative stance. It was, and still is, the tradition in Computer Science (CS). For decades, functional programmers had to resort to imperative techniques that update the data structures in place, thereby clobbering them. This approach is incongruous, inelegant, indiscreet. But it is the accepted practice. Indeed, I published an earlier project, called CLRS, that implements the standard, imperative algorithms presented in the well-known textbook Introduction to Algorithms by Professors Cormen, Leiserson, Rivest, and Stein (CLRS 4ed 2022; formerly CLR 1ed 1990). That project uses Python—my apologies.

Okasaki's monograph, however, is unique; it is the first, and to this day the only one of its kind, that plumbs the depths of purely functional data structures. It is a tour de force of functional thinking. It draws from the shared mathematical tradition that underlies all branches of CS. If you are a CS student or are starting out in functional programming (FP), this book should be on top of your reading list. Even if you find no use in this project of mine, you should buy that book and read it, cover to cover.

I chose Elm because it is a purely functional language. Elm is designed for web front-end (browser) development work. Its purpose is to eliminate the all-too-common error on the web today: JavaScript's version of the null pointer problem. Consequently, Elm ended up as a decent FP language with a superb type checker and a rich ecosystem. Moreover, Elm is a pleasant language in which to program. Its type checker is singularly effective, especially when refactoring. Type checkers in most FP languages treat programmers with laconic disdain. Elm's type checker, however, encourages programmers to do better, and provides clear, cogent, constructive critiques. This makes Elm a good candidate in which to study FP, perhaps even as the first programming language.

Be that as it may, I view Elm as a pragmatic language in which to teach CS students disciplined web development, after they have learned FP in Haskell, OCaml, or Standard ML. This project could be of use for that purpose. The primary audience of this project, therefore, is CS students studying FP and data structures. The secondary audience is Elm users who need purely functional data structures.

Officially, Elm will not support back-end (server) development work. There is PineVM to remedy this "front-end-only" debility. There is also Elchemy, which is Elm for the Erlang BEAM, a VM whose performance and resilience have been honed over the past four decades in the telecommunications industry.

I could have used PureScript, another purely functional web development language that is used for both front-end and back-end work. But PureScript is just a Webskell. So, Okasaki's original Haskell implementation could readily be adapted to PureScript, perhaps even through automation. And it would be just as pointless for me to use OCaml, another great FP language with a long-established reputation. Because OCaml is just ML with objects, Okasaki's original Standard ML implementation can be adapted to OCaml with minor syntactic tweaks.

Lastly, this is an ongoing project. The initial commit, in mid-2023, includes a few traditional data structures, like binary search trees, sets, stacks, heaps, and the like, but in their purely functional form, along with the tests. The code, as well as this document, will eventually contain detailed commentaries that further explain some of the trickier concepts and techniques presented in the book.
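
To make "purely functional form" concrete, here is a minimal, hypothetical sketch of a persistent stack in Elm; the module and function names are illustrative and are not this project's actual API.

module StackSketch exposing (Stack(..), pop, push)

-- A hypothetical persistent stack: push and pop build and return new
-- stacks; the original is never modified.


type Stack a
    = Empty
    | Cons a (Stack a)


push : a -> Stack a -> Stack a
push x stack =
    Cons x stack


pop : Stack a -> Maybe ( a, Stack a )
pop stack =
    case stack of
        Empty ->
            Nothing

        Cons x rest ->
            Just ( x, rest )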

INSTALLATION

At present, Git-cloning is the only way to install this project. This is acceptable, since the intended use is for small class projects of CS students. When the project has reached a level of maturity and stability, I may publish it as an Elm package.

First, install Elm by following the official installation instructions. And because the supporting Elm tooling is distributed through NPM, we also need to install Node, which ships with NPM; follow the official Node installation instructions. Also install VSCode, again by following the official installation guidelines, along with the Elm tooling VSCode plugin. Do not use the older Elm Language Support plugin.

Next, type in the following commands at a terminal in a directory under which you wish to clone this project. On macOS, we could go to ~/Documents/.

$ cd ~/Documents
$ git clone git@github.com:amenzwa/pfd.git

Although the Elm compiler cannot produce a terminal-friendly, command-line application, we have something that suits our purpose well: the elm-test framework. We install it as follows, after having installed Elm and NPM.

$ cd ~/Documents/pfd
$ npm install -g elm-test elm-format elm-review

The above command installs elm-test, the test runner; elm-review, which helps improve your Elm code; and elm-format, which formats your Elm code in accordance with the strict community standard. After you have modified a piece of Elm code, always reformat it, either through VSCode or with elm-format from a terminal.
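
For example, from the top directory of the project, a command along these lines reformats the sources in place (src/ is where this project's Elm code lives):

$ elm-format src/ --yes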

Once all those bits have been installed, we can run the PFD tests, from the top directory of the project.

$ elm-test

ORGANISATION

The Elm implementations are organised as follows under the ~/pfd/src/ source directory:

STUDY

When you study Okasaki's PFD, and indeed any mathematically inclined textbook, read it in at least three passes: scan, dive, and climb.

First, scan the chapter. Pick up some key terms and concepts. Take in the figures. Do not worry about understanding every concept you encounter; just collect the key phrases.

Next, reread the chapter, but this time, dive deeply. Fill out those terms and concepts you picked up in your initial scan. Do not stop until you have grasped all the concepts presented in the chapter. Typically, the concepts presented in a chapter rely on those presented in the earlier chapters and, sometimes, reference those yet to be presented. If you have not read the earlier chapters, at least read the cited sections deeply, and follow the citation chain thoroughly. Once you have thoroughly studied the concepts and the ones they depend on, take a break.

Upon your return to studying, review the material once more at a very high level, before moving on to other topics. This climb is similar to scan, but this time, you are skimming the cloud tops with a full knowledge of the material; you are no longer wandering and probing about in the dark.

There are two ways to study this monograph. You can read the whole of PFD, cover to cover, before reading the Elm code provided in this project, or you can read one chapter at a time, studying the Standard ML code in the book and then reading the equivalent Elm code given here. Choose the approach that works for you.

I have documented every module with detailed citations to PFD: chapters, figures, and page numbers. There are line comments, too, citing the specific pages whence the idea or the algorithm came. In the tests, I have included not just API call tests but also tests that check the validity of various properties of the data structures during computation. The sources for these checks are the mathematical properties given in theorems, lemmas, remarks, and exercises in the book.
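
To give a flavour of such property checks, here is a hypothetical elm-test sketch; the insert and member functions are illustrative stand-ins for a set API (a sorted, duplicate-free list posing as a set), not this project's actual modules or tests.

module InvariantSketch exposing (suite)

import Expect
import Test exposing (Test, describe, test)


-- Illustrative stand-in: a set represented as a sorted, duplicate-free list.
insert : comparable -> List comparable -> List comparable
insert x xs =
    if List.member x xs then
        xs

    else
        List.sort (x :: xs)


member : comparable -> List comparable -> Bool
member =
    List.member


suite : Test
suite =
    describe "set invariants (illustrative)"
        [ test "insert is idempotent" <|
            \() ->
                let
                    s =
                        List.foldl insert [] [ 3, 1, 2, 1 ]
                in
                Expect.equal s (insert 1 s)
        , test "an inserted element is a member" <|
            \() ->
                Expect.equal True (member 2 (insert 2 []))
        ]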

CAUTION

syntax and semantics

In CS, we teach FP to undergraduate students, because its semantics serves as a base upon which to construct higher learning. In IT, we use FP, because these languages have cute syntax; therein lies the danger.

Object-oriented (OO) languages have been well entrenched in IT since the early 1980s. Objects and business models fit like hand and glove. Then, why did IT folks begin flirting with FP, starting around the 2010s?

In the 1980s, IT began adopting OO, en masse, as a means to rein in the 1970s procedural programming (PP) paradigm's laissez-faire attitude toward mutation. OO's obsession with hiding mutations was just the right technique to civilise the single-threaded PP programmes of the era. But this characteristic now prevents programmes from being able to exploit the power of cheap, multi-core CPUs. In the 2010s, the industry started looking to shared, immutable data as the necessary ingredient for multi-core exploitation, and OO's hidden mutation is now seen as the bottleneck. FP thus offers salvation.

Since the late 1970s, CS academics have been admonishing the IT industry to mature from PP into FP. But the industry satisfied itself with the OO quick fix, instead. Only now, when the industry is drowning in a sea of inter-mutating objects, do IT practitioners begin to appreciate FP's innate powers. And only recently has the industry tried to make a massive shift from OO to FP.

But old habits die hard. When C++ arrived on the scene in the early 1980s, many PP programmers began transitioning to OO. They wrote C-style PP semantics code in C++-style OO syntax. C++, a PP-OO hybrid language, was immensely popular, because most everyone in IT was then a C coder and C++ looks very much like C. In those early days, C++ did more harm than good to the efforts of those who were trying to embrace OO, because it made it too easy to fall back on the bad, old PP ways. Smalltalk, a simple, purely OO language, was a far better OO instructional language, but it never took hold in the industry. Then, in the mid 1990s, Java came charging onto the scene. Java is essentially a PP-OO hybrid language with Smalltalk semantics and C++ syntax. Java helped complete the industry's tortuous transition from PP to OO, and most everyone in IT is now an OO fancier.

Today's OO programmers are trying to adopt FP in a manner not unlike the way yesterday's PP programmers tried to adopt OO: they keep writing OO semantics code in FP syntax. All the popular new languages that came out around the 2010s are OO-FP hybrids: Scala, Clojure, Kotlin, TypeScript, Swift, etc. Indeed, some are PP-OO-FP hybrids, which is worse. And using these languages to learn and adopt FP suffers from the same pains as the 1980s C++-driven OO transition.

Pure FP languages, like Haskell, Agda, Idris—and Elm—offer a disciplined way to transition from OO to FP. But it is plain to see that the IT industry, as a whole, has shunned these paradigmatically pure FP languages, and has clung on to OO-FP hybrids, out of habit. I do not believe this is an effective way forward.

PP has its value. Most embedded software targeting tiny MCUs is best implemented in plain C. OO has its value. It is impedance matched to small-scale business applications and simulation software. FP has its value. Its mathematical nature and its innate immutability are perfect for high-performance scientific computing, which is intensely mathematical and relies on parallel processing. So, be open to all that is available to you. But at the same time, when you are working with FP, shed your OO and PP propensities. Shun ref, for, class, and other indicia of the imperative paradigm, at least while you are learning to think functionally.

When learning a programming language, do not focus on the syntax: do not object to it, do not be enthusiastic about it. The purpose of syntax is only to provide programmer comfort. The real reason we write programmes is to convey the semantics of the problem being solved to the machine, and more importantly, to other programmers who have to maintain our code. So, learn the language's semantics, idioms, libraries, and local cultural norms. For these reasons, I reject the notion that a new language should copy the syntax of established languages so as to ease the transition. This was done with C++, Scala, Kotlin, TypeScript, and many others that adopted C's blocky syntax. But most programmers disagree with me on this point. You do you.

Do not learn a new language just because everyone else is using it; learn it because it will expand your mind effectively and solve your problems efficiently. Do note, though, that an experienced programmer acquainted with all the major paradigms still requires at least a couple of years of sustained use to become proficient in a large, modern, OO-FP hybrid language. But one thing is clear. Elm is a tiny language equipped with many tried-and-true, pure FP facilities. It can be learned in a couple of days, and it can be used in JavaScript web applications. You choose your poison.

laziness in all its glory

Most programmers today are aware of the concept of laziness, but they are unfamiliar with the implications. This is because all popular modern languages are OO-FP hybrids with a serious leaning toward the imperative. The imperative paradigm does not countenance laziness. Let us explore the meaning of laziness.

If a language incrementally evaluates expressions only when their values become necessary, and only up to the point needed to make progress with the current computation, and then memoises the computed values for subsequent reuse, the language is said to use the lazy evaluation strategy. Mind you, this requires no effort from the programmer. Lazy evaluation is an optimal implementation of the $\lambda$-calculus notion of normal-order reduction. Laziness must not be confused with delayed computation, which is a computation suspended at the present moment. When the computation is eventually resumed, it monolithically evaluates the entire expression in one go, and the results are not memoised. Typically, the programmer is responsible for implementing delayed computation.

So, both lazy evaluation and delayed computation perform the work at some future time. Lazy evaluation proceeds incrementally and caches the results for later reuse, whereas delayed computation does not cache the results and completes the task monolithically.
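
To make the distinction concrete, here is a minimal Elm sketch of programmer-implemented delayed computation using plain thunks; the names are illustrative, not part of this project or of any library.

module DelayedSketch exposing (..)

-- A suspension is just a thunk, forced on demand. Unlike true lazy
-- evaluation, nothing is memoised: forcing the same suspension twice
-- recomputes the value.


type alias Delayed a =
    () -> a


delay : (() -> a) -> Delayed a
delay thunk =
    thunk


force : Delayed a -> a
force suspension =
    suspension ()


-- The sum below is not computed until force is applied.
expensiveSum : Delayed Int
expensiveSum =
    delay (\() -> List.sum (List.range 1 1000000))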

Languages that use the call-by-value parameter-passing mechanism are said to be eager languages, whereas those that use the call-by-need parameter-passing mechanism are said to be lazy languages. When a function is invoked with call-by-value semantics, it is passed arguments that have been fully evaluated. When a function is invoked with call-by-need semantics, its arguments are passed unevaluated, and they are evaluated only when the function needs them, and only up to the point where those needs are satisfied. Moreover, the evaluated results are cached by the language, so those computations occur only once. So, a lazy language pays for computational efficiency with extra space expenditure.

Note that eagerness and laziness are language-wide strategies used in evaluating expressions, whereas call-by-value and call-by-need apply only in the context of function calls. It just so happens that languages that use the eager evaluation strategy also use call-by-value parameter passing, and those that use the lazy evaluation strategy also use call-by-need parameter passing, because such pairings are natural.

Anyway, I just wanted to clarify these terms, because CS folks bandy them about without explanation and IT folks confuse them without examination. Now, onward to a more pressing issue.

There is a fundamental conflict between the amortisation analysis technique and the purely functional implementation technique. See Chapter 6 p.57. Traditional algorithms analysis is imperative to the core. It assumes the ephemeral (mutable) nature of data structures and relies on the mutator being a single thread of computation to preserve consistency. But purely functional data structures are inherently perennial (immutable). Any alteration to an existing data structure produces a new copy thereof, without destroying the original. This permanence of data is called persistence. And being read-only in nature, purely functional data structures are well matched for simultaneous use by multiple threads of computation. Parallelism and persistence invalidate the theoretical performance guarantees attached to single-threaded, amortised data structures. It turns out that lazy evaluation is the necessary mediator that restores peace between these two warring factions. See §6.2.2 p.59.

It is no surprise, then, that more than a third of Okasaki's book is devoted to lazy data structures. But like most modern programming languages, Elm lacks a built-in lazy evaluation mechanism. Indeed, Standard ML does not have a built-in lazy evaluation mechanism either, but there are non-standard lazy extensions to the language. Okasaki used one such extension to implement the amortised, lazy data structures.

At present, Elm has no lazy extension, at the language level. There was a built-in Elm lazy facility, but it has been deprecated (read decapitated) as of Elm 0.19.1, the latest version at the start of this project. As is the wont of the Elm community, there is a revival of this lazy facility, but it is not part of the Elm canon.

I have not resolved how to incorporate laziness into this project, mainly because I have been lazy.

features or the lack thereof

Elm has parametric polymorphism in its algebraic data types, but it does not have ad hoc polymorphism: neither C++-style function overloading nor Haskell-style type classes. In earlier versions, Elm supported operator overloading, but it was removed in v0.19. Given the narrow scope and small size of this project, however, Elm's lack of ad hoc polymorphism is not an infirmity.
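
Here is a small, illustrative sketch of that difference: parametric polymorphism lets one definition serve all element types, but anything type-specific, such as printing, must be passed in explicitly, because there are no type classes. The names are illustrative.

module PolySketch exposing (..)

-- Parametric polymorphism: one definition serves every element type,
-- because the body never inspects the elements.


size : List a -> Int
size xs =
    List.length xs


-- Without type classes, there is no generic show : a -> String; the
-- element printer must be passed in explicitly for each concrete type.
-- For example, showList String.fromInt [ 1, 2, 3 ] yields "[1, 2, 3]".


showList : (a -> String) -> List a -> String
showList showElem xs =
    "[" ++ String.join ", " (List.map showElem xs) ++ "]"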

Likewise, Elm 0.19 limits tuples to triples. While this restriction can be justified on philosophical grounds, it is not a practical choice. It is true that tuples should only be used to represent small product types, like Complex and Point. But in FP, pattern matching against tuples is as common as clay and, if used judiciously, is a powerful, convenient technique. Clearly, no one should use 10- or 20-tuples. But forcibly limiting them to triples is unkind.
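
For instance, pattern matching against pairs is as direct as this illustrative sketch of complex addition:

module ComplexSketch exposing (..)

-- Pattern matching against pairs: the tuple components are named
-- directly in the function's argument patterns.


type alias Complex =
    ( Float, Float )


add : Complex -> Complex -> Complex
add ( re1, im1 ) ( re2, im2 ) =
    ( re1 + re2, im1 + im2 )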

Another irksome trait of Elm is its lack of the $\bot$ crash facility, as in Standard ML's raise or Haskell's error, thus mandating the use of Result everywhere. This, however, makes sense in the front-end context, since the user should never see a crash. But in an education or demonstration context like this project, it is mighty inconvenient. So, to simulate a $\bot$ crash, I force an infinite recursion, thereby inducing a stack overflow crash (see CORRECTIONS below for the necessary tweak). The absence of a $\bot$ crasher is unfortunate.

Elm's community-standard formatter tends to spread the code out vertically, instead of horizontally. Often, a variable would show up on a line, all by its lonesome. And double-blank lines are injected, willy-nilly. Given that computer monitors are wider than they are tall, this indiscriminate use of the vertical real estate is wasteful, and it diminishes the amount of information that the programmer can absorb in one glance. But it is the format upon which the Elm community has settled. They justify this canonical format as a means to enable everyone to read anyone else's code. In this age of advanced code formatters, that is one of those weak-kneed arguments proffered by a strong-armed majority. In any case, we play by their rules, on their court. The community prides itself on this and other compliant conduct which, at times, can be unhelpful.

CORRECTIONS

In the initial release, I made a wrong claim that Elm does not support inner functions. Eniac314 on GitHub pointed out my mistake. To prevent propagating the error, I have removed that erroneous complaint from this document and have refactored the auxiliary functions in my code to use inner functions, as Eniac314 suggested.

Another error on my part was to assume that I could induce a $\bot$ by forcing an infinite recursion. But as pointed out by Leonardo Taglialegne on GitHub, the Elm compiler optimises this tail recursion, thus preventing the intended stack overflow crash. I have followed his suggestion to inject the identity call to the left of the infinite recursive call, thereby circumventing the compiler's tail-call optimisation.
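
A minimal sketch of the resulting technique might look like the following; the name bot is illustrative, not necessarily the one used in this project's code.

module BotSketch exposing (bot)

-- Wrapping the recursive call in identity keeps it out of tail
-- position, so the compiler cannot apply tail-call optimisation and
-- evaluation eventually overflows the stack.


bot : () -> a
bot _ =
    identity (bot ())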

I offer my sincere thanks to these generous GitHubers who gave their time to correct my errors.

CONCLUSION

Elm's limitations enumerated above amount to mere inconveniences when it is used in a limited way on a small scale, like a class project or a self-study project. And there are adequate workarounds that do not detract from Elm's innate elegance and its suitability as an FP teaching language. I hope young CS students and junior IT practitioners would peek over their Python and JavaScript desks and glance at Elm. That this generation has access to free, abundant, solid tools, like Elm and other open-source software, is indeed fortunate.

sources for courses

If you intend to study Okasaki's PFD as the source material, you must already have taken at least the introductory courses on discrete structures (mathematics, hence declarative), data structures (imperative), algorithms (imperative), programming languages (all paradigms) and functional programming (declarative).

Indeed, you should study other foundational CS topics—formal languages, computability theory, complexity theory, category theory, compilers, etc.—either in a classroom setting or in a bedroom setting. Although these topics are inessential in the practice of modern, mundane IT, they are the indispensable foundations of CS, and your possessing such knowledge will only help, not hurt, your career, be it in academia or in industry. Here are some of the source materials I recommend:

THEORY

PRACTICE