BartJongejan / Bracmat

Programming language for symbolic computation with an unusual combination of pattern matching features: tree patterns, associative patterns, and expressions embedded in patterns.
GNU General Public License v2.0

Comparison to Mathematica's pattern matching capabilities? #11

Closed rljacobson closed 2 years ago

rljacobson commented 2 years ago

I am a nonexpert in this field—but I aspire to one day become one! I am trying to understand the taxonomy and am interested in algorithms for the kinds of matching Mathematica is capable of (and more, e.g. order-sorted matching). Bracmat looks like a fruitful object of study.

The readme lists some of Bracmat's matching features:

Is Bracmat capable of these other kinds of matching that I am interested in studying?

In [1], Kutsia explains that flat matching is distinct from associative matching. (I have yet to study this paper carefully.) In particular, associative matching is a special case of flat matching. As far as I can tell, flat matching has not received nearly as much attention as associative matching in the literature. I am having a hard time finding implementations of flat matching.
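To make the distinction concrete, here is a minimal Python sketch (a toy representation of my own, not taken from either paper) of matching with sequence variables against a flat argument list, enumerating every match rather than stopping at the first:

```python
def seq_match(pattern, subject):
    """Enumerate all matches of a variadic pattern against a flat
    argument list. Pattern elements are literal symbols or sequence
    variables (strings starting with '?') that bind zero or more
    consecutive arguments."""
    def go(pat, sub, binding):
        if not pat:
            if not sub:
                yield dict(binding)
            return
        head, rest = pat[0], pat[1:]
        if head.startswith('?'):
            if head in binding:            # nonlinear occurrence:
                n = len(binding[head])     # must repeat earlier binding
                if sub[:n] == binding[head]:
                    yield from go(rest, sub[n:], binding)
                return
            for i in range(len(sub) + 1):  # try every split point
                binding[head] = sub[:i]
                yield from go(rest, sub[i:], binding)
            del binding[head]
        elif sub and sub[0] == head:
            yield from go(rest, sub[1:], binding)
    return list(go(pattern, subject, {}))
```

For example, `seq_match(['?x', 'a', '?y'], ['b', 'a', 'c', 'a'])` yields two matches, one per occurrence of `a`; Bracmat's `?x a ?y` patterns behave roughly analogously under its associative white-space operator.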

[1] Temur Kutsia, Flat matching, Journal of Symbolic Computation, vol. 43, 2008, p. 858–873

[2] Besik Dundua, Temur Kutsia, and Mircea Marin, Variadic equational matching in associative and commutative theories, Journal of Symbolic Computation, vol. 106, 2021, p. 78–109

BartJongejan commented 2 years ago

Is Bracmat capable of these other kinds of matching that I am interested in studying?

If I have understood the two cited papers correctly, with flat matching (or is it variadic matching?) you could parse XML data using a pattern that ignores 'span' elements without attributes, thus obtaining a somewhat flattened view of the data. For that, you would have to declare the 'span' function as flat.

In Bracmat, I would have to deconstruct the XML data with a pattern that looks for 'span' elements and then I could construct a flattened structure, like so:

  This is not (span.,nice but also) not bad:?a (span.,?m) ?z
& !a !m !z
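In a general-purpose language, the same deconstruct-and-splice idea might look like this Python sketch (the tree encoding is mine, chosen for illustration; it is not Bracmat's internal representation):

```python
def flatten_spans(nodes):
    """Recursively splice the children of attribute-less 'span' nodes
    into the surrounding sequence, producing the flattened view
    discussed above. A node is a plain string or ('span', children)."""
    out = []
    for node in nodes:
        if isinstance(node, tuple) and node[0] == 'span':
            out.extend(flatten_spans(node[1]))  # splice children in place
        else:
            out.append(node)
    return out
```

Applied to the example above, `['This', 'is', 'not', ('span', ['nice', 'but', 'also']), 'not', 'bad']` flattens to a single sequence of words.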

You may also want to look at the Egison programming language. It has a very powerful pattern matching implementation, it seems, including commutative pattern matching.

rljacobson commented 2 years ago

Thank you so much for such a great reply!

Bracmat automatically puts expressions in canonical form…

But in some contexts there may not be a meaningful canonical form. Bracmat could impose its own arbitrary notion of canonical form, but there may be extrinsic reasons to keep the subjects unmutated. In such a case, I suppose you could match against a mutated copy of the subject and say the subject matches if the mutated copy matches.

I think, though, that the situation is different if one wishes to enumerate all possible matches, as opposed to finding only a single match, yes?

…look at the Egison programming language.

Thanks! It's on my radar.

One of the challenges I have faced in other areas is how bad the source code is. As a general rule, academics tend to write uncommented spaghetti code with variables named things like tmp and vars. Egison's source code is beautifully written Haskell, and Bracmat is well-commented, solid C. If I could make a feature request, it would be to break the source into separate files to make it easier for human consumption, but I understand the reasons for a single file.

BartJongejan commented 2 years ago

You are right that there is not always a meaningful canonical form. Here is an example:

x+y+a*x+a*y+2*b*x+2*b*y                                                                 (1)

vs

(1+a+2*b)*(x+y)                                                                         (2)

These two expressions are equivalent, and it is hard to argue that one is more canonical than the other.

Consider that we add

-2*b*y

then what do we expect to be the most canonical answer?

-2*b*y+x+y+a*x+a*y+2*b*x+2*b*y                                                          (3)

evaluates of course to

x+y+a*x+a*y+2*b*x                                                                       (4)

But what about

-2*b*y+(1+a+2*b)*(x+y)                                                                  (5)

Another example. Suppose we divide by

(x+y)

In the first representation, we get:

(x+y)^-1*(x+y+a*x+a*y+2*b*x+2*b*y)                                                      (6)

In the second case, we have

(x+y)^-1*(1+a+2*b)*(x+y)                                                                (7)

which immediately is simplified to

1+a+2*b                                                                                 (8)

In Bracmat, (1) and (2) are both stable expressions. By that I mean that (1) evaluates to (1) and (2) evaluates to (2). Expression (5), however, is not stable. The rule is: if an expression is a product of sums, then it is stable. If an expression is a sum of products, and none of the factors is a sum, then the expression is stable. So (5) is not stable, since it is a sum with more than one non-trivial term (0 being the trivial term) and at least one of the terms is a product that contains a factor that is itself a sum. So (5) evaluates to (3) and then to (4):

-2*b*y+(1+a+2*b)*(x+y)  ===> -2*b*y+x+y+a*x+a*y+2*b*x+2*b*y  ===> x+y+a*x+a*y+2*b*x     (9)

And so we see that adding a term to (1) and to (2) results in the same expression, (4).
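The stability rule can be restated as a small predicate. Below is a Python sketch over a toy AST (tuples for sums and products, strings for atoms); this is my reading of the rule, not Bracmat's actual code:

```python
def is_sum(e):     return isinstance(e, tuple) and e[0] == '+'
def is_product(e): return isinstance(e, tuple) and e[0] == '*'

def is_stable(e):
    """A lone atom and a product (whether or not its factors are sums)
    are stable. A sum is stable only if none of its terms is a product
    with a sum among its factors -- otherwise distribution applies."""
    if is_sum(e):
        return not any(is_product(t) and any(is_sum(f) for f in t[1:])
                       for t in e[1:])
    return True
```

Encoded this way, expressions (1) and (2) both test stable, while (5) does not, matching the evaluation chain in (9).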

The other example could go along the same lines, but it doesn't. Factorization is an expensive operation, so Bracmat does not attempt to simplify (6) to (8). As a user of Bracmat, one has to decide whether to spend the processing time needed to simplify (6) to (8). That can be done by calling the fct (factorize) function:

fct$((x+y)^-1*(x+y+a*x+a*y+2*b*x+2*b*y)) ===> 1+a+2*b                                   (10)

Now, note that we did some "mutations" in these examples, but they were always much more complex than commutations (swapping the order of terms in sums and swapping the order of factors in products). So I conclude that commutative pattern matching is not the right programming construct for establishing that (1) and (2) are equivalent, or (6) and (8).

If one removed all programmability from Bracmat, all evaluations in the above examples would still happen, except (10), because the fct function is not written directly in C. If it were not for expensive operations such as factorization, Bracmat would not have become a programming language. It would still be a non-programmable calculator for symbolic expressions, like its predecessor that ran on an Amstrad CPC464. The only operators would have been 'mathematical' ones: summation, multiplication, exponentiation, logarithm and differentiation.

Just to be able to factorize a polynomial, I implemented pattern matching, pattern variables, functions, and flow control. While doing that, I developed a professional interest in Natural Language Processing. I added white space, comma and period as binary operators. These operators have much less 'autonomous behaviour' than the mathematical operators. Of the three, white space became the most interesting, being associative: it denotes concatenation, with the empty string as the neutral element.

Thank you for wanting to study the C source code. May I ask what you hope to find?

I am fully aware that the source code is hard to read. I myself always need ample time to understand my own thoughts from when I wrote the code. I have to follow the logic line by line, and then I conclude: yes, that is how it is done, and it is done right. Sometimes I add a comment because I think I know how to formulate what task is being solved, and sometimes I delete a comment because it gets in the way.

I am still not convinced that cutting up the source code is a good idea. It is very unconventional to have a source file of over 16000 lines, sure, but would it get any better with 10 source files of 1600 lines each, or 100 of 160 lines each? Wouldn't it become more difficult to navigate the source code in e.g. Visual Studio with so many source files? Would compilers still be able to decide that some functions can be inlined?

I like to use function declarations as little as possible. A function declaration is used in the Bracmat source code only if two or more functions call each other. So, if you see a function declaration, you know at once that there is something special about that function. If the source code were split up, you would not know that, because there would be many more function declarations, in header files. You would also not get an immediate impression of whether the function you are currently looking at sits high or low in the calling hierarchy. In bracmat.c, main() is the last function, because it depends on everything else, while functions that depend only on the standard C libraries sit nearer to the top of the file.

There are two more, much smaller source files, for parsing XML and JSON respectively. They are separate from the rest, because they could be reused in other programs, with no or only few changes. On the other hand, nothing in the big source file is of any use outside Bracmat, I think.

I understand that you use Mathematica. I have never had the opportunity to use that tool/language. And it seems that you are going to look at other pattern matching languages as well. I am very interested to know what you find out.

rljacobson commented 2 years ago

Once again, thank you for such a thoughtful and complete reply.

I have spent several hours today studying (Dundua 2021), which describes a formal framework for matching in associative and commutative theories. A key component of their system is a normal form for terms. The ordering of function symbols, terms, and term sequences is arbitrary but fixed.

An associative normal form (A-normal form) of a term or a sequence is obtained by rewriting it with the associativity axiom from left to right as long as possible. …

We introduce a strict total order on function symbols and extend it to ground terms and term sequences so that the obtained ordering is also total. A commutative normal form (C-normal form) of a ground term is obtained by rearranging arguments of commutative function symbols to obtain the minimal term with respect to the defined ordering.

The system does not account for distributive functions, so mathematical expressions (for example) are not semantically unique w.r.t. normal form. They are only unique w.r.t. the chosen well-founded total ordering, associativity and commutativity. My mental model is to think of it as an extension of lexicographic ordering.
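The two normal forms can be sketched in Python over a toy term representation (nested tuples; `repr` stands in for the paper's fixed total order on symbols and terms, which is an assumption of this sketch):

```python
def normalize(term, assoc=frozenset(), comm=frozenset()):
    """Toy A/C-normalization for variadic terms: a term is a symbol
    (string) or a tuple (f, arg1, ...). Applications of associative
    symbols are flattened (A-normal form); arguments of commutative
    symbols are sorted by a fixed total order (C-normal form)."""
    if not isinstance(term, tuple):
        return term
    f = term[0]
    args = [normalize(a, assoc, comm) for a in term[1:]]
    if f in assoc:  # f(x.., f(y..), z..) => f(x.., y.., z..)
        flat = []
        for a in args:
            flat.extend(a[1:] if isinstance(a, tuple) and a[0] == f else [a])
        args = flat
    if f in comm:   # rearrange into the minimal term w.r.t. the order
        args.sort(key=repr)
    return (f, *args)
```

For instance, an associative `f` flattens `f(f(a, b), c)` to `f(a, b, c)`, and a commutative `g` rewrites `g(b, a)` to `g(a, b)`.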

So I conclude that commutative pattern matching is not the right programming construct for establishing that (1) and (2) are equivalent, or (6) and (8).

Agreed. The systems I am familiar with, which admittedly aren't many, all use ad-hoc heuristics for distributivity and algebraic cancellation. But as far as I can tell there is no theoretical difference between rewriting using the associative and commutative axioms versus rewriting using the distributive axiom or algebraic cancellation. It is a judgment call based on, for example, informal assessments of the complexity of the rewrite rule(s), estimates of code complexity, availability of theoretical frameworks like that of (Dundua 2021), and so forth.

Distributivity is suspiciously absent from the literature I have surveyed.

May I ask what you hope to find?

I have long been interested in pattern matching, because it has "fingers" in so many different areas of mathematics and computer science. Recently I have had the time, energy, and inclination to fill some rather large gaps in my understanding of pattern matching and unification. I have been learning about theoretical foundations, and I would now like to learn how these algorithms are implemented.

My professional career has oscillated between mathematics and computer science. I recently transitioned out of academia to a position in industry. There is no direct connection to my professional work right now. But the approach I have taken in my career is to learn about and do the things I really enjoy and then make my professional life accommodate that. It is a very fortunate accident of history that the skills I have acquired doing this happen to be extremely marketable right now. But this is more autobiography than you asked for.

…would it get any better to have 10 source files with 1600 lines each, or 100 with 160 lines each? Wouldn't it become more difficult to navigate through the source code in e.g. Visual Studio with so many source files?

Yes, I think it would be better from the programmer's perspective. Splitting code into separate files, separate groups of files, and separate libraries has the same utility as splitting up a book into chapters, sections, and subsections. It is easier to understand something when you know what you don't have to pay attention to. And it makes the source code easier to navigate. For example, if you know you want to fix a bug in the parsing routines, you know that those routines are probably in parser.c. I do not think it would be much better to have hundreds of files in a single directory, but I think there is an optimum somewhere between those two extremes. If the project is large enough to have hundreds of files, they should be organized into folders and subprojects and libraries.

I also think there is a practical advantage from a software engineering point of view, because it encourages the programmer to think of the program as a modular system with discrete components that work together. While the language features that enable modularity have varied over time and between languages, I think the history of software engineering has converged on the principle of modular design.

Will compilers still be able to decide that some functions can be inlined?

They are getting better at link-time optimization, but it's still not as good as having a single compilation unit. On the other hand, is this a case of premature optimization? There is no replacement for empirical measurement.

In any case, you can have separate source files and still have them in the same compilation unit by using #include.

So, if you see a function declaration, you know at once that there is something special about that function.

The C equivalent of ML's let rec. Knowing that two functions are mutually recursive has either been trivial or else has not been useful/important to me for the kinds of programs I write. But if it is, then that information can be recorded in a comment. Also, modern IDEs can show every reference to any given symbol.

At the end of the day, what matters is what makes things easiest for you as the author, maintainer, and (presumably) primary user of the software. It doesn't matter what my arguments are for doing things differently if it makes your job any harder. I am just grateful you have shared your code with the world.

[Dundua]: Besik Dundua, Temur Kutsia, Mircea Marin, Variadic equational matching in associative and commutative theories, Journal of Symbolic Computation, vol. 106, 2021, p. 78-109, ISSN 0747-7171, https://doi.org/10.1016/j.jsc.2021.01.001.

BartJongejan commented 2 years ago

"There is no replacement for empirical measurement." - there you got me. I am currently isolating the "lambda" function in a separate C file. I will see where that brings me.

rljacobson commented 2 years ago

I am closing this issue to keep your open issues list uncluttered. Thank you for the great exchange!

BartJongejan commented 2 years ago

The broken-up source code is in the /potu subfolder. As I explain in README.md, the source code can be compiled and linked in two ways: either by aiming the compiler at potu.c, which then includes all other .c files, or by compiling each .c file separately and linking afterwards. Following the first method, only header files that contain type definitions and macros are included, while all header files that contain global variable declarations or function declarations are suppressed, thus forcing a partial ordering on the inclusion of the .c files in potu.c.

I don't know yet what to do with the original source code. I want to keep it at least for some time, since it has proven to produce very reliable executables. The executables based on the source code in /potu pass all tests in the valid.bra file, though.

rljacobson commented 2 years ago

You could tag the last commit with that source code with a meaningful name. That way, if there is ever an issue, you know where to find it.

I am interested to know if there are any meaningful differences in benchmarks. Compiling potu.c should theoretically produce equivalent output to the original source. My prediction is that separate compilation will not produce a detectable difference. Also, what are the differences in compilation times between the different methods? I'll try it on my M1 MacBook Pro and report back.

BartJongejan commented 2 years ago

Now there are three ways to create an executable: A. compile and link the old bracmat.c, json.c and xml.c; B. compile each of the .c files in the potu subfolder and then link; C. compile potu.c, which includes all other .c files, and then link.

A. gives the smallest executable. B. and C. produce executables of the same size, at least with the Visual C compiler. At first I saw a speed difference when running the validation test, valid.bra, with B. being slower than A. and C., but the differences disappeared as I continued breaking up the source code.

I have not measured build times.
