cucapra / pollen

generating hardware accelerators for pangenomic graph queries

MIT License

Parser #38

Closed susan-garry closed 1 year ago

susan-garry commented 1 year ago

The beginning of a bona fide parser for pollen! This lays out some basic infrastructure and implements parsing for a few basic features.

What it can do:

Parse variable declarations and initializations, e.g. int i; int i2 = e
Parse simple uop and binop expressions, both for integers and booleans
Output the AST of the pollen program that it parses
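
To make the list above concrete, here is a minimal sketch of what an AST covering declarations plus unary and binary expressions might look like. Every type, variant, and function name here is an assumption for illustration; the actual AST in this PR may be organized quite differently.

```rust
/// Hypothetical unary operators (names are assumptions, not pollen's).
#[derive(Debug)]
enum Uop { Neg, Not }

/// Hypothetical binary operators, a small subset for illustration.
#[derive(Debug)]
enum Binop { Add, Mult, Lt, And }

/// Expressions over integers and booleans.
#[derive(Debug)]
enum Expr {
    Int(i64),
    Bool(bool),
    Var(String),
    Unary(Uop, Box<Expr>),
    Binary(Binop, Box<Expr>, Box<Expr>),
}

/// Statements; only declarations are modeled in this sketch.
#[derive(Debug)]
enum Stmt {
    /// Covers both `int i;` (init is None) and `int i2 = e` (init is Some).
    Decl { ty: String, name: String, init: Option<Expr> },
}

/// Hand-built AST for `int i2 = 3 * (2 + 4)`, standing in for parser output.
fn sample() -> Stmt {
    let sum = Expr::Binary(Binop::Add, Box::new(Expr::Int(2)), Box::new(Expr::Int(4)));
    let product = Expr::Binary(Binop::Mult, Box::new(Expr::Int(3)), Box::new(sum));
    Stmt::Decl { ty: "int".to_string(), name: "i2".to_string(), init: Some(product) }
}
```

Deriving `Debug` is enough to "output the AST": `println!("{:?}", sample())` prints the nested structure on one line, and `{:#?}` prints it indented.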

What it can't do/could do better:

Arrays, function calls, function definitions, record definitions, record field access, tuples, emit statements, etc
Currently, the AST is defined in one file. This made life easier when I was trying to debug the parser/AST, but at some point the definitions for each AST node should probably be moved into separate files to avoid having too many lines of code in one file and make it easier to implement, e.g., pretty printer functions.
Currently, I have a single (long) test file that I run via cargo run test/test1.txt. In the near future, it would be great to automatically run the parser on all files under the test folder, and to add tests for files that should fail to parse and throw an error instead.
Parsing errors aren't super readable (e.g. Parse failed: Error { variant: ParsingError { positives: [add, sub, mult, div, modulo, geq, leq, lt, gt, eq, neq, and, or], negatives: [] }, location: Pos(1219), line_col: Pos((77, 25)), path: None, line: "int int1 = [3 * (2 + 4)]", continued_line: None } tells me that I have forgotten a semicolon). It would be great to find a way to make these a bit more readable.
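
One possible direction for friendlier messages, sketched under assumptions: the `ParseFailure` struct below is a hypothetical stand-in that copies only the fields visible in the dump above (expected rules, line and column, offending line). It is not Pest's error type, and `render` is an invented helper. Pest's own error type also implements `Display`, so printing it with `{}` rather than `{:?}` may already get close to this.

```rust
/// Hypothetical stand-in for the fields shown in the error dump above;
/// this is not Pest's actual error type.
struct ParseFailure {
    expected: Vec<String>, // rule names the parser would have accepted
    line: usize,           // 1-based line number
    col: usize,            // 1-based column number
    source_line: String,   // text of the offending line
}

/// Render the failure as a short message with a caret under the error column.
/// Assumes ASCII source text, so one byte equals one display column.
fn render(err: &ParseFailure) -> String {
    let gutter = format!("{} | ", err.line);
    // Pad past the gutter, then up to the (1-based) column.
    let pad = " ".repeat(gutter.len() + err.col.saturating_sub(1));
    format!(
        "parse error at line {}, column {}\n{}{}\n{}^ expected one of: {}",
        err.line,
        err.col,
        gutter,
        err.source_line,
        pad,
        err.expected.join(", ")
    )
}
```

For the example above (line 77, column 25), the caret would land just past the closing `]`, which makes the missing semicolon easier to spot than the raw dump does.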

sampsyo commented 1 year ago

Wahoo!!!! Super nice work getting the parser going, @susan-garry! And figuring out the trickiness with Pest along the way!

There is one small clerical set of things we'll want to clean up before merging: some build targets got added to git here. Let's remove these things and add some of them to a .gitignore somewhere:

pollen/target/, which is Rust's build output directory
.DS_Store
any .rlib files (those are Rust libraries)

And here are some thoughts about strategy for the future, all of which deserve to happen in subsequent PRs instead of this one:

This is purely a taste thing and truly not very important, but I think concrete syntax for typed languages tends to work out quite a bit better when the type goes after the identifier (as in Rust or Go or Swift or TypeScript or Python's type annotation syntax) instead of before (as in C or Java). That means stuff like let x: Edge = ... instead of Edge x = .... Again, purely a taste thing, but I think this is nicer and ends up easier to parse too!
While you have enumerated plenty of good ideas for extensions, let's be purely "demand-driven" from here on out. That is, let's focus on writing examples and only adding features that those examples need, rather than on adding features that might be useful down the line.
Don't worry about bad type errors. They are part of life! And getting good errors is often a lot of work!! I very much think we should not try to make them good yet.
Maybe add a Turnt setup to support writing more, smaller test cases?

susan-garry commented 1 year ago

Ah, whoops! I have updated the .gitignore file and removed the extraneous files - the 1400 modified files probably should have tipped me off that something was amiss.

About our strategy for the future:

I find that I'm a bit ambivalent as to whether putting the type after the variable identifier is clearer. There are sort of three options: int i = 1;, let i: int = 1;, and i: int = 1;. If we're trying to make the language accessible to non-CS folks, then the let keyword may be a bit alien. I'm not aware of any language that picks the third option (from what I can tell based on skimming the docs, Python uses type hints in function definitions but not for variable declarations?) but perhaps it's a little more comprehensible than the first and a little more accessible than the second. Thoughts? (I know that one concern we might have is whether anyone will adopt the language and use it in practice, but if we're the ones maintaining it then we should make sure it's something we're motivated to work on).
I agree that we should start by adding the minimal possible support - of the things listed above, I think function calls, record declarations, record field access, and emit statements are the only absolutely critical components to get real examples working, like crush. To get something like node depth or node degree working, however, I think we will need to support either tuples or arrays.

sampsyo commented 1 year ago

python uses type hints in function definitions but not for variable declarations?

It also has type hints on variable declarations! So this is in fact valid Python:

i: int = 1

My general take on this is: pick any keyword you want (let, var, decl, def, const, local, nothing), but the type-after-the-identifier style is the way of the future. C & Java are old and do it the old way; Rust, Go, Swift, TypeScript, typed Python, Scala, and Kotlin are all new and learned from the mistakes of the past and do it the new way. A big advantage is that it lends itself well to adding type inference in the future, i.e., eventually supporting let x = 5 instead of let x: int = 5 if the compiler can deduce that for you without either a disruptive change to the syntax or an annoying non-type type name like C++ auto.
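
To illustrate the inference point, here is a minimal sketch of a declaration parser where the annotation is simply optional. It uses plain string splitting rather than a real grammar, and the `Decl` type, the `parse_decl` function, and the choice of `let` as the keyword are all assumptions for illustration, not pollen's actual design.

```rust
/// Hypothetical AST node for a declaration; not pollen's actual AST.
#[derive(Debug, PartialEq)]
struct Decl {
    name: String,
    ty: Option<String>, // None means the type is left for inference
    init: String,       // initializer kept as raw text in this sketch
}

/// Parse `let NAME[: TYPE] = EXPR;` using plain string splitting.
/// Returns None for anything that does not match that shape.
fn parse_decl(src: &str) -> Option<Decl> {
    let body = src.trim().strip_prefix("let ")?.strip_suffix(';')?;
    // Split at the first `=`; everything after it is the initializer.
    let (lhs, init) = body.split_once('=')?;
    // The annotation is optional: with no `:`, the whole left side is the name.
    let (name, ty) = match lhs.split_once(':') {
        Some((n, t)) => (n.trim(), Some(t.trim().to_string())),
        None => (lhs.trim(), None),
    };
    if name.is_empty() || init.trim().is_empty() {
        return None;
    }
    Some(Decl { name: name.to_string(), ty, init: init.trim().to_string() })
}
```

Both `let x: int = 5;` and `let x = 5;` go through the same code path; the only difference is whether `ty` comes back as `Some` or `None`. With the type-first style, dropping the type would leave nothing to mark the start of a declaration.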

I think function calls, record declarations, record field access, and emit statements

Certainly emit statements! That's important for producing any output. Makes sense.

Not sure about the others though: when you get a chance, maybe you could elaborate somewhere on how functions and records arise? I can imagine functions not mattering much for depth, for instance.

susan-garry commented 1 year ago

That makes a lot of sense to me - I will go with i: int = 1; unless someone has strong feelings about using a keyword like let.

Record access definitely needs to be supported because we are representing pangenomic graphs essentially as a bunch of records - if we want to compute node depth, we need to call node.steps. I cannot, off the top of my head, think of a meaningful graph query that doesn't involve this.

Aside from this, we have two basic types of output - mappings of nodes to data, and modified graphs. We will probably want to support at least one of these in the basic iteration of pollen. To support mappings of nodes to data, we probably need to support either arrays, tuples, or record definition and initialization (though not all three). To support the outputting of new graphs, we need record initialization (though not record definition), since we represent graphs using records.

Since record field access is essential either way, and we can support both types of output through records, that seems like the most logical choice for what to support beyond the emit statement.
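
As a rough illustration of the argument above, here is how a node-depth query might look if graphs are modeled as records, written in Rust rather than pollen. The `Node` and `Step` types, every field other than `steps`, and the reading of depth as "number of steps on the node" are assumptions made for this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical record for one step of a path through the graph.
struct Step {
    path: String, // name of the path this step belongs to (assumed field)
}

/// Hypothetical node record; only `steps` is taken from the discussion.
struct Node {
    id: u32,
    steps: Vec<Step>,
}

/// Node depth, taken here to be the number of steps that cross the node.
fn node_depth(node: &Node) -> usize {
    node.steps.len()
}

/// A "mapping of nodes to data": node id -> depth.
fn depth_table(nodes: &[Node]) -> HashMap<u32, usize> {
    nodes.iter().map(|n| (n.id, node_depth(n))).collect()
}
```

The query body is nothing more than a field access on each record, which is why record field access comes out as the one feature every such query needs.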

EclecticGriffin commented 1 year ago

Ah, whoops! I have updated the .gitignore file and removed the extraneous files - the 1400 modified files probably should have tipped me off that something was amiss.

Random note that gitignore.io can be helpful for initial setup, especially with regard to temp files.

Also might be worth noting that git exclude (the .git/info/exclude file) also exists, which is basically just gitignore but only local (not committed). Useful if there are some files you want ignored that are specific to your dev flow. For example, I have a local_examples folder where I put Calyx files I'm trying to debug or mess with, and having it in the git exclude saves me from having to sidestep those files when adding things to a commit.

sampsyo commented 1 year ago

Cool cool; the argument for "dot syntax" to get things like node.steps makes sense to me!!

susan-garry commented 1 year ago

I've verified that make test-slow-odgi and make test-slow-flip work as expected, so I will go ahead and merge what we have so far!