Parse all modules in a project before beginning the type-checking stage

mitchmindtree commented 2 years ago

Currently, we parse and type-check each module in a sway project one at a time in a rough depth-first order.

We should separate the concerns of parsing and and type-checking into separate stages both for code clarity and to unlock solutions to many issues related to inter-module definitions and the order in which we type-check items.

From a quick scan, I believe this is a blocker (or at least a first-stage) for addressing the following:

201
409
870
1267
1557.
@emilyaherbert's recent idea w.r.t. introducing a new "collection" stage prior to type-checking (no issue yet).

Potential Solution

The existing sway_parse::Program type more closely matches what we currently conceive of as a parsed "module" in Sway. We should rename what is currently referred to as program to module. I.e. sway_parse::program becomes sway_parse::module, sway_parse::Program becomes sway_parse::Module.
Re-introduce a new sway_parse::Program type that is a small wrapper around a tree of sway_parse::Modules.
Add a high-level function the sway_parse lib.rs called parse_project that begins by calling parse_file for the root module, then parses loads and parses all submodules based on the parent module's parsed dependencies.
Refactor the type_check step in terms of the new sway_parse::Program type. More specifically, refactor the import_new_file function to import_new_dependency where rather than loading and parsing the file, we look up the pre-parsed module in the module tree.

Marking as P: Critical as this is a blocker for #409 (critical) as well as a number of other high priority issues.

canndrew commented 2 years ago

I'm not sure that the logic for hunting down files for sub-modules is something that belongs in the parser. The parser is, in my mind at least, just something that converts a block of text to an AST (where that AST represents the structure of the original text as literally as possible). The old parser already had far too much logic built into it which is why the old AST still exists and we now have another AST layer on top of it. So I'm wary of heading in that direction again. I think it's important to keep the stages of the compilation pipeline as simple and well-defined as possible.

I think the general idea of what you're proposing is right though, just that it belongs in sway-core. Consider, for instance, that if we ever decide to add rust-like macros then it won't be possible to know about sub-modules until after macro expansion.

mitchmindtree commented 2 years ago

Thanks for the input @canndrew, that makes a lot of sense.

Revising the "Proposed Solution" above with this in mind, I'm thinking we:

Keep step 1.
Change steps 2. and 3. to introduce the new parsed Program type and the parse_project function (or better name) into sway-core rather than into sway-parse.
Keep step 4.

canndrew commented 2 years ago

Yep :+1:

emilyaherbert commented 2 years ago

Currently, we parse and type-check each module in a sway project one at a time in a rough depth-first order.

We should separate the concerns of parsing and and type-checking into separate stages both for code clarity and to unlock solutions to many issues related to inter-module definitions and the order in which we type-check items.

I 100% agree with this. But, at a high level, I think that this proposal is missing a key step will create a powerful mechanism in the compiler that we can use to solve a bunch of different issues.

At a high-level I propose this design:

step 1) the parsing phase. Parsing of all modules, separately. The goal of this step is to transform simple tokens to an internal AST representation. This could be the Expression type.

step 2) the "collection" phase. This is a new phase that we would need to implement, and what it would do is this: It would gather high level information about the declarations inside of each module, without type checking the contents of these declarations. The result of this phase would be some sort of "collection context" that contains key information about struct definitions, enum definitions, function definitions, trait definitions, impl definitions and more.

step 3) the type checking phase. During this phase, the "collection context" would be passed around during type checking, such that all declarations (enum/struct/etc) have the ability to have knowledge about every other declaration. Importantly, we would need to create a new data structure to allow us to do this without dummy function bodies. Similar to the type engine, I propose a "declaration engine" that allows declarations of the type Ref that refer to declarations that will be type checked out of order.

The benefits of this design are this:

step 2 introduces a powerful mechanism that will allow us to grow the compiler in the future, adding additional layers of inference, static analysis, and optimizations
it allows us to have knowledge of all declarations, from the perspective of any one module. This would allow us to provide awesome benefits to users. For example, when using Rust in VSCode, VSCode will offer import suggestions for types as you are coding. For any one new type that you use, VSCode + Rust might have n different suggestions, all of which are modules that are yet to be imported to the module you are currently working on. There is even an auto-import feature that I personally use constantly.
This change opens the doors for the compiler to support recursive functions, natively, reducing code size
We could solve each of these issues inherently (either wholly or in part), without any temporary fixes:

emilyaherbert commented 2 years ago

To be clear @mitchmindtree, the "recent idea w.r.t. introducing a new "collection" stage prior to type-checking" idea that I mentioned previously would encapsulate the changes that you are hoping to see. By introducing the idea of a "collection" stage as a mechanism, we would be able to apply that mechanism to a variety of issues.

sezna commented 2 years ago

This sounds interesting and could be a really valuable addition! A few questions for my own clarity:

We currently parse things and send around untyped symbols (the "collection" phase). Are you suggesting we move the parsing of dep statements earlier in the compiler so we have access to all of them?
We currently use node_dependencies to order our "collection" for type checking. Are you suggesting that would become uneccessary? And if so, how?
it allows us to have knowledge of all declarations, from the perspective of any one module. This would allow us to provide awesome benefits to users.

We already have this via the Root namespace. We could suggest auto-imports via that namespace currently.

I think there may be some issues here that are misrepresented, or that I'm understanding incorrectly. I went through them briefly and maybe you can point out where I'm missing something, as I'm not 100% seeing the link yet:

65 is related to codegen, not dependency resolution. We don't have the notion of a function stack.
76 used to be implemented but had a bug so was reverted, and would still require a manual check. I'm not sure how this change would alter that.
201 we would still need a manual check right? We can do this currently, no?
409 This one would be solved for sure by moving the parsing of dep statements earlier so we could build a dependency map. I think that's the main value prop here.
738 -- could you elaborate on how this would impact the tracking issue for future IR? Are you suggesting we would stop inlining functions as a part of this? If so, it should be noted that that's actually also due to codegen difficulties.
862 I'm still not clear on what the original complaint is in this issue, so I can't comment. It is ambiguous to me what makes our current notion of unification/reference-based equality in the type system not "true".
870 is in node_dependencies and would still be an issue requiring a manual check.
970 is just something that hasn't had type checking implemented for it yet, not sure how a parsing change like this would impact that.
1159 same as above
1162 same as above
1163 same as above.
1163 same as above.
1261 This is actually related to the special casing done around the constant evaluation for abi casts -- we need to know more information at compile time than with typical method applications, and there's a bug in that constant evaluation during the type checking phase.
1267 is related to adding a monomorphization step, not dep parsing, and is blocked on the implementation of type inference.
1298 is just a dead_code_analysis link that's missing.
1311 is strictly a type checking bug
1325 is a feature proposal and could be implemented with the way things are right now.
1491 is an error message change and could be solved with #1261
1527 once again inlining is not a collections/parsing limitation and was rather done for ease of codegen. We can remove the inline everything approach.
1548 is related to the fact that we don't name inner methods as top-level declarations and therefore they are not ordered. This would still require implementation with the new approach as far as I can see.
1555 is related to the same inlining thing.
1557 is the inlining thing.
1584 is pretty unrelated in general, it's missing check for a recursive enum decl.

I'm reading these issues and thinking that #1557 is the core issue that you're suggesting this collecting phase would solve, but actually that's a codegen/type checking thing and has no relation to the fact that modules are parsed in a project after type checking begins. Could you elaborate on how this could impact #1557?

With regards to the titular issue, parsing all modules brought in by dep statements before the type-checking stage would definitely be an improvement. deps are black boxes right now (until they're seen). I think the solution outline is reasonable.

mitchmindtree commented 2 years ago

@emilyaherbert I'm very excited about your proposal for a collection phase between parsing and type-checking! I think it makes a lot of sense, and will not only assist with monomorphizing, but also help to resolve the order in which declarations should get type-checked across modules more generally, and probably a bunch of other useful pre-type-check-preparation stuff as you mention :)

Would you mind opening a dedicated issue for the new collection phase with all of these motivations and related issues?

While I understand there's some overlap with this issue, my original intent for this issue was to be a small, actionable, easily reviewable, first-step toward these larger refactorings that we want to make. I still think it might be worth addressing the separation of parsing and type-checking as described in this issue first, and that doing so would make the addition of a new collection phase in between just a little easier/smaller.

emilyaherbert commented 2 years ago

I am proposing that we perform a fundamental shift in how the compiler thinks of 1) ast node ordering, and 2) how the compiler performs its internal tasks.

1) Right now, the ordering of the AST is critically important for how the compiler performs its internal tasks. In my opinion, this is an anti-pattern for a compiler. The order of evaluation of the AST nodes should not be a factor of importance inside the compiler. What I am proposing is that we achieve a fundamental shift in mindset, changing the compiler such that evaluation order of AST nodes has no effect. After this change, we could theoretically randomize the order of evaluation and the output should be identical.

Are you suggesting we move the parsing of dep statements earlier in the compiler so we have access to all of them?

Yes. This proposal suggests that we parse all the Sway files involved in a project first. Then perform the "collection" phase on all files, creating a "collection context." The "collection context" would be used to perform type checking, IR stuff, code gen, etc, such that it would make the concept of evaluating the AST in the "right order" obsolete.

We currently use node_dependencies to order our "collection" for type checking. Are you suggesting that would become uneccessary? And if so, how?

Yes and no. node_dependencies would be replaced with the "collection context".

We already have this via the Root namespace. We could suggest auto-imports via that namespace currently.

Sure, but this is a difficult task when the enum that you need to import, for example, is to be evaluated at a later point in the internal AST ordering. By removing the need to internally order the AST, this task becomes easier.

I think there may be some issues here that are misrepresented, or that I'm understanding incorrectly.

At a high-level, there are several issues that I believe to be intractable without this proposal. There are also issues that could be patched or avoided given the current system. But, they must all be addressed individually, each requiring their own level of involvement with creating a new mechanism within the compiler (additional checks, dummy functions, etc). In additional to tackling the intractable issues, this proposal seeks to eliminate a wide swatch of other issues at the fundamental level, without the need to address them individually. Additionally, I believe that "special-casing" particular fixes and check to be an anti-pattern, and this proposal eliminates much of the need for "special-casing."

Let me go through the issues individually:

https://github.com/FuelLabs/sway/issues/65 is related to codegen, not dependency resolution. We don't have the notion of a function stack.

This proposal would allow us to perform type checking of recursive functions without the need for dummy functions, and would make the idea of a "function engine" (discussed on slack) significantly more simple.

i.e. imagine that the "collection phase" collects: 1) function signatures for all functions, 2) information about all traits, 3) signatures for methods for traits, 4) information about which traits are implemented for which types, across all Sway files, 5) the signatures for all of those methods from all of those traits for all of those types. All before type checking. This is information that we fundamentally do no have in the compiler during type checking as it stands currently.

https://github.com/FuelLabs/sway/issues/76 used to be implemented but had a bug so was reverted, and would still require a manual check. I'm not sure how this change would alter that.

This proposal shifts this mindset. There is no need for a check for recursive dependencies. Recursive dependencies are fully supported, out of the box.

https://github.com/FuelLabs/sway/issues/201 we would still need a manual check right? We can do this currently, no?

There is no need for a manual check. Recursive use is supported natively.

https://github.com/FuelLabs/sway/issues/409 This one would be solved for sure by moving the parsing of dep statements earlier so we could build a dependency map. I think that's the main value prop here.

This proposal encompasses this change. The "collection context" could contain information about the dependency graph, making the concept of "explicit ordering" obsolete.

https://github.com/FuelLabs/sway/issues/738 -- could you elaborate on how this would impact the tracking issue for future IR? Are you suggesting we would stop inlining functions as a part of this? If so, it should be noted that that's actually also due to codegen difficulties.

Yes, we could stop inlining functions as a result of this proposal. As I mentioned above, this proposal would help us construct a "function engine" meaning we could stop inlining functions, without the need for introducing dummy functions. Also I will note that this is not just a codegen issue https://github.com/FuelLabs/sway/issues/1557. Before codegen even occurs, type checking is currently unequipped for non-inlining of functions.

https://github.com/FuelLabs/sway/issues/862 I'm still not clear on what the original complaint is in this issue, so I can't comment. It is ambiguous to me what makes our current notion of unification/reference-based equality in the type system not "true".

The current method of monomorphization creates one copy per call to each function or method, leading to wasted compute and unneeded code size. This PR is tracking changing this to creating one copy per type usage pattern, and will need to go in after we have stopped inlining functions.

https://github.com/FuelLabs/sway/issues/870 is in node_dependencies and would still be an issue requiring a manual check.

This proposal removes the need for a manual check. Out of order impl would be supported natively.

https://github.com/FuelLabs/sway/issues/970 is just something that hasn't had type checking implemented for it yet, not sure how a parsing change like this would impact that. https://github.com/FuelLabs/sway/issues/1159 same as above https://github.com/FuelLabs/sway/issues/1162 same as above https://github.com/FuelLabs/sway/issues/1163 same as above. https://github.com/FuelLabs/sway/issues/1163 same as above.

Fundamentally, this proposal is not just a parsing change. It is a complete shift in how the compiler thinks of dep/decl/impl/etc ordering, and eliminates the need for specific ordering at all. During implementation of this shift, parsing will be affected.

Given this example:

struct Foo<T> {
  x: T
}

impl<T> Foo<T> where T: Double + Triple {
  fn math(self) -> Foo<T> {
    let a = self.x.double();
    let b = self.x.triple();
    Foo {
      x: a + b
    }
  }
}

trait Double<T> {
  fn double(self) -> T;
}

trait Triple<T> {
  fn triple(self) -> T;
}

impl Double<u64> for u64 {
  fn double(self) -> u64 {
    self + self
  }
}

impl Triple<u64> for u64 {
  fn triple(self) -> u64 {
    self + self + self
  }
}

During type checking of the impl<T> Foo<T> where T: Double + Triple block, the compiler would have an understand of all the types in the entire project that implement Double and Triple, and would be able to use that information.

The change might not be so apparent with this example. But imporantly, mutually recursive impl's would be inherently supported.

Function from abi not recognized when called from an instantiated ContractCaller #1261 This is actually related to the special casing done around the constant evaluation for abi casts -- we need to know more information at compile time than with typical method applications, and there's a bug in that constant evaluation during the type checking phase.

I don't know what specific additional information that you have in mind, but the "collection context" could help us gather that information.

https://github.com/FuelLabs/sway/issues/1267 is related to adding a monomorphization step, not dep parsing, and is blocked on the implementation of type inference.

This proposal would make the idea of a "function engine" and delayed monomorphization significantly easier.

i.e. imagine that the "collection context" collects information about which types use each generic function. If this were the case, we could perform monomorphization separately, and could save compute by avoiding non-unique monomorphization. Rust does something similar: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_monomorphize/collector/index.html

https://github.com/FuelLabs/sway/issues/1298 is just a dead_code_analysis link that's missing.

This definitely could be patched with the current system. However it would still be affected by this proposal.

https://github.com/FuelLabs/sway/issues/1311 is strictly a type checking bug

This proposal means that this check could happen before type checking, if we wanted to do that. Potentially saving compute.

https://github.com/FuelLabs/sway/issues/1325 is a feature proposal and could be implemented with the way things are right now.

Currently, type checking is conflated with dependency resolution, meaning that ripping out type inference will be a significant amount of work. This proposal lays the ground work for adding this feature in a more simple way.

https://github.com/FuelLabs/sway/issues/1491 is an error message change and could be solved with Function from abi not recognized when called from an instantiated ContractCaller #1261

This proposal would make this more simple.

https://github.com/FuelLabs/sway/issues/1527 once again inlining is not a collections/parsing limitation and was rather done for ease of codegen. We can remove the inline everything approach.

Inlining functions is currently limited by type checking https://github.com/FuelLabs/sway/issues/1557. Type checking does not currently have any mechanism for handling non-inlined functions. We chatted about this on slack, but I believe the best solution to this is the concept of a "function engine", and the concept of a function engine is made significantly easier with this proposal. Without this proposal, the function engine would rely heavily on a mechanism that accounts for functions or methods that may or may not exist. But with this proposal, during type checking when the function engine is being populated and used, it would have the knowledge of the existence of all functions and methods in the project.

https://github.com/FuelLabs/sway/issues/1548 is related to the fact that we don't name inner methods as top-level declarations and therefore they are not ordered. This would still require implementation with the new approach as far as I can see.

Yes, but it would be made easier with this proposal.

https://github.com/FuelLabs/sway/issues/1555 is related to the same inlining thing.

This would be made easier with this proposal.

https://github.com/FuelLabs/sway/issues/1557 is the inlining thing.

Inlining would be significantly easier with this proposal.

https://github.com/FuelLabs/sway/issues/1584 is pretty unrelated in general, it's missing check for a recursive enum decl.

This proposal makes this issues solvable, without the need for special-casing. Currently, it would be impossible to preemptively know that this enum declaration contains a recursive element, without doing it inside type checking.

I'm reading these issues and thinking that https://github.com/FuelLabs/sway/issues/1557 is the core issue that you're suggesting this collecting phase would solve, but actually that's a codegen/type checking thing and has no relation to the fact that modules are parsed in a project after type checking begins. Could you elaborate on how this could impact https://github.com/FuelLabs/sway/issues/1557?

This is not the core issue that this proposal solves. This proposal introduces a "collection context" mechanism that has many applications. One application is that it would eliminate the concept of "dep/impl/decl ordering" from the compiler. Another application is that it would ease efforts towards function inlining and towards reducing non-unique monomorphization. Another application could be improving type inference.

emilyaherbert commented 2 years ago

While I understand there's some overlap with this issue, my original intent for this issue was to be a small, actionable, easily reviewable, first-step toward these larger refactorings that we want to make. I still think it might be worth addressing the separation of parsing and type-checking as described in this issue first, and that doing so would make the addition of a new collection phase in between just a little easier/smaller.

SGTM @mitchmindtree !

FuelLabs / sway

Parse all modules in a project before beginning the type-checking stage #1578

201

409

870

1267

1557.

Potential Solution

65 is related to codegen, not dependency resolution. We don't have the notion of a function stack.

76 used to be implemented but had a bug so was reverted, and would still require a manual check. I'm not sure how this change would alter that.

201 we would still need a manual check right? We can do this currently, no?

409 This one would be solved for sure by moving the parsing of `dep` statements earlier so we could build a dependency map. I think that's the main value prop here.

738 -- could you elaborate on how this would impact the tracking issue for future IR? Are you suggesting we would stop inlining functions as a part of this? If so, it should be noted that that's actually also due to codegen difficulties.

862 I'm still not clear on what the original complaint is in this issue, so I can't comment. It is ambiguous to me what makes our current notion of unification/reference-based equality in the type system not "true".

870 is in `node_dependencies` and would still be an issue requiring a manual check.

970 is just something that hasn't had type checking implemented for it yet, not sure how a parsing change like this would impact that.

1159 same as above

1162 same as above

1163 same as above.

1163 same as above.

1261 This is actually related to the special casing done around the constant evaluation for `abi` casts -- we need to know more information at compile time than with typical method applications, and there's a bug in that constant evaluation during the type checking phase.

1267 is related to adding a monomorphization step, not dep parsing, and is blocked on the implementation of type inference.

1298 is just a `dead_code_analysis` link that's missing.

1311 is strictly a type checking bug

1325 is a feature proposal and could be implemented with the way things are right now.

1491 is an error message change and could be solved with #1261

1527 once again inlining is not a collections/parsing limitation and was rather done for ease of codegen. We can remove the inline everything approach.

1548 is related to the fact that we don't name inner methods as top-level declarations and therefore they are not ordered. This would still require implementation with the new approach as far as I can see.

1555 is related to the same inlining thing.

1557 is the inlining thing.

1584 is pretty unrelated in general, it's missing check for a recursive enum decl.