NomicFoundation / slang

Solidity compiler tooling by @NomicFoundation
https://nomicfoundation.github.io/slang/
MIT License
235 stars 21 forks source link

challenges of working with the Solidity AST #1099

Open 0xalpharush opened 1 week ago

0xalpharush commented 1 week ago

Slang has the opportunity to improve from solc's mistakes and limitations to enable the next generation of analysis frameworks. What follows is a brief list of edge cases and nuisances I recall encountering while working on Slither that I can think of off-hand. It may be useful to consult the issues and PR's linked (or searching the github tracker) in addition to using test cases from Slither that have been compiled to fix challenging issues in supporting different versions or quirky cases.

Tuple expressions are difficult to work with and not straightforward

It would be nice if Slang represented tuples as fully typed with their indices so that there is no need for record keeping of initialization or elided variables.

There are 2 cases we need to handle: 1) Initializing multiple vars with multiple individual values:

                    var (a, b) = (1,2);

In this case, we convert to the following code:

                    var a = 1;
                    var b = 2;

2) Initializing multiple vars with any other method (function call, skipped args, etc):

                    var (a, , c) = f();

In this case, we convert to the following code:

                    var a;
                    var c;
                    (a, , c) = f();

Changes in syntax that introduce ambiguity

For example, between versions .value() is either a builtin/intrinsic (.value() member pre 0.7 https://github.com/crytic/slither/issues/1923) or a function call thereafter. Unifying the call options across versions would be nice as well (value, salt, etc)

Missing control flow information in the AST

Logical operators like && and || as well as ternary operators do not reflect their laziness/ eagerness or short-circuiting behavior. Personally, I would like to see these lifted to more a structured control flow that can be uniformly analyzed such as an if-else statement.

Additionally, implicit named returns are not reflected in the AST and one must analyze where the exit points are in the CFG and add a return node artificially. (https://github.com/crytic/slither/pull/1880)

Ambiguities caused by explicit arguments to events, structs, functions, etc

Solidity supports explicit passing arguments as {b: ... c: ..., a: ...} in addition to the more common implicit, positional option (a, b, c). Similar to tuples this can cause issues and requires additional record keeping that may not be in the AST (https://github.com/crytic/slither/pull/1949)

Lack of info on truncation/casting behavior in Solidity AST

For instance, across versions the shifting behavior changes and it's totally implicit (https://docs.soliditylang.org/en/latest/070-breaking-changes.html). Additionally, the type conversion rules for constants are implicit and it's not always clear what type the frontend considers literals and what type conversion rules it follows to perform typecasts (https://github.com/crytic/slither/issues/1931, https://github.com/crytic/slither/issues/1688). Representing every type cast as an operation in the AST would be valuable.

Import/export of using for directives and aliases causes ambiguities

Unlike other definitions, solidity does not report the bindings of user defined operators to types (notably global ones) in its exportedSymbols field of the AST. This makes it difficult to track which references are in scope (directly or transitively). I'm sure this could be improved in Slither but older versions of solc had unreliable reference ID's. (https://github.com/crytic/slither/blob/2c792b2b73c6c1fbbf5464bd1f9fc8ccedf0c0bf/slither/core/scope/scope.py#L69-L71)

Additionally, aliased imports further complicate resolving references as an aliased import can be imported and aliased again. Slither performs a late lookup that is not ideal to be able to find symbols. Likely, it could use reference ID's but I'm not sure if there's limitations for older versions... (https://github.com/crytic/slither/pull/2166) https://github.com/crytic/slither/blob/2c792b2b73c6c1fbbf5464bd1f9fc8ccedf0c0bf/slither/solc_parsing/expressions/find_variable.py#L141-L144 https://github.com/crytic/slither/blob/2c792b2b73c6c1fbbf5464bd1f9fc8ccedf0c0bf/slither/visitors/slithir/expression_to_slithir.py#L602-L617

Cyclic imports and top level definitions

Solidity allows cycles in the import graph and this can make it difficult to determine the order to parse AST nodes as there's no topological ordering. Slither has had to incrementally figure out the right order (often new versions had very little complex code available until the behavior was clarified by developers writing more). For example, top level errors can be referenced that have not been analyzed yet and thus their signature is not known yet https://github.com/crytic/slither/blob/2c792b2b73c6c1fbbf5464bd1f9fc8ccedf0c0bf/slither/solc_parsing/expressions/find_variable.py#L149-L165

Making sure scoping rules are accurate

Backporting fixes

Solc has not backported fixes to my knowledge and this makes it difficult as a third-party tool dev when user's are stuck on a version with a bug. Ideally, fixes would be backported

Many of the challenges are probably mitigated by having a really great reference-declaration resolver and API. Some of the issues are probably poor abstractions in Slither that were hard to change while maintaining backward compatibility and supporting solc's legacy AST, so it may be that many of these are not applicable to Slang. As a principle, any semantic info that would be needed to write a compiler or reference interpreter for the langauge should be included in the AST (https://www.youtube.com/watch?v=rBfSs2yipoE&t=1364s). I hope this is useful and best of luck!

OmarTawfik commented 5 days ago

Thank you so much for the feedback! this is really helpful!

Tuple expressions are difficult to work with and not straightforward

Ambiguities caused by explicit arguments to events, structs, functions, etc

The Slang CST/AST matches the input source code 1:1, without any modifications, so tuples/arguments/etc.. will keep the original indices as they exist in input. In the future, when we ship the type system, it will have direct links to the same syntax nodes, along with their locations, so that users don't have to build that again. Here are examples of the CST nodes produced for tuples today: [1] [2].

Changes in syntax that introduce ambiguity

Our CST is designed to match the source 100%. So while future releases of Slang might add/change some syntax elements, each single release of Slang is designed to represent the entirety of Solidity, across all Solidity language versions. Therefore, Slang users need to worry about one set of types/inputs, and expect them to be stable regardless of which version of Solidity they are analyzing. If Solidity syntax itself changes, then both old/new syntaxes will be represented in our tree, and both will exist in our types/API, so that your code can handle both.

Moving forward, as we work on our backend/type system, these version differences will indeed be unified into a single API. For example, named functions, unnamed functions, receive/fallback functions, and other function variations can be unified into a single interface, regardless of whether you are analyzing old or new Solidity syntax.

Import/export of using for directives and aliases causes ambiguities

Our Bindings API (in progress) allows using cursors (pointers to specific tree nodes) to resolve references or definitions, and navigate that graph freely. Cursors will have explicit source locations, and not numeric IDs, and allow bi-directional lookups.

Cyclic imports and top level definitions

Our Compilation API (in progress) allows building the entire project from any node, and we will crawl all import statements and load imported files transitively, until the entire graph is built. It will be able to detect cycles and handle them.

Missing control flow information in the AST

Lack of info on truncation/casting behavior in Solidity AST

Making sure scoping rules are accurate

Backporting fixes

These are valuable pieces of feedback, and we will definitely consider when completing our backend. Thank you!