INRIA / spoon

Spoon is a metaprogramming library to analyze and transform Java source code. :spoon: is made with :heart:, :beers: and :sparkles:. It parses source files to build a well-designed AST with powerful analysis and transformation API.
http://spoon.gforge.inria.fr/
Other
1.71k stars 344 forks source link

Discussion: Redesigning the sniper printer #3794

Open slarse opened 3 years ago

slarse commented 3 years ago

Hi!

Over the past few months, I've been working out kinks and bugs in the sniper printer quite frequently. It's a very cool piece of software that does some impressive things, but I have noted several problems with the design that, in my opinion, cannot be overcome with the current architecture. I'd therefore like to discuss a complete redesign.

For context, I'll first provide a brief description of what the sniper printer does and how it does it, then I'll move on to the problems, and finally my proposed solution.

What is the purpose of the sniper printer?

The sniper printer provides a means to parse source code into a Spoon AST, perform some transformations, and then print it back to file with minimal formatting changes to non-transformed parts. In the literature, this has been referred to as high-fidelity code transformation.

For applications where the output of a Spoon transformation is actually stored as production code, such as for a merge tool or an automatic repair tool, retaining formatting is absolutely critical. Imagine getting a pull request that fixes some static analysis warning, and also reformats the entire file containing the warning. That probably doesn't get accepted.

How does the sniper do high-fidelity printing?

The very short explanation is this: Alongside the Spoon AST, we've got a tree of source fragments. The SniperJavaPrettyPrinter (sniper) extends the DefaultJavaPrettyPrinter (DJPP), and whenever DJPP wants to print something, the sniper searches in the source fragments for something that matches what the DJPP wants to print. The sniper maintains a stack of contexts, which lets it know precisely where to search for the next fragment.

Why do we need to redesign the sniper?

The current design, as cool as it is and as well as it works most of the time, has three serious flaws. The first and largest flaw is that I don't believe it can ever work perfectly, by nature of its design. Essentially, we do transformations on the AST without regards to formatting, and then the sniper tries to recreate the formatting after the transformations have been applied. The more transformations that are applied, the less likely the sniper is to be able to match the source fragments correctly. The largest problems are with the small details, especially parentheses and whitespace, which often cause issues.

The second flaw is that the architecture is incredibly complex. The interactions between the DJPP and the sniper's extensions of it are very intricate, and debugging cases where the sniper prints something just a bit strangely can be very difficult.

The third flaw is that the sniper's dependence on a printing context. This can make it difficult to start printing anywhere in the source tree, and fails miserably if you combine different Spoon ASTs into one (e.g. what happens in a merge tool).

Solutions?

I think we should draw inspiration from other works here. For example, the IntelliJ platform has the Program Structure Interface, which contains formatting information alongside AST-like information. Another work used what they refer to as a _Literal-Layout AST (LL-AST), which is simply an AST augmented with formatting nodes (e.g. whitespace).

Obviously, we can't turn the Spoon AST into an LL-AST, that would be unworkable for users. What I suggest is to maintain an LL-AST in the background, such that any transformation on the Spoon AST is also applied to the LL-AST. Then, the sniper printer would simply print exactly what's in the LL-AST.

The obvious upsides of this is that it makes printing rather trivial, and the complexity is now moved to the transformations. I think that will be a lot easier to reason about and debug. I also think that it is a workable solution for making the sniper printing as close to perfect as possible.

The obvious downside of this is potential performance overhead, but there is already significant overhead in the current sniper printer. Although I do believe that maintaining a separate LL-AST would increase that overhead, it's entirely opt-in from a user perspective (don't use the high-fidelity mode, don't get the performance overhead).

Does anyone have any thoughts on this? My intention is to start prototyping the "backing LL-AST idea" in the coming weeks to see if it's a workable solution in practice, but before that I'd really like to discuss this with some bright minds. Maybe I've missed something :)

slarse commented 3 years ago

I've thought about maintaining an LL-AST in the background some more, and come to the following insights.

Deletions of elements should be very straightforward: simply delete the corresponding subtree in the LL-AST.

Insertions should be fairly straightforward. On an insertion, pretty-print from the root of the inserted subtree. It should be relatively straightforward to map the print actions to the correct nodes in the "regular" AST by using the "enter" and "exit" methods of the pretty-printer, and so it should not be too hard to build the LL-AST subtree. I'm sure there will be some complications, but I don't think anything insurmountable should appear here.

Modifications to existing nodes (e.g. renaming a method or inserting a modifier on a field) is where I think the bulk of the work has to be made. I'm hoping to make some clever use of CtRole here, but I'm relatively certain that there will be difficulties here.

jrfaller commented 3 years ago

Hi @slarse!

I'm coming here since it a better place to discuss than our long email thread.

First thanks for the analysis of prior art, it is clearly very handy to have a structure with everything in it such as the LL-AST.

Maybe a dumb question for starters but I am asking it anyway: why would it be so bad to have that directly in Spoon main's representation? They could be hidden to the user by default, but still present? Or there are various technical reasons making this impossible?

I ask that because from experience, I find it hard to keep consistency between complex structures, something that will be required to do with your proposal of background LL-AST, so if we could avoid it it would be great! If we cannot, I strongly suggest that one of the two structures is automatically derived from the other (presumably the AST from the LL-AST).

Then for the transformations, I am not sure any operation is easy in fact because it depends on how we can ventilate the whitespaces into the LL-AST.

I am trying a sample LL-AST for a simple snippet (for the sake of simplicity I omit terminal nodes, retaining only literals, sorry if it is imprecise w.r.t. to which node gets the spaces ๐Ÿ˜„ ).

Code

{
  int i = 0;
  i++;
}

LL-AST

Block#1
  {
  '\n\t'
  Statement#2
    int
    ' '
    i
    ' '
    =
    0
    ;
    '\n'
  Statement#3
    '\t'
    i
    ++
    ;
    '\n'
  }
  '\n'

Here for instance we have the first tabulation that went to the block node, so perhaps a deletion of Statement#2 will induce Statement#3 having two tabs upfront. If the tabs go into Statement#2 instead of Block#1, an insertion of a statement at the beginning of the block might lead to this statement not have a leading tab?

So maybe a first but important point is to have "rules" to decide which AST nodes get which whitespaces that ressembles the intent of developers (here for instance the tab is much more likely to belong to the statement than to the block whereas the newline is more a thing of the block).

Another point is maybe to have default rules to "decorate" programmatically inserted AST elements (for instance a new statement inside a block gets a tab at the beginning and a newline at the end.

A corollary of these two ideas is that the rule mechanism needs to be easy to extend and customize in order for users to tailor the transformations to their code (and possibly infer rules by seeing existing code, for instance the code before transformation).

Here is my two cent opinion ๐Ÿ˜ธ

Cheers.

slarse commented 3 years ago

Hi @jrfaller! :)

Just for the sake of not saying "the redesigned sniper printer", let's call it the LL-printer for further discussion :)

Maybe a dumb question for starters but I am asking it anyway: why would it be so bad to have that directly in Spoon main's representation? They could be hidden to the user by default, but still present? Or there are various technical reasons making this impossible?

Not at all a dumb question. The thing is that users of Spoon operate directly on the metamodel, and so it's really hard to introduce something into the metamodel and keep it hidden from users. Other than that, I see two primary reasons for why we'd want at least a layer of separation from the primary AST.

  1. Changing the metamodel is a huge commitment. This idea may not work out in the end, and so we definitely don't want traces in the public API.
    • Potential workaround (that I absolutely abuse in Spork): Put formatting data as metadata on the Spoon nodes.
  2. We don't want to incur the performance (and potentially: bug) overhead of maintaining formatting information if we don't have to. That is to say, if the LL-printer isn't used, we don't collect and maintain the formatting. Therefore, the formatting information may not always be present, or we'll have to synthesize it when building the AST (also, performance overhead).

I ask that because from experience, I find it hard to keep consistency between complex structures, something that will be required to do with your proposal of background LL-AST, so if we could avoid it it would be great! If we cannot, I strongly suggest that one of the two structures is automatically derived from the other (presumably the AST from the LL-AST).

I agree, if we could avoid maintaining two data structures, that'd be great. I don't think we can avoid maintaining two separate structures without literally putting formatting into the AST as proper nodes, which, again, would be terribly difficult to hide from users. But we may be able to "tack something onto" the Spoon metamodel, without actually changing it (see the final paragraph of this very long comment).

So maybe a first but important point is to have "rules" to decide which AST nodes get which whitespaces that ressembles the intent of developers (here for instance the tab is much more likely to belong to the statement than to the block whereas the newline is more a thing of the block).

Yes, this is necessary. The sniper already (kind of) has such rules. Typically, the whitespace that precedes any given element is attributed to that element. From your mini-model, that would entail e.g. that \n\t' would belong toStatement#2`. There's just one inconvenient problem with that, namely, which element does the final newline in the block belong to?

The same kind of rule also makes sense for separators (e.g. , and .) as Java does not allow trailing separators, with the sole exception of ; which is always trailing.

Another point is maybe to have default rules to "decorate" programmatically inserted AST elements (for instance a new statement inside a block gets a tab at the beginning and a newline at the end. A corollary of these two ideas is that the rule mechanism needs to be easy to extend and customize in order for users to tailor the transformations to their code (and possibly infer rules by seeing existing code, for instance the code before transformation).

The default printer already has rules for this, we'll use those. Any extensions to those default printing rules I think is somewhat outside the scope of the LL-printer, and would be better suited as an augmentation of the default printer. Essentially, the idea here is that if we have original source code, we print that. If we don't, we pretty-print with the default printer.

There's one exception: we may want to implement some form of "formatting inference" (the paper I linked that coined the word LL-AST have that). For example, if we have a method formatted like so:

public int add(
    int a,
    int b,
    int c) { ... }

and we add a parameter int d, it would be feasible to infer that it should be added with a preceding \n\t rather than just a space. But that's just a note for the future, the task at hand is to get something to work :D

As for a final concrete idea I had after reading your comment: yes, maintaining a complete, separate LL-AST is probably going to be too much work. What we could do, is to maintain formatting on a per-node basis, and keep it in metadata for the time being. The formatting data between proper AST nodes should be pretty easy to maintain: it will just exist attached to the node. The hard part is maintaining formatting data for the stuff that aren't proper AST nodes, i.e. the "content" of the nodes. The upside is that we can implement and experiment with such an approach on a per-node basis. That is to say, if we find formatting data on a node, we use that, otherwise we use the default printer.

WDYT?

jrfaller commented 3 years ago

Thanks for the insightful answer. I 100% agree with your proposal to use dedicated metadata, that will allow us to not mess the Spoon metamodel!

After having thought a little more on this I propose a simple three part metadata information relative to syntax:

WDYT?

monperrus commented 3 years ago

FTR, sniper pretty-printing, aka high-fidelity pretty-printing, is a notoriously hard problem, see papers referred to in https://github.com/INRIA/spoon/issues/1284

jrfaller commented 3 years ago

Thanks for the pointers @monperrus !

slarse commented 3 years ago

FTR, sniper pretty-printing, aka high-fidelity pretty-printing, is a notoriously hard problem, see papers referred to in #1284

Very hard indeed, but all more or less successful applications I can find do retain and transform formatting information. Most of the papers referenced in #1284 seem to do so. So, I very much believe that's the way to go!

  • leading layout (layout appended before the node)
  • trailing layout (layout appended after the node)

That sounds like a reasonable way to go. If possible, it would be best to just have one of those, but as I mentioned with corner cases before, there'll likely be exceptions where that is not possible. We may also want to somehow distinguish between whitespace layout, which in general carries no meaning, and non-whitespace layout (such as commas and semicolons).

separating layout (this one is interesting and might take care of a lot of problems: layout appended between each direct child of the node, for instance arrays, method parameters, ...)

This one I'm not sure about yet, it will require some experimentation. For example, in add(a, b), the , might be considered separating layout of add, or it might be considered leading layout of b. There are problems with both. Considering it leading layout of b poses a problem if a is removed, and considering it separating layout between a and b will become a bit of a headache if either argument is removed or modified.

Just off the top of my head, if we're not building a separate LL-AST, I think it will be much easier to only have leading and trailing layout to the extent that's possible, and then have hard-coded exceptions, e.g. that we don't print certain characters of leading layout in certain cases (such as a , in the leading layout to the first element of an element list).

We will still need separating layout for non-AST elements, such as modifiers. This is where I think it'll all get really difficult. But I think that right now, I really just need to start trying stuff out to get a better idea of where the pain points are.

jrfaller commented 3 years ago

This one I'm not sure about yet, it will require some experimentation. For example, in add(a, b), the , might be considered separating layout of add, or it might be considered leading layout of b. There are problems with both. Considering it leading layout of b poses a problem if a is removed, and considering it separating layout between a and b will become a bit of a headache if either argument is removed or modified.

This is funny because it was exactly the case where I thought it would simplify things. If we manage to get the information that the separating layout for the parameter list node is , (granted, I'm not sure this node really exists in Spoon metamodel ๐Ÿ˜„ since you have a virtual node for this), then insertion or deletion of parameters is mindless, because as soon as the node has children we have the following printing scheme for instance (P (a b)) we have lead(P) lead(a) a trail(a) sep(P) lead(b) b trail(b) trail(P) and if I'm not mistaking, it works whatever the number of parameters is? Maybe the rewriting rule do not work with all use cases though? Also, I agree that this is challenging to make the difference between leading / trailing layout and separating layout. But the scheme seems to work in some challenging cases, for instance when you have fine tuned parameter alignment like this:

foo(int a,
    int b,
    int c)

In this case separating layout is a comma, a new line and four spaces ๐Ÿ˜„

It do not work however for cases where you split the parameters declaration at an arbitrary point like

foo(int a, int b,
    int c)

Here we will have separating layout at best , and then a leading layout of four spaces before int c, and the style will probably be lost if the parameter list is modified too much. This one is frequent and is a headache because the break points are often guided by the length of the line.

slarse commented 3 years ago

I have a feeling that this will work great in most cases, but where it doesn't it'll be all but impossible to get it to function like we want. Like your final example there. That's pretty much the case with the current sniper: it works great in most cases, but when it doesn't it's almost impossible to fix it. The whole idea requires complete regularity in the (local) code, which is often there, but not always.

And no, there's no AST node for parameter lists, it's just a list baked into the executable node :)

Identifying a separating layout will however be useful for when we modify lists to insert new nodes with approximately the same formatting. It's possible that we can make "separating layout" work well enough where it's worth using to make modifications follow the general style of the code.

slarse commented 3 years ago

But it also depends on what we make the separating layout. It would make some amount of sense to have the separating layout be just ,, and then in the case of this:

foo(int a, int b,
    int c)

we also have lead(b) = _ and lead(c) = \n____ (_ represents space)

slarse commented 3 years ago

Regardless, the first challenge is to even be able to extract the layout and "attach" it to the correct nodes. I'm looking at how we can adapt what's already there in the sniper.

jrfaller commented 3 years ago

Yes I agree! I think that having the AST as a first step helps a lot.

Then I don't know exactly if we have to focus only on whitespaces or also to search for some symbols that are stripped in the AST (e.g. ; or , or even keywords such as class). Thinking about this, maybe it would be interesting to have separate metadata for whitespace (AKA layout) and literals like they do in the LL-AST paper, WDYT?

slarse commented 3 years ago

To do the printing "properly", we'll need other symbols as well, especially parentheses.

I still haven't had the time to start experimenting with this, but I expect to be able to toward the end of this week.

slarse commented 3 years ago

That expectancy was waaay off, got a bit bogged down with a paper deadline and work on Sorald :).

In any case, I've given this additional thought. The key functional goal that I'm looking for here is to be able to sniper-print an AST that's cobbled together from multiple sources (even multiple models). This essentially requires printing of any given node to be independent of its parent, which is currently not the case (the darned context).

We currently have two projects that need this functionality (that I'm aware of): Spork (combines three ASTs, from as many models) and diffmin (combines two ASTs, from as many models). If we can reimplement the printer such that these projects can use it, and maintain the current accuracy, that's a massive improvement. If we can improve the printer as a whole, that's even better.

As I also have a need to print conflict nodes, I'm probably going to want to build in support for some kind of generalized override mechanism, that's easily accessible.

I'm considering starting to tinker with these concepts over in Spork instead of here in the Spoon repo, as over there I already have tons of cobbled-together AST scenarios to test the printer on.

jrfaller commented 3 years ago

It seems a good idea!

slarse commented 3 years ago

For the record, we've decided to devote some amount of resources on this project, as sniper printing is a cornerstone of AST-based tools that need to go back to actual production source code.

I tried to get started a li'l bit but felt like it was more or less a fumble, so I'm going to take a step back and study related implementations more closely first. I'm currently focusing on the IntelliJ platform, which likely has the most battle-tested transformation engine out there. Eclipse will also be worth looking into.

I'm focusing on the implementations that actually work well in practice, rather than the one-off research publications.

slarse commented 3 years ago

Looking a little bit more at IntelliJ's PSI, here's one very important reason why we can't adopt their precise methodology

The PSI modification methods do not restrict you in the way you can build the resulting tree structure. For example, when working with a Java class, you can add a for statement as a direct child of a PsiMethod element, even though the Java parser will never produce such a structure (the for statement will always be a child of the PsiCodeBlock) representing the method body). Modifications that produce incorrect tree structures may appear to work, but they will lead to problems and exceptions later. Therefore, you always need to ensure that the structure you built with PSI modification operations is the same as what the parser would produce when parsing the code that you've created.

jrfaller commented 3 years ago

It does not look useful in our use-case, indeed.

slarse commented 3 years ago

I've dug a bit deeper into the internals of the IntelliJ platform. As far as I understand, a refactoring works something like this:

  1. Mutate the PSI. There are various ways to do this, one common way being to generate a PSI from a (text) code snippet and then inserting it somewhere.
  2. For each modification, a change listener converts the updated PSI element to text and makes note of where the changed element should go (the source position/text span).
  3. The document is then synchronized by way of writing the updated text to the precise place it needs to be, displacing any succeeding text but leaving preceding text untouched.

So, in essence, IntelliJ only writes what it absolutely needs to write. Also, when actually printing to a file, there's no actual pretty-printing going on, only precise insertions and deletions based on text position. Of course, this is probably much for optimization reasons, and I do think that it would be possible for IntelliJ to write the entire file with perfect preservation due to the formatting being preserved in the PSI. But, the less we pretty-print, the less chance there is for error. It might be worth experimenting with this kind of approach.

monperrus commented 3 years ago

That reminds me of an old idea we have discussed several times: reimplementing the tree builder on top of JavaC instead of JDT (*). In theory, we should be able to do so while keeping the whole Spoon test suite green to verify that we don't break anything.

Now, let's apply this idea to PSI: we could reimplement the tree builder on top of IntelliJ's PSI (think PSITreeBuilder). By doing so, we may be able to provide full sniper mode by delegating to PSI.

WDYT?

(*) Why would we do that? Because the JavaC community seems more active.

slarse commented 3 years ago

Now, let's apply this idea to PSI: we could reimplement the tree builder on top of IntelliJ's PSI (think PSITreeBuilder). By doing so, we may be able to provide full sniper mode by delegating to PSI.

The PSI (both the API and implementations of it) is part of the IntelliJ platform, and I don't know of any way to get only that part. I think we would need to vendor (include in Spoon) large parts of the IntelliJ platform. It should be possible to get only the Java-specific parts, as those are implemented as a plugin, but I really think it would be rather laborious.

By doing so, we may be able to provide full sniper mode by delegating to PSI.

Not without significant effort. Even if we were able to extract the PSI and pretty-printing parts of IntelliJ in a reasonable manner, we'd then need to enact all modifications both on the Spoon AST, and on the PSI, in order to be able to use it as you suggest. The PSI is pretty laborious to modify, and so I think this is not a good idea.

To summarize, it's probably possible to build on top if IntelliJ, but I think it would be much more work than we have the manpower to perform. I'm also uncertain of the benefits. The PSI is very complicated, and does a lot of things that we don't need (for example by being language-agnostic).

slarse commented 3 years ago

To add to what I wrote above, all tools that I've seen that use the PSI have been created as plugins for IntelliJ, rather than standalone applications.

matthieu-vergne commented 2 months ago

Hi, Currently looking at a way to automate refactoring actions on code. I am already playing with JavaParser and I am currently looking for other alternatives, like Spoon. Just starting from the beginning there, and already facing the issue that when I parse my code and print it without any change, the format is not preserved despite using the SniperJavaPrettyPrinter:

class MyClass{} // input
class MyClass {} // output with additional space

So I wonder what is the current state of this discussion and, more generally, of the high fidelity printing in Spoon?