Adding autoformatting - Githubissues

VHDL-LS / rust_hdl

Other

329 stars 64 forks source link

Adding autoformatting #296

Closed KarelPeeters closed 1 month ago

KarelPeeters commented 4 months ago

Intro

I'm investigating how hard it would be to add automating as a feature, as part of https://github.com/VHDL-LS/rust_hdl/issues/156. This project already has a parser, now we "only" need to add the other direction :slightly_smiling_face:.

Inspiration can be taken from rustfmt, notably Design.md and the more up-to-date Contributing.md.

Current state

I noticed the file display.rs has implementations for some, but not all, AST nodes. As far as I can tell statements and anything that would require them (up to and including AnyDesignUnit::Secondary) does not yet have a Display implementation. Is the intention that this gets extended to cover the entire AST?

Some limitations of the current design:

Indentation is hard-coded per AST node type, which does not extend to arbitrarily nested expressions.
Parenthesis insertion is decided based on precedence levels, potentially overriding user intent.
Comments are dropped, these need to be preserved.
There is no line wrapping, and it's not obvious how to add this.

Future

Some decisions need to be made:

Should the formatter be whitespace only or more all-encompassing (including line formatting and blank lines) like rustfmt?
- I think white-space only is much easier to implement, since we don't need to make any complicated line-breaking decisions.
How much information do we want to add to the AST to enable formatting? For example, comments can be handled in two ways: either add them to the AST and then emit them again, or keep them in a separate data structure and re-insert them during stringifying. The same is true for line breaks.

I'll do some more research and maybe start on a prototype, I just wanted to start the discussion as early as possible.

kraigher commented 4 months ago

First of all the Display trait is a dead end for formatting. To do proper formatting you need to have a context such as the current indentation level and probably also settings.

How much information do we want to add to the AST to enable formatting? For example, comments can be handled in two ways: either add them to the AST and then emit them again, or keep them in a separate data structure and re-insert them during stringifying. The same is true for line breaks.

The current architecture has been to not include comments in the AST but rather have each AST be aware of the span of Tokens it consists of. The comments can be extracted from the token span and be kept in this way. Each Token is associated with either leading or trailing comments. The comments are stored in the tokens and are not tokens themselves.

This separation between AST and Tokens is to keep the different layers of the architecture simpler and focused om their role. To have all information in one big structure would be very complex and also all information is not availablein a single pass. VHDL-LS has three clear layers:

Tokens, the result of tokenization
AST, the result of parsing
Named entities, the result of semantic analys (type checking etc)

Layers willhave references to each other, for example AST will reference a span of tokens. AST will also reference named entities which are either decarations or uses of for example signals, entities etc.

Should the formatter be whitespace only or more all-encompassing (including line formatting and blank lines) like rustfmt?

Rust format is a good guiding star to lead the way, its a very nice tool. Of course you got to start somewhere and it will not be as good as rustfmt in the first iterations.

Schottkyc137 commented 4 months ago

Just want to add that I am very interested in pushing this forward. Should anyone need any help (and I know that implementing such capabilities can mean a lot of work), feel free to contact me

KarelPeeters commented 4 months ago

Thanks for the feedback @kraigher and for the offer @Schottkyc137. I'm interested in working on this, I'll start doing some initial experiments and see if I get stuck anywhere.

Schottkyc137 commented 4 months ago

First of all the Display trait is a dead end for formatting. To do proper formatting you need to have a context such as the current indentation level and probably also settings.

How much information do we want to add to the AST to enable formatting? For example, comments can be handled in two ways: either add them to the AST and then emit them again, or keep them in a separate data structure and re-insert them during stringifying. The same is true for line breaks.

The current architecture has been to not include comments in the AST but rather have each AST be aware of the span of Tokens it consists of. The comments can be extracted from the token span and be kept in this way. Each Token is associated with either leading or trailing comments. The comments are stored in the tokens and are not tokens themselves.

This separation between AST and Tokens is to keep the different layers of the architecture simpler and focused om their role. To have all information in one big structure would be very complex and also all information is not availablein a single pass. VHDL-LS has three clear layers:

Tokens, the result of tokenization

AST, the result of parsing

Named entities, the result of semantic analys (type checking etc)

Layers willhave references to each other, for example AST will reference a span of tokens. AST will also reference named entities which are either decarations or uses of for example signals, entities etc.

Should the formatter be whitespace only or more all-encompassing (including line formatting and blank lines) like rustfmt?

Rust format is a good guiding star to lead the way, its a very nice tool. Of course you got to start somewhere and it will not be as good as rustfmt in the first iterations.

One thing that currently hinders a straightforward implementation of such features is that tokens are not always part of the AST. Instead, some AST elements are enriched with the source code information. Do you think that, in the long run, this source code information should entirely be replaced with the token information? This is a pretty significant change which encompasses the whole stack, making the implementation fairly difficult. Is there a better way?

kraigher commented 4 months ago

One thing that currently hinders a straightforward implementation of such features is that tokens are not always part of the AST. Instead, some AST elements are enriched with the source code information. Do you think that, in the long run, this source code information should entirely be replaced with the token information? This is a pretty significant change which encompasses the whole stack, making the implementation fairly difficult. Is there a better way?

Yes I think all AST nodes should have information on which token span they come from in the form of 32-bit indexes into the tokens of the file. In most cases this would mean you could remove the WithPos etc from AST nodes.

Sure it will be a lot of grinding to achive this but I do not see a better way forward. Whay would be the alternative? Somehow adding comments directly to the AST would be even more work, less elegant and would not preserve information about other aspects that a formatter would care about such as extra paranthesis and casing.

Shivlocin commented 2 months ago

Maybe this project is relevant to the conversation.

SethGower commented 1 month ago

Maybe this project is relevant to the conversation.

There has already been some discussion over there from @kraigher https://github.com/jeremiah-c-leary/vhdl-style-guide/issues/920

jeremiah-c-leary commented 1 month ago

Morning everyone,

Maybe this project is relevant to the conversation.

I am the creator of this project and have spent over 6 years on it's development. I believe I am at the point where I need to re-evaluate my implementation and was wondering if using rust_hdl would be an option to build upon.

Honestly not sure where to start, but maybe this question will do:

Is there an API documented somewhere? I tried, briefly, to look for something but I did not see a link to one.

Regards,

--Jeremy

Schottkyc137 commented 1 month ago

Unfortunately there is no stable API yet. Attempts have been made, but not much has happened since quite some time. One of the main reasons (at least according to what I think) is that there are still a few fundamental design decisions that have to be made before one can publish an API. For example, for the formatter, a lot of changes concerning the AST were necessary, mostly to incorporate source locations into the nodes. You will find the exported documentation for what is available here, but note that what is documented is often lacking.

jeremiah-c-leary commented 1 month ago

Thanks @Schottkyc137 ,

I looked over this thread and this pull request. It looks like some form of output formatting has been implemented.

How do you see this collaboration working out? My knowledge of rust is limited at best, so code contributions would be out of the question. I do have a lot of experience on the user side of formatting VHDL code, what users are looking for and their use cases.

--Jeremy

Schottkyc137 commented 1 month ago

So I'm not quite sure what the best model is here. But what is lacking from the formatter as designed in #321 is some way of enabling user input. So basically, what I am looking for is a "configuration blueprint". What is it that users actually want to configure? I can obviously create this myself, but it would probably reflect my preferences / style. Therefore, I am hoping to somehow build on your expertise in this field. The actual implementation of capabilities described in the config file should actually not be too hard (or at least way easier than building the full formatter, which has been done now)

On of the problem that I see with this "proposal" is that there is a lot of commonality in this project and yours. Essentially, after all is done, vhdl_lang would be able to replace your project. Cannibalising other open source projects is not something that I aim for. Unfortunately, I'm not sure what the best way forward is here. With my opinion on the matter being outlined, do you see any potential in a mutually beneficial collaboration?

jeremiah-c-leary commented 1 month ago

Morning @Schottkyc137 ,

So I'm not quite sure what the best model is here. But what is lacking from the formatter as designed in https://github.com/VHDL-LS/rust_hdl/pull/321 is some way of enabling user input. So basically, what I am looking for is a "configuration blueprint". What is it that users actually want to configure? I can obviously create this myself, but it would probably reflect my preferences / style. Therefore, I am hoping to somehow build on your expertise in this field.

Users want to configure everything. VHDL does not have a standard formatting guide similar to Python's PEP8. All formatting decisions are left to the user, and like anything left to the users there will be variations. If you have not reviewed the documentation for VSG, here are some examples:

We went through a whole discussion on tabs vs spaces, treating prefixes and suffixes of names differently than the base name. Some requests have been quite surprising to me as I would never have thought to format code in requested manner.

I wrote some documentation on how VSG rules should be created. I wish I had taken the time to create that document before I started coding. I believe there are 900 rules for formatting code, but that is a small fraction that could be created.

On of the problem that I see with this "proposal" is that there is a lot of commonality in this project and yours. Essentially, after all is done, vhdl_lang would be able to replace your project. Cannibalising other open source projects is not something that I aim for. Unfortunately, I'm not sure what the best way forward is here.

I was curious about the roadmap for vhdl_lang. The first goal in the README states:

This project aims to provide a fully featured open source VHDL frontend that is easy to integrate into other tools.

Is adding formatting to vhdl_lang going against this goal? I was under the impression it would parse and lex VHDL files and then provide a data structure which could then be queried and manipulated by another application. For example, I was contacted about using the parser out of VSG to help autogenerate some documentation. I could see applications that create block diagrams, track code statistics, get compile order and register documentation.

The key to enabling other use cases is a good API. If the formatting feature you are adding is a separate tool, then it would force the development of the API so others can use it effectively. At the moment, the formatting is within vhdl_lang, so the community does not benefit. One might argue the best way forward would be to update VSG to use vhdl_lang as its parser.

As for cannibalizing VSG, I have mixed feelings. It has been an amazing journey for me and has pushed my coding skills forward. I have had the opportunity to talk with people from around the world and at places like CERN. At the same time it has taken a large amount of my time and I can not focus on other areas of interest:

Cannibalizing would only occur if formatting was added to vhdl_lang. Whether it should be added would be determined by the road map for vhdl_lang. Without a road map, there is nothing to say what is in or out of scope for this project.

With my opinion on the matter being outlined, do you see any potential in a mutually beneficial collaboration?

I am more than willing to share my experiences in creating VSG and any lessons learned.

Would you want to start with the https://github.com/VHDL-LS/rust_hdl/pull/321 pull request. I do have questions about the data structure and opinions on how the formatting should be implemented.

Regards,

--Jeremy

Schottkyc137 commented 1 month ago

Good evening @jeremiah-c-leary

I wrote some documentation on how VSG rules should be created.

I read the document on how to create rules. This is really interesting and might give a good lead. I have also read some of the documentation. It's truly amazing how many different styles there are :D

Is adding formatting to vhdl_lang going against this goal?

In my opinion, it's not. There are two main reasons:

It is the goal of vhdl_lang to be a standalone tool / API, but moreover it is built to support the capabilities of (without actually being dependent on) the Language Server Protocol (LSP) with vhdl_ls being the reference implementation. As LSP supports formatting, vhdl_lang should support that too.
One potential use-case is source-to source conversion, for example, to convert a VHDL 2019 file to VHDL 2008 file (or even earlier). I believe that such use-cases could already be implemented, based on the AST only. However, after one obtains a modified AST, printing that AST to a file is currently not natively supported. Formatting would allow this.

In general, the roadmap resp. goals of this project are – broadly speaking 1) To be a free and open source language tool for VHDL. 2) To provide an efficient language front-end for other tools.

The use-cases you outlined (e.g., block diagrams, code statistics, compile order, e.t.c.) fall under the second category. I find both of these use-cases interesting, but will mostly work on the former (out of personal interest). That being said, I'm also more than happy to help and put work into creating a good API if there are concrete use-cases and helpful user input.

Cannibalizing would only occur if formatting was added to vhdl_lang. Whether it should be added would be determined by the road map for vhdl_lang. Without a road map, there is nothing to say what is in or out of scope for this project.

I will definitely add the formatting capabilities (and I have merged the PR). It's simply too much work that has gone into the tool. In what capacity this is going to be integrated into the whole eco-system is another question though.

Would you want to start with the https://github.com/VHDL-LS/rust_hdl/pull/321 pull request

Even though it's already merged I would really love to hear your opinions and I'm happy to answer any questions of yours.

Best, Lukas

LukasVik commented 1 month ago

Hello Lukas, Jeremy, and others,

I tried the formatting from the latest master branch. Worked well on the 30k lines of code across 160 files that I tested it on. Really cool to see, great job!

Some thoughts:

I think that supporting user-configured style is a black hole of endless work that will never be finished. I prefer the approach of the Python formatter black with a single pre-defined style and almost no user switches.

a. I think that having a fast, functional and reliable formatter is the main goal of most people. Having it produce a consistent style of correctly-formatted code is more important than producing "exactly the exotic style that I prefer".

b. Supporting user-configured style settings adds an incredible amount of complexity. It is highly probable that this complexity will result in a tool that is not reliable.

c. In the end, I don't think everyone is gonna be happy anyway. There's always gonna be corner cases and people with crazy preferences. For me, it's a lesson learned in life and in open source that you're never gonna make everyone happy. Better to focus on making 80% of people happy.
In my opinion, step one should be to have a fast and stable formatter that produces good-looking code. Current master is almost all the way there. Step two should be to add support for automatic line-breaking of statements so the file fits within a user-specified line length. Like black does.

a. To me, this is a way more interesting feature, and adds way more value, than supporting 10 M different combinations of user style preferences.
I think that tools and projects should be neatly scoped. To avoid scope creep, bloat, and to keep complexity down. Think specifically here about the division between formatter and linter. For example, rules about e.g. generic/port/signal name prefixes do not, IMO, belong in a formatter.
From my perspective, it makes more sense if the formatter is a separate project with a separate binary. Some of my reasoning:

a. Some tools that could be built on rust_hdl: formatter (like black), linter (like pylint), import sorter (like isort), documentation extractor, 2019-to-2008 converter, etc. I don't think anyone can make the case that all of this should be in the rust_hdl repo and compiled into the vhdl_lang binary. Most people would argue that they should be separate projects with separate binaries. And if some of them should be separate then I think, logically, all of them should.

b. Main reason for this is of course scoping and complexity.

b. In a typical VSCode Python setup, most LSP duties are handled by pylance, but formatting is handled by black. Nothing wrong with this division.

c. It has been a large historical problem in the VHDL world that there is no generic reliable parser that we can build tools on. Instead every tool, e.g. each formatter, has built its own (limited and application-specific) parser first. Making a formatter in the same repo and tightly-knit with the parser seems to me like repeating a historical mistake.

d. Making the formatter as a separate tool is a great way to showcase how to build "third-party" tools on top of rust_hdl.

I would be very happy to assist in discussions, code reviews, trying new things out, etc. Other than that, I don't really have the Rust knowledge nor the computer science or compiler tech background to understand these things. And I can't allocate time for even more open source work. So unfortunately, I won't contribute any code, but I hope I can be of assistance anyway.