BlankSpruce / gersemi

A formatter to make your CMake code the real treasure
Mozilla Public License 2.0
152 stars 7 forks source link

Disable formatting via configuration file #35

Closed qweasdzxc787 closed 2 weeks ago

qweasdzxc787 commented 1 month ago

Can we get this as an option in the configuration file. There are some legacy repositories that I work in that don't want CMake files to be formatted, it would be nice to add that in a .gersemirc in those repos so they won't be autoformatted without having to disable the tool globally.

clang-format has similar functionality.

BlankSpruce commented 1 month ago

I need you to elaborate further because perhaps I'm missing something. If you don't want to use formatter then why do you need an option to tell it to not format instead of just not using it? What does it mean for you "to disable tool globally"?

nlebedenco commented 1 month ago

@BlankSpruce This can be useful to avoid accidental changes when you have a complex source tree for which certain subdirectories are known to contain files that should not be formatted for some reason (for example, 3rd party source code, git submodules etc). Mind that in many cases, gersemi might not be directly invoked by the user in a command-line interface but integrated in an IDE, or indirectly invoked by a build automation script or a pre-commit hook so the user may not have a direct way of separating which files should be formatted and which should not in every occasion. Not to mention the risk of accidental changes. And even when those changes can be reverted the user may not be able to do so until a long time has been wasted formatting files that should have never been touched from the start. With clang-format for example we can disable formatting for anything under a certain folder by creating a file that contains:

DisableFormat: true
SortIncludes: Never

And in addition since version 18 we can use .clang-format-ignore for more control, similarly to how .gitignore works.

With cmake-format this could be accomplished by creating a .cmake-format file containing:

format:
  disable: true

I would argue in favor of at least a way to disable formatting under a folder so we can ensure developers will never be format sensitive files by accident regardless of the integration used.

BlankSpruce commented 1 month ago

I'll think about the original request however it'd require to implement finding .gersemirc closest to the formatted file instead of the current approach. You might (or not) have noticed that configuration file is selected from the most common directory among formatted files. Example:

$ pwd
/tmp/my-project

$ gersemi foo/CMakeLists.txt
# candidate configuration files: 
# - /tmp/my-project/foo/.gersemirc
# - /tmp/my-project/.gersemirc
# - /tmp/.gersemirc
# - /.gersemirc

$ gersemi foo/CMakeLists.txt bar/CMakeLists.txt
# candidate configuration files: 
# - /tmp/my-project/.gersemirc
# - /tmp/.gersemirc
# - /.gersemirc

$ gersemi /tmp
# candidate configuration files:
# - /tmp/.gersemirc
# - /.gersemirc

I thought this is good enough approach because I don't think I'd like to have fragmented configuration (this directory formatted like such and such, that directory like such and such). Having mentioned all of that perhaps more suitable approach would be to introduce exclude: <regex>|<regexes> to ignore certain files/directories.

I can see some merit of "disable formatter" now but I need to think what's the best to suit those needs.

qweasdzxc787 commented 1 month ago

I wasn’t aware that each directory could contain a .gersemirc file (documentation clarification needed). With that said, we’d still need a disable formatting flag in the file for the desired behavior.

A debug mode that shows more information about what is being used for formatting could show the path of the found nearest configuration file and what settings are being used would be nice in general, doesn’t look like that’s currently possible with gersemi.

Another nice addition to gersemi would be the ability to recursively process all CMakeLists.txt files found in the current directory and downwards (respecting the above override configuration structure) so you could one shot format an entire projects CMake instead of having to manually format each file. That also helps when a new gersemi formatting option is released or a formatting change is made to a configuration file.

This tool has been very helpful for maintaining a consistent (beautiful!) CMake style format in and across projects. Build systems are often an afterthought or a necessary evil for most developers, but they are a critical, foundational part of the code and deserve the same level of tooling as the code itself. Really appreciate the work @BlankSpruce.

On Mon, Oct 7, 2024 at 01:55 BlankSpruce @.***> wrote:

I'll think about the original request however it'd require to implement finding .gersemirc closest to the formatted file instead of the current approach. You might (or not) have noticed that configuration file is selected from the most common directory among formatted files. Example:

$ pwd /tmp/my-project

$ gersemi foo/CMakeLists.txt# candidate configuration files: # - /tmp/my-project/foo/.gersemirc# - /tmp/my-project/.gersemirc# - /tmp/.gersemirc# - /.gersemirc

$ gersemi foo/CMakeLists.txt bar/CMakeLists.txt# candidate configuration files: # - /tmp/my-project/.gersemirc# - /tmp/.gersemirc# - /.gersemirc

$ gersemi /tmp# candidate configuration files:# - /tmp/.gersemirc# - /.gersemirc

I thought this is good enough approach because I don't think I'd like to have fragmented configuration (this directory formatted like such and such, that directory like such and such). Having mentioned all of that perhaps more suitable approach would be to introduce exclude: | to ignore certain files/directories.

I can see some merit of "disable formatter" now but I need to think what's the best to suit those needs.

— Reply to this email directly, view it on GitHub https://github.com/BlankSpruce/gersemi/issues/35#issuecomment-2396060381, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGANGOJYGAAJVX3V6TODTC3Z2IV7XAVCNFSM6AAAAABPKSLMI2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOJWGA3DAMZYGE . You are receiving this because you authored the thread.Message ID: @.***>

BlankSpruce commented 1 month ago

I wasn’t aware that each directory could contain a .gersemirc file (documentation clarification needed). With that said, we’d still need a disable formatting flag in the file for the desired behavior.

Documentation will be enhanced on the next opportunity. I will make it clear that the most common configuration file is taken into consideration (in contrast to clang-format which takes the closest one if I recall correctly). Also in my previous answer I'm arguing that exclude will probably be better than just disable due to example provided by @nlebedenco and simply because it won't require me to change selection method of configuration file. I'd need stronger argument for change from "most common config file" to "closest config file".

A debug mode that shows more information about what is being used for formatting could show the path of the found nearest configuration file and what settings are being used would be nice in general, doesn’t look like that’s currently possible with gersemi.

I've been thinking about it for some time so this will happen in some form.

Another nice addition to gersemi would be the ability to recursively process all CMakeLists.txt files found in the current directory and downwards (respecting the above override configuration structure) so you could one shot format an entire projects CMake instead of having to manually format each file. That also helps when a new gersemi formatting option is released or a formatting change is made to a configuration file.

Unless I'm misunderstanding what you mean this has been the reality for quite some time already. In fact, due to startup overhead, it's always been preferable to pass all the files or directories in one invocation to the point I've made sure that there is requires_serial: true in pre-commit hook definition. Similarly the usage description mentions that each positional argument is either file or directory. These are perfectly fine invocations:

# check all CMakeLists.txt and *.cmake files in this or child directory
$ gersemi -c . 

# check 
# - all CMakeLists.txt and *.cmake files in foo and bar directories (assuming these are directories)
# - ./CMakeLists.txt
# - baz/foo/utilities.cmake
$ gersemi -c foo CMakeLists.txt baz/foo/utilities.cmake bar

# check each file in separate invocation (less efficient but possible)
$ gersemi -c CMakeLists.txt
$ gersemi -c baz/foo/utilities.cmake
$ gersemi -c whatever.cmake

I'd recommend having only one .gersemirc per project, like at project's root, so that these invocations are equivalent in outcome.

nlebedenco commented 1 month ago

I'd recommend having only one .gersemirc per project, like at project's root, so that these invocations are equivalent in outcome.

This cannot be controlled in projects with external dependencies. If I have a project A that depends on project B and it has its own .gersemirc I'll be locked out. For example:

Project A
  |__ .gersemirc
  |__  Project B
          |__ .gersemirc
  |__  Project C
    ...

Also consider cases where the files passed do not share a common root (which can happen on Windows) or only share the system root (/) which can happen on Unixes. This is very common in setups where CMake modules are stored in a separate repo shared by multiple other projects in the same company.

The common root approach is probably too opinionated and too simplistic. Note that even python Black suffers from it (aggravated by the fact that they decided to assume a .git folder also marks the project root regardless of the path passed in the --config argument, again so simplistic it makes it impossible even to define proper exclusion rules because Black internally matches the regex to relative paths based on what it thinks is the project root even when the files passed in the command-line have absolute paths).

BlankSpruce commented 1 month ago

@nlebedenco Insightful comment. Let's break it down.

This cannot be controlled in projects with external dependencies. If I have a project A that depends on project B and it has its own .gersemirc I'll be locked out.

As far as I can imagine these would be usage patterns: 1) Projects on which we (Project A) depend are not edited within the scope of our project, for example it's pulled in as git submodule.

In this case I'd propose (once I implement that, regex syntax is subject to change):

## Project A/.gersemirc
# ... other configuration knobs
exclude: 
- "\./Project B/.*"
- "\./Project C/.*"

Since we pull it in maybe we could assume (or orchestrate) that whoever maintains these dependencies have already formatted them. Even if for whatever reason they do have different rules than we do (let's say line_length) this should be enough.

If these dependencies aren't formatted it probably should be avoided to reformat them nevertheless because (as in git submodule approach) it might be inconvenient to have formatting overwritten any time dependencies are updated.

2) Project on which we depend are edited within our scope. In this case dependencies are essentially subprojects.

a) Subprojects do follow the same formatting rules. 
Great. One configuration file, win-win.

b) Subprojects do not follow the same formatting rules. Examples:
- sometimes previously separate projects are merged into monorepo
- different teams developed stuff separately
- we made a direct copy of some code into our repository because of "reasons" (I've seen that)

This could be point of contention. Here's my question: what should be prioritized? Maintaining status quo of visibly different formatting rules or making that final leap and commonizing formatting rules with superproject? Personally I'd favour the latter but I also understand your milage may vary. 

If your case is exactly (2b) then I'm interested if you have opposite view and if you can expand on what I'm missing.

Also consider cases where the files passed do not share a common root (which can happen on Windows) or only share the system root (/) which can happen on Unixes. This is very common in setups where CMake modules are stored in a separate repo shared by multiple other projects in the same company.

This is really interesting model of sharing code and I don't judge you, all kinds of stuff happens. Nevertheless it feels like this somewhat follows usage pattern (1) mentioned above. I don't think these separate projects should be formatted with single invocation but rather one invocation per project with definition of project as in Wikipedia.

The common root approach is probably too opinionated and too simplistic. Note that even python Black suffers from it (aggravated by the fact that they decided to assume a .git folder also marks the project root regardless of the path passed in the --config argument, again so simplistic it makes it impossible even to define proper exclusion rules because Black internally matches the regex to relative paths based on what it thinks is the project root even when the files passed in the command-line have absolute paths).

I will have to read further about Black problems because I might be simply unaware of issues ahead. I can note that I've made a choice to always normalize paths to sources and definitions to operate on absolute paths. If I'm going to go with exclude knob I'm assuming that it will allow for path regexes relative to configuration file in the file or relative to CWD in case of command line supplied regexes and those will be normalized (with some currently unclear definition of normalization for regexes) as well.

This is very productive and very interesting issue and I hope we reach a good solution. I'm not denying possibility of "closest config file" but I want to see how far I can go with what's already present. :)

BlankSpruce commented 1 month ago

Regarding closest configuration file approach there are a couple of things that aren't obvious to me how should I deal with them. Let's consider the following structure :

project
├── bar
│   ├── CMakeLists.txt
│   └── .gersemirc # [A]
├── CMakeLists.txt
├── foo
│   ├── CMakeLists.txt
│   └── something_else.txt
├── .gersemirc # [B]
└── something.cmake

Let's consider this invocation inside project directory:

$ gersemi --check .

1) What should be done if there is workers: 4 in [A] config and workers: 1 in [B] config? Should it lead to checking some set of files with 4 workers and some set of files with 1 worker? 2) What about color as in $ gersemi --diff .? Let's say there's color: true in [A] and color: false in [B]. Should the output be partially colored? Completely colored? Not colored at all? 3) Not as weird but similar things are to be considered about unsafe or quiet.

This indicates that there are two kinds of settings at play:

Perhaps I shouldn't conflate in one file so different things but also as a user I must say it's damn convenient this way. Perhaps behaviour affecting the settings should only follow default < command line chain and those affecting outcome default < configuration file < command line. I'm not comfortable enough right now in breaking that already established modus operandi even if it has some other shortcomings.

Regarding disable: I'm slowly convincing myself that disable is superior approach than exclude because proper selection of files can always be implemented by user in a script in any way they see fit. For example: $ gersemi CMakeLists.txt *.cmake foo will perfectly deal with excluding bar directory without explicitly listing all the files in foo directory. At the same time disable would be rather convenient for single file usage like it happens in editors and IDEs. However the big caveat is that without changing approach from "most common configuration file" to "closest configuration file" it's not going to be useful in practice. I've backed myself into the corner and I don't know right now how to make it nice but not disruptive nor surprising to current users.

petk commented 1 month ago

Hello, as this topic is opened. Would it be also possible to add a command-line configuration option to select specific configuration .gersemirc file from a given location?

Perhaps something like:

gersemi --config path/to/.gersemirc src ...
BlankSpruce commented 1 month ago

@petk Once I decide on how to proceed with feature requested in this issue I'll definitely consider --config. I don't want to take care of extra work before that happens and as you can see there is no clear winner yet. I need some time to ponder but it's a bit difficult right now for me.

BlankSpruce commented 3 weeks ago

On next-release branch there is new version of gersemi with the following changes:

I haven't done --print-config and improvements to README yet but I intend to finish that before I release next version. I'm going to postpone that release to have some time for extra testing. Any information about bugs related to aforementioned features or unclear description/improper wording in --help will be appreciated.

BlankSpruce commented 2 weeks ago

All the things in my previous comment and a couple more are available in 0.17.0. If there is any problem with these new features please open new issue instead of commenting here.

petk commented 2 weeks ago

Thank you so much @BlankSpruce according to some quick testing it works very nicely