Open dsouzai opened 3 months ago
Thanks for undertaking this. The formatting looks generally fine to me, and the visual delta between what we have now and the new format doesn't really stand out (which is a good thing, in my opinion).
A few initial questions: 1) The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:
TR::Block* block = NULL;
TR::Block *firstColdBlock = NULL, *firstColdExtendedBlock = NULL;
where the pointer definitions differ in the placement of the *
. There will likely be a number of other inconsistencies that we encounter as we go through the code base.
2) Can anything be done about the Doxygen comments to make them more consistently formatted? The project should pick a style and apply it throughout (e.g., @brief vs \brief
, and the formatting of text within the comment)
3) The formatting of comments in general seems inconsistent (at least in terms of line length). Was that intentional, and can it be changed? If it could be changed, we will have to be a lot more diligent reviewing the changes to be sure that comments that shouldn't wrap don't to keep the formatting correct.
@0xdaryl
- The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:
In situations like this, I think we could either 1) change clangformat to right align all the *
, or 2) just fix these as we see them in future. We could also then have a blurb in the coding standard to declare only one variable of a pointer type on single line. e.g.
TR::Block* block = NULL;
TR::Block* firstColdBlock = NULL;
TR::Block* firstColdExtendedBlock = NULL;
@dsouzai your Fix up multiple namespace declarations
script inserts a bunch of white spaces after the newlines.
The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:
Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.
if (condition)
statement;
else
statement2;
versus
if (condition) {
statement;
} else {
statement2;
}
and
switch (TR::Compiler->om.writeBarrierType()) {
case gc_modron_wrtbar_oldcheck:
case gc_modron_wrtbar_cardmark:
case gc_modron_wrtbar_cardmark_and_oldcheck:
case gc_modron_wrtbar_cardmark_incremental:
break;
default:
needWriteBarrier = false;
break;
}
versus
switch (TR::Compiler->om.writeBarrierType()) {
case gc_modron_wrtbar_oldcheck:
case gc_modron_wrtbar_cardmark:
case gc_modron_wrtbar_cardmark_and_oldcheck:
case gc_modron_wrtbar_cardmark_incremental: {
break;
}
default: {
needWriteBarrier = false;
break;
}
}
Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.
For if, else, for, do, while
clang-format 15+ has an InsertBraces
option that should address this. For the switch cases
, I am not aware of any tool that does that, it'd need a deeper understanding of the intent of the code since the braces on the case
are only added for scoping variables defined within the case and I doubt we'd want to enforce those braces all of the time.
While we are here, maybe we could also explore the clang-tidy static analysis tool in addition to clang-format. (list of clang-tidy checks)
the braces on the case are only added for scoping variables defined within the case and I doubt we'd want to enforce those braces all of the time.
I guess that's a matter of taste. I prefer to think of each case as a block of code and like that to be emphasized - hence, I almost always put braces around the blocks of code in the cases. If most others prefer not to do that, that's fine and I can live with that, but I think we should have a consistent approach. Both styles appear in the compiler today, even in the absence of variable declarations - though I don't know how many of the braces without variable declarations might have been introduced by me. :-)
I guess that's a matter of taste.
I also prefer the brace scopes for eachcase
. The issue just may be -- if there was some automated way to apply these scopes -- that some code exists that doesn't have the scopes and defines a variable that is used in multiple case
s below that case
. In this situation, the semantics would break. I doubt that this is actually done anywhere -- or many places -- in the code base though.
I don't know how many of the braces without variable declarations might have been introduced by me. :-)
No harm here!
I think this is another example of how the coding standard should be updated to state that all case
blocks that are not solely fallthrough case
s should be brace-scoped. And if a variable needs to be read or written in multiple case
s, it should be defined in a parent scope.
I think one of the goals of this exercise is to apply consistency in formatting rules throughout the code base. Each of us has different perspectives for how we prefer things to look (and we'll likely never reach consensus on), but I think we will all agree that regardless of the style chosen it should be applied consistently throughout the code base [1].
Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style. @dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?
@dsouzai your Fix up multiple namespace declarations script inserts a bunch of white spaces after the newlines.
@ThanHenderson Yeah I had noticed that, but then I figured since the namespace merging would happen in the same PR as the code formatting, it would get washed away as part of the formatting. That said, I realized that the regex was relatively straightforward to fix, so I updated it to remove spaces before and after.
@dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?
I haven't spent too much time with this tool aside from the work needed to create this issue; I mostly just piggy-backed off the work of others in the past. I'll have to read up the documentation and play around with it more to see if I can answer some of the questions brought up above.
My initial guess though is that we likely won't be able to smooth out absolutely everything in one go, and that some things will need some manual intervention on an ongoing basis. However, that shouldn't prevent us from going ahead with this for a couple of reasons
I'll play around with it though and see. If anyone else also has experience with clang-format, I'd appreciate your insights.
Heh, I should mention, since I only looked at the diff from https://github.com/eclipse/omr/pull/3390, I missed this bit in the clang format file that may have impacted how the code looks:
Language: ObjC
I think the webkit file has expanded since Robert looked at it way back in 2018. Once I go through all the various config options, I'll run the formatter again and see how it looks.
Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style
The syntax inconsistencies should be minimal after clang-format is applied; provided we specify all of the rules that we care about in the .clang-format
file.
I think the important thing to do here is identify exactly the inconsistencies that we find, and devise the appropriate rules for clang-format in light of them. Where there any other inconsistencies that people noticed?
For example, though it may not be everyone's preference, going with option 2 here should eliminate all inconsistencies with pointer types.
Certain other inconsistencies, like the braced-case
blocks the Henry mentioned, are not only syntax formatting problems; a change to the syntax may change the semantics. A formatter cannot handle those cases and thus the coding standard will need to be more specific about the opinions there.
w.r.t. the comment formatting, clang-format has some comment formatting options and general options that also apply to comments. However, they are pretty basic.
Heh, I should mention, since I only looked at the diff from https://github.com/eclipse/omr/pull/3390, I missed this bit in the clang format file that may have impacted how the code looks:
That shouldn't affect it since that section was separated by a ---
.
I went through the clang-format documentation, and played around with the config a bit. You can look at the latest branch https://github.com/dsouzai/omr/tree/clangformat_20240813. I'm using
clang-format version 18.1.8
as it is the latest stable release.
An important fact I came across from reading online is it is important everyone use the same version of clang-format. Apparently, the output of newer versions of clang-format will not be a strict superset of the output of older versions. Therefore, whatever version of clang-format is chosen will need to be the one we use for good.
I had to do
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
tar -xvf clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
sudo apt-get install libtinfo5
to run clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04/bin/clang-format
.
The diff of the clang-format file is now
1,2c1
< ---
< # BasedOnStyle: WebKit
---
> BasedOnStyle: WebKit
4a4
> AlignArrayOfStructures: Right
5a6
> AlignConsecutiveBitFields: false
7c8,9
< AlignEscapedNewlines: Right
---
> AlignConsecutiveMacros: false
> AlignEscapedNewlines: Left
9a12
> AllowAllArgumentsOnNextLine: false
12a16
> AllowShortEnumsOnASingleLine: false
14a19
> AllowShortLambdasOnASingleLine: true
19c24
< AlwaysBreakTemplateDeclarations: No
---
> AlwaysBreakTemplateDeclarations: MultiLine
22c27,29
< BraceWrapping:
---
> BitFieldColonSpacing: After
> BraceWrapping:
> AfterCaseLabel: false
24c31
< AfterControlStatement: false
---
> AfterControlStatement: Never
32a40,41
> BeforeLambdaBody: false
> BeforeWhile: false
34,36c43,48
< SplitEmptyFunction: true
< SplitEmptyRecord: true
< SplitEmptyNamespace: true
---
> SplitEmptyFunction: false
> SplitEmptyRecord: false
> SplitEmptyNamespace: false
> BreakAdjacentStringLiterals: true
> BreakAfterAttributes: Never
> BreakArrays: false
38,39c50,51
< BreakBeforeBraces: WebKit
< BreakBeforeInheritanceComma: false
---
> BreakBeforeBraces: Custom
> BreakBeforeInlineASMColon: Never
41d52
< BreakConstructorInitializersBeforeComma: false
43c54
< BreakAfterJavaFieldAnnotations: false
---
> BreakInheritanceList: BeforeComma
45,48c56,58
< ColumnLimit: 0
< CommentPragmas: '^ IWYU pragma:'
< CompactNamespaces: false
< ConstructorInitializerAllOnOneLineOrOnePerLine: false
---
> ColumnLimit: 120
> CommentPragmas: '^ IWYU pragma:|SPDX-License-Identifier:'
> CompactNamespaces: true
53a64,65
> EmptyLineAfterAccessModifier: Never
> EmptyLineBeforeAccessModifier: LogicalBlock
55,56c67,68
< FixNamespaceComments: false
< ForEachMacros:
---
> FixNamespaceComments: true
> ForEachMacros:
59a72
> IncludeBlocks: Preserve
73c86,91
< IndentCaseLabels: false
---
> IndentAccessModifiers: false
> IndentCaseBlocks: false
> IndentCaseLabels: true
> IndentExternBlock: NoIndent
> IndentGotoLabels: false
> IndentPPDirectives: None
76,78c94,99
< JavaScriptQuotes: Leave
< JavaScriptWrapImports: true
< KeepEmptyLinesAtTheStartOfBlocks: true
---
> InsertBraces: false
> InsertNewlineAtEOF: true
> KeepEmptyLinesAtEOF: true
> KeepEmptyLinesAtTheStartOfBlocks: false
> Language: Cpp
> LineEnding: LF
83,85c104,105
< ObjCBlockIndentWidth: 4
< ObjCSpaceAfterProperty: true
< ObjCSpaceBeforeProtocolList: true
---
> PPIndentWidth: 0
> PackConstructorInitializers: Never
93,95c113,121
< PointerAlignment: Left
< ReflowComments: true
< SortIncludes: true
---
> PointerAlignment: Right
> QualifierAlignment: Leave
> ReferenceAlignment: Right
> ReflowComments: true
> RemoveBracesLLVM: false
> RemoveParentheses: Leave
> RemoveSemicolon: false
> SeparateDefinitionBlocks: Always
> SortIncludes: false
97a124
> SpaceAfterLogicalNot: false
98a126
> SpaceAroundPointerQualifiers: Before
99a128
> SpaceBeforeCaseColon: false
100a130,131
> SpaceBeforeCtorInitializerColon: true
> SpaceBeforeInheritanceColon: true
101a133,134
> SpaceBeforeSquareBrackets: false
> SpaceInEmptyBlock: false
106a140,142
> SpacesInLineCommentPrefix:
> Minimum: 1
> Maximum: -1
109c145
< Standard: Cpp11
---
> Standard: c++03
112,115d147
< ---
< Language: ObjC
< PointerAlignment: Right
< ...
- The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:
TR::Block* block = NULL; TR::Block *firstColdBlock = NULL, *firstColdExtendedBlock = NULL;
where the pointer definitions differ in the placement of the
*
. There will likely be a number of other inconsistencies that we encounter as we go through the code base.
After looking at the docs, as @ThanHenderson said, using
PointerAlignment: Right
is the easiest solution for this.
2. Can anything be done about the Doxygen comments to make them more consistently formatted? The project should pick a style and apply it throughout (e.g.,
@brief vs \brief
, and the formatting of text within the comment)
It doesn't look like clang-format supports anything like that. However, doing
$ find ./compiler -iname '*.h' -o -iname '*.hpp' -o -iname '*.cpp' | xargs perl -ne 'print "$1\n" if /(\\\S+)/g' | grep -v '\\\(n\|t\|x\|"\|/\|0\|_\||\|\\\|#\|@\)' | sort -u | uniq
\TODO:
\a
\attention
\b
\brief
\c
\class
\code
\def
\deprecated
\details
\em
\endcode
\endverbatim
\example
\file
\ingroup
\p
\page
\par
\param
\param[in,out]
\param[in]
\param[out]
\parm
\pre
\ref
\return
\returns
\section
\see
\structure
\varbatim
\verbatim
\warning
and
$ find ./compiler -iname '*.h' -o -iname '*.hpp' -o -iname '*.cpp' | xargs perl -ne 'print "$1\n" if /(@\S+)/g' | grep -v '@\(@\|%d\|%p\|[[:digit:]]\)' | sort -u | uniq
@</tt>
@ASSERT
@ForceInline",
@Stable",
@TODO:
@arm.com/
@brief
@brief:
@c
@class
@code
@ddr_namespace:
@deprecated
@details
@endcode
@endverbatim
@exit
@file
@id
@name
@name:
@note
@note:
@p
@param
@param[In]
@param[Out]
@param[in,out]
@param[in/out]
@param[in]
@param[in]:
@param[inout]
@param[out]
@param[rn]
@parm
@ref
@return
@return:
@returns
@sa
@see
@struct
@todo
@tparam
@tparm
@tt:
@verbatim
@{
@}
shows a handful of doxygen keywords in the compiler dir. It should be fairly straightforward to do a search and replace of either the \<keyword>
or @<keyword>
3. The formatting of comments in general seems inconsistent (at least in terms of line length). Was that intentional, and can it be changed? If it could be changed, we will have to be a lot more diligent reviewing the changes to be sure that comments that shouldn't wrap don't to keep the formatting correct.
I believe even the original clang-format config file enabled reflowing comments. According to the doc, the reflow respects the columnLimit, which is set to 120. As such, the comments should be consistent at least wrt to the column limit. Now, we could lower the columnLimit to say 80 or something like that, but I was just using what was in https://github.com/eclipse/omr/pull/3390.
Do you have an example of the inconsistency?
Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.
Since clang-format v15, you can specify InsertBraces
:
Insert braces after control statements (if, else, for, do, and while) in C++ unless the control statements are inside macro definitions or the braces would enclose preprocessor directives.
However, it comes with the warning
Setting this option to true could lead to incorrect code formatting due to clang-format’s lack of complete semantic information. As such, extra care should be taken to review code changes made by this option.
So it's not something that can be done when applying to a big chunk of code as we're proposing to do.
Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style. @dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?
After looking at the docs and online, all the inconsistencies brought up in this issue thus far can't be addressed in one go; it'll have to be something that we clean up as we come across them. Really, only the pointer alignment issue can be addressed.
However, it comes with the warning Setting this option to true could lead to incorrect code formatting due to clang-format’s lack of complete semantic information. As such, extra care should be taken to review code changes made by this option. So it's not something that can be done when applying to a big chunk of code as we're proposing to do.
InsertBraces
was pulled from clang-tidy; we could explore clang-tidy too for things that involve semantics and not just white space.
Prompted by a discussion with @hzongaro on a PR I opened yesterday, Would it be possible to add a rule which enforces comments documenting conditional directives as outlined in the OMR C and C++ Coding Standards? I know the coding standards are more of a loose guideline throughout many components of the codebase, but these comments do appear frequently in both OMR and OpenJ9, and I personally find them very useful for readability. For example:
#if defined(A) || defined(B)
...
#elif !defined(C) /* defined(A) || defined(B) */
...
#else /* !defined(C) */
...
#endif /* defined(A) || defined(B) */
I'm working on a Python script that does exactly this and so far it appears to be feasible to do programmatically. I was intending to use it to do a one time sweeping change to introduce missing comments or fix incorrect ones, but maintaining these comments would be sort of inconvenient without a formatting tool like the one we're planning on using here. On the other hand, from some cursory research, it doesn't really seem like the kind of thing that clang-format does. Perhaps those who are more familiar with clang-format know whether this sort of thing is possible.
I think it's a good idea in principle. I'm not familiar with a way to do this via clang-format, and I don't imagine there would be one. A nice stack-based approach for adding in the comments may suffice for the initial pass, and taking care to fix anything suspicious.
Once that initial pass is done, for a CI solution we could scan for commentless #elif
, #else
, and #endif
directives and emit a warning if they are bare, or just run the script to fix it up automatically. The problem with the latter I guess is if we want to allow exceptions to this rule (though I can't think of reasons as to why we would want to allow that).
When I had done my initial research online, it didn't look like clang-format
could do anything about ensuring a comment on the #endif
et al., which I thought was odd because it can add comments at the end of a namespace; I guess it doesn't process preprocessor directives beyond just expansion.
A script as part of the CI to check that all modified files in a PR do not have a commentless $endif
et al. is probably reasonable to do; it'd just be another linter job.
Overview
In the August 1, 2024 OMR Architecture Meeting, we discussed changing the code formatting style of the compiler component. A prior effort involved an attempt to format the entire OMR project, but ultimately that was not a direction the community as a whole agreed on^1. The current attempt to format only the compiler component stems from the fact that original effort was led by the compiler component, and it also had the largest buy-in.
The main ongoing benefit of this effort, aside from the fact that it will yield code that is formatted in a style that is generally satisfactory to the majority of the compiler developers, is the ability to run tools such as linters that can be used to enforce the coding style; this reduces the burden on reviewers allowing them to focus purely on the functional aspects of a pull request. Targeting only the compiler component also has the added benefit of providing a "proof of concept" that other components can use to re-evaluate their position; the barrier to entry would also be lower as the majority of the infrastructure would already be in place.
The outcome of the August 1st OMR meeting is the following set of actions:
.clang-format
file to the compiler component,The proposed code format is the style that was accepted in the previous attempt, namely style no. 6 which is based on https://github.com/eclipse/omr/pull/3390.
To compare the code with and without this style applied, please see:
Plan
clang-format
(probably the latest stable version)clang-format
on the machine that runs the Line Endings jobScripts
Fix up multiple namespace declarations:
pre-commit git hook:
Rebase helper (though this may not work for multiple commits changing the same file):
Azure pipeline code format checker: