eclipse-omr / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr
Other
950 stars 397 forks source link

Format the code in the Compiler component using clang-format #7434

Open dsouzai opened 3 months ago

dsouzai commented 3 months ago

Overview

In the August 1, 2024 OMR Architecture Meeting, we discussed changing the code formatting style of the compiler component. A prior effort involved an attempt to format the entire OMR project, but ultimately that was not a direction the community as a whole agreed on^1. The current attempt to format only the compiler component stems from the fact that original effort was led by the compiler component, and it also had the largest buy-in.

The main ongoing benefit of this effort, aside from the fact that it will yield code that is formatted in a style that is generally satisfactory to the majority of the compiler developers, is the ability to run tools such as linters that can be used to enforce the coding style; this reduces the burden on reviewers allowing them to focus purely on the functional aspects of a pull request. Targeting only the compiler component also has the added benefit of providing a "proof of concept" that other components can use to re-evaluate their position; the barrier to entry would also be lower as the majority of the infrastructure would already be in place.

The outcome of the August 1st OMR meeting is the following set of actions:

  1. Create an issue to outline the plan and discuss the proposed code format (this issue).
  2. Inform downstream projects of this plan.
  3. Open a pull request that reformats the entire compiler component; the PR should also:
    • address the concern Daryl brought up
    • add the linter job,
    • add the .clang-format file to the compiler component,
    • add a script to help developers rebase their changes once the code format PR lands, and
    • provide information on how to add git hooks for developers so that it automatically formats code in the compiler component

The proposed code format is the style that was accepted in the previous attempt, namely style no. 6 which is based on https://github.com/eclipse/omr/pull/3390.

To compare the code with and without this style applied, please see:

  1. Current Style
  2. Proposed Style

Plan

  1. Decide on the version of clang-format (probably the latest stable version)
  2. Install clang-format on the machine that runs the Line Endings job
  3. Open PR with code formatted (which may or may not include the namespace fixup); include githook script.
  4. Inform downstream communities about the PR
  5. Once PR is merged, developers can use the rebase helper to update their commits.

Scripts

Fix up multiple namespace declarations:

#!/bin/bash

files=$(find ./compiler -iname '*.h' -o -iname '*.hpp' -o -iname '*.cpp')

echo $files | xargs perl -0777 -i -pe 's/((namespace OMR {[\S \t]+}\n)+)/namespace OMR {\n$1}\n/g'
echo $files | xargs perl -0777 -i -pe 's/((namespace TR {[\S \t]+}\n)+)/namespace TR {\n$1}\n/g'
echo $files | xargs perl -i -pe 's/namespace (OMR|TR) {[ \t]*([\S ]*\S)[ \t]*}/$2/g'

pre-commit git hook:

#!/usr/bin/bash

for FILE in $(git diff --cached --name-only)
do
  if [[ $FILE == compiler/*.cpp ]] || [[ $FILE == compiler/*.h ]] || [[ $FILE == compiler/*.hpp ]] ; then
    clang-format -i -style=file:./compiler/.clang-format $FILE
    git add $FILE
  fi
done

Rebase helper (though this may not work for multiple commits changing the same file):

#!/bin/bash

if [ -z "$1" ] ; then
  echo "Specify the number of commits to prepare for rebase."
  exit
fi

if [ -z $(command -v clang-format) ] ; then
  echo "clang-format is not installed!"
  exit
fi

curl https://raw.githubusercontent.com/eclipse/omr/master/compiler/.clang-format -o /tmp/omr-clang-format
cat <<RCFH > /tmp/rebase_clang_format_helper.sh
#!/bin/bash

for FILE in \$(git diff --name-only HEAD~ HEAD)
do
  if [[ \$FILE == compiler/*.cpp ]] || [[ \$FILE == compiler/*.h ]] || [[ \$FILE == compiler/*.hpp ]] ; then
    clang-format -i -style=file:/tmp/omr-clang-format \$FILE
  fi
done
if [ "\$(git diff --name-only)" != "" ] ; then
  git commit -a --fixup \$(git rev-parse HEAD)
fi

true
RCFH
chmod u+rx /tmp/rebase_clang_format_helper.sh

rebaseSHA=$(git rev-parse HEAD~$1)
git rebase -x "/tmp/rebase_clang_format_helper.sh" $rebaseSHA
GIT_EDITOR=true git rebase -i --autosquash $rebaseSHA

rm /tmp/omr-clang-format
rm /tmp/rebase_clang_format_helper.sh

Azure pipeline code format checker:

jobs:
  - job:
  displayName: 'Check Code Format'
  pool:
    vmImage: 'ubuntu-latest'
  steps:
    - script: |
      allFiles=`git diff -C --diff-filter=ACM --name-only origin/master HEAD --`
      if [ x"$allFiles" = x ] ; then
        echo "There are no files to check for code formatting."
      else
        badFiles=
        for file in $allFiles ; do
          if [[ $file == compiler/*.cpp ]] || [[ $file == compiler/*.h ]] || [[ $file == compiler/*.hpp ]] ; then
            clang-format -i -style=file:./compiler/.clang-format $file
            if [ "$(git diff --name-only HEAD)" != "" ] ; then
              echo "ERROR - clang-format should be a NOP: '$file'"
              badFiles="$badFiles $file"
              git checkout -- $file
            fi
          fi
        done
        hashes='###################################'
        if [ x"$badFiles" != x ] ; then
          echo "$hashes"
          echo "The following files were modified and have incorrect code formatting:"
          for file in $badFiles ; do
            echo "$file"
          done
          echo "$hashes"
          exit 1
        else
          echo "All modified files appear to have correct code formatting."
        fi
      fi
    displayName: 'Check modified files'
0xdaryl commented 3 months ago

Thanks for undertaking this. The formatting looks generally fine to me, and the visual delta between what we have now and the new format doesn't really stand out (which is a good thing, in my opinion).

A few initial questions: 1) The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:

    TR::Block* block = NULL;
    TR::Block *firstColdBlock = NULL, *firstColdExtendedBlock = NULL;

where the pointer definitions differ in the placement of the *. There will likely be a number of other inconsistencies that we encounter as we go through the code base.

2) Can anything be done about the Doxygen comments to make them more consistently formatted? The project should pick a style and apply it throughout (e.g., @brief vs \brief, and the formatting of text within the comment)

3) The formatting of comments in general seems inconsistent (at least in terms of line length). Was that intentional, and can it be changed? If it could be changed, we will have to be a lot more diligent reviewing the changes to be sure that comments that shouldn't wrap don't to keep the formatting correct.

ThanHenderson commented 3 months ago

@0xdaryl

  1. The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:

In situations like this, I think we could either 1) change clangformat to right align all the *, or 2) just fix these as we see them in future. We could also then have a blurb in the coding standard to declare only one variable of a pointer type on single line. e.g.

    TR::Block* block = NULL;
    TR::Block* firstColdBlock = NULL;
    TR::Block* firstColdExtendedBlock = NULL;

@dsouzai your Fix up multiple namespace declarations script inserts a bunch of white spaces after the newlines.

hzongaro commented 3 months ago

The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:

Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.

if (condition)
    statement;
else
    statement2;

versus

if (condition) {
    statement;
} else {
    statement2;
}

and

    switch (TR::Compiler->om.writeBarrierType()) {
    case gc_modron_wrtbar_oldcheck:
    case gc_modron_wrtbar_cardmark:
    case gc_modron_wrtbar_cardmark_and_oldcheck:
    case gc_modron_wrtbar_cardmark_incremental:
        break;
    default:
        needWriteBarrier = false;
        break; 
    }

versus

    switch (TR::Compiler->om.writeBarrierType()) {
    case gc_modron_wrtbar_oldcheck:
    case gc_modron_wrtbar_cardmark:
    case gc_modron_wrtbar_cardmark_and_oldcheck:
    case gc_modron_wrtbar_cardmark_incremental: {
        break;
    }
    default: {
        needWriteBarrier = false;
        break; 
    }
    }
ThanHenderson commented 3 months ago

Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.

For if, else, for, do, while clang-format 15+ has an InsertBraces option that should address this. For the switch cases, I am not aware of any tool that does that, it'd need a deeper understanding of the intent of the code since the braces on the case are only added for scoping variables defined within the case and I doubt we'd want to enforce those braces all of the time.

While we are here, maybe we could also explore the clang-tidy static analysis tool in addition to clang-format. (list of clang-tidy checks)

hzongaro commented 3 months ago

the braces on the case are only added for scoping variables defined within the case and I doubt we'd want to enforce those braces all of the time.

I guess that's a matter of taste. I prefer to think of each case as a block of code and like that to be emphasized - hence, I almost always put braces around the blocks of code in the cases. If most others prefer not to do that, that's fine and I can live with that, but I think we should have a consistent approach. Both styles appear in the compiler today, even in the absence of variable declarations - though I don't know how many of the braces without variable declarations might have been introduced by me. :-)

ThanHenderson commented 3 months ago

I guess that's a matter of taste.

I also prefer the brace scopes for eachcase. The issue just may be -- if there was some automated way to apply these scopes -- that some code exists that doesn't have the scopes and defines a variable that is used in multiple cases below that case. In this situation, the semantics would break. I doubt that this is actually done anywhere -- or many places -- in the code base though.

I don't know how many of the braces without variable declarations might have been introduced by me. :-)

No harm here!

I think this is another example of how the coding standard should be updated to state that all case blocks that are not solely fallthrough cases should be brace-scoped. And if a variable needs to be read or written in multiple cases, it should be defined in a parent scope.

0xdaryl commented 3 months ago

I think one of the goals of this exercise is to apply consistency in formatting rules throughout the code base. Each of us has different perspectives for how we prefer things to look (and we'll likely never reach consensus on), but I think we will all agree that regardless of the style chosen it should be applied consistently throughout the code base [1].

Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style. @dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?

dsouzai commented 3 months ago

@dsouzai your Fix up multiple namespace declarations script inserts a bunch of white spaces after the newlines.

@ThanHenderson Yeah I had noticed that, but then I figured since the namespace merging would happen in the same PR as the code formatting, it would get washed away as part of the formatting. That said, I realized that the regex was relatively straightforward to fix, so I updated it to remove spaces before and after.

@dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?

I haven't spent too much time with this tool aside from the work needed to create this issue; I mostly just piggy-backed off the work of others in the past. I'll have to read up the documentation and play around with it more to see if I can answer some of the questions brought up above.

My initial guess though is that we likely won't be able to smooth out absolutely everything in one go, and that some things will need some manual intervention on an ongoing basis. However, that shouldn't prevent us from going ahead with this for a couple of reasons

  1. The code base already contains inconsistencies; any inconsistencies that appear/remain should be minimal.
  2. This makes the code format data driven and thus enables automation both on the developer's side (via git hooks / IDE plugins) and on the project side (via the autochecker).

I'll play around with it though and see. If anyone else also has experience with clang-format, I'd appreciate your insights.

dsouzai commented 3 months ago

Heh, I should mention, since I only looked at the diff from https://github.com/eclipse/omr/pull/3390, I missed this bit in the clang format file that may have impacted how the code looks:

Language: ObjC

I think the webkit file has expanded since Robert looked at it way back in 2018. Once I go through all the various config options, I'll run the formatter again and see how it looks.

ThanHenderson commented 3 months ago

Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style

The syntax inconsistencies should be minimal after clang-format is applied; provided we specify all of the rules that we care about in the .clang-format file.

I think the important thing to do here is identify exactly the inconsistencies that we find, and devise the appropriate rules for clang-format in light of them. Where there any other inconsistencies that people noticed?

For example, though it may not be everyone's preference, going with option 2 here should eliminate all inconsistencies with pointer types.

Certain other inconsistencies, like the braced-case blocks the Henry mentioned, are not only syntax formatting problems; a change to the syntax may change the semantics. A formatter cannot handle those cases and thus the coding standard will need to be more specific about the opinions there.

w.r.t. the comment formatting, clang-format has some comment formatting options and general options that also apply to comments. However, they are pretty basic.

ThanHenderson commented 3 months ago

Heh, I should mention, since I only looked at the diff from https://github.com/eclipse/omr/pull/3390, I missed this bit in the clang format file that may have impacted how the code looks:

That shouldn't affect it since that section was separated by a ---.

dsouzai commented 3 months ago

I went through the clang-format documentation, and played around with the config a bit. You can look at the latest branch https://github.com/dsouzai/omr/tree/clangformat_20240813. I'm using

clang-format version 18.1.8

as it is the latest stable release.

An important fact I came across from reading online is it is important everyone use the same version of clang-format. Apparently, the output of newer versions of clang-format will not be a strict superset of the output of older versions. Therefore, whatever version of clang-format is chosen will need to be the one we use for good.

I had to do

wget https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
tar -xvf clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
sudo apt-get install libtinfo5

to run clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04/bin/clang-format.

The diff of the clang-format file is now

1,2c1
< ---
< # BasedOnStyle:  WebKit
---
> BasedOnStyle:  WebKit
4a4
> AlignArrayOfStructures: Right
5a6
> AlignConsecutiveBitFields: false
7c8,9
< AlignEscapedNewlines: Right
---
> AlignConsecutiveMacros: false
> AlignEscapedNewlines: Left
9a12
> AllowAllArgumentsOnNextLine: false
12a16
> AllowShortEnumsOnASingleLine: false
14a19
> AllowShortLambdasOnASingleLine: true
19c24
< AlwaysBreakTemplateDeclarations: No
---
> AlwaysBreakTemplateDeclarations: MultiLine
22c27,29
< BraceWrapping:
---
> BitFieldColonSpacing: After
> BraceWrapping:
>   AfterCaseLabel:  false
24c31
<   AfterControlStatement: false
---
>   AfterControlStatement: Never
32a40,41
>   BeforeLambdaBody: false
>   BeforeWhile:     false
34,36c43,48
<   SplitEmptyFunction: true
<   SplitEmptyRecord: true
<   SplitEmptyNamespace: true
---
>   SplitEmptyFunction: false
>   SplitEmptyRecord: false
>   SplitEmptyNamespace: false
> BreakAdjacentStringLiterals: true
> BreakAfterAttributes: Never
> BreakArrays: false
38,39c50,51
< BreakBeforeBraces: WebKit
< BreakBeforeInheritanceComma: false
---
> BreakBeforeBraces: Custom
> BreakBeforeInlineASMColon: Never
41d52
< BreakConstructorInitializersBeforeComma: false
43c54
< BreakAfterJavaFieldAnnotations: false
---
> BreakInheritanceList: BeforeComma
45,48c56,58
< ColumnLimit:     0
< CommentPragmas:  '^ IWYU pragma:'
< CompactNamespaces: false
< ConstructorInitializerAllOnOneLineOrOnePerLine: false
---
> ColumnLimit:     120
> CommentPragmas:  '^ IWYU pragma:|SPDX-License-Identifier:'
> CompactNamespaces: true
53a64,65
> EmptyLineAfterAccessModifier: Never
> EmptyLineBeforeAccessModifier: LogicalBlock
55,56c67,68
< FixNamespaceComments: false
< ForEachMacros:
---
> FixNamespaceComments: true
> ForEachMacros:
59a72
> IncludeBlocks: Preserve
73c86,91
< IndentCaseLabels: false
---
> IndentAccessModifiers: false
> IndentCaseBlocks: false
> IndentCaseLabels: true
> IndentExternBlock: NoIndent
> IndentGotoLabels: false
> IndentPPDirectives: None
76,78c94,99
< JavaScriptQuotes: Leave
< JavaScriptWrapImports: true
< KeepEmptyLinesAtTheStartOfBlocks: true
---
> InsertBraces: false
> InsertNewlineAtEOF: true
> KeepEmptyLinesAtEOF: true
> KeepEmptyLinesAtTheStartOfBlocks: false
> Language: Cpp
> LineEnding: LF
83,85c104,105
< ObjCBlockIndentWidth: 4
< ObjCSpaceAfterProperty: true
< ObjCSpaceBeforeProtocolList: true
---
> PPIndentWidth: 0
> PackConstructorInitializers: Never
93,95c113,121
< PointerAlignment: Left
< ReflowComments:  true
< SortIncludes:    true
---
> PointerAlignment: Right
> QualifierAlignment: Leave
> ReferenceAlignment: Right
> ReflowComments: true
> RemoveBracesLLVM: false
> RemoveParentheses: Leave
> RemoveSemicolon: false
> SeparateDefinitionBlocks: Always
> SortIncludes:    false
97a124
> SpaceAfterLogicalNot: false
98a126
> SpaceAroundPointerQualifiers: Before
99a128
> SpaceBeforeCaseColon: false
100a130,131
> SpaceBeforeCtorInitializerColon: true
> SpaceBeforeInheritanceColon: true
101a133,134
> SpaceBeforeSquareBrackets: false
> SpaceInEmptyBlock: false
106a140,142
> SpacesInLineCommentPrefix:
>   Minimum: 1
>   Maximum: -1
109c145
< Standard:        Cpp11
---
> Standard:        c++03
112,115d147
< ---
< Language: ObjC
< PointerAlignment: Right
< ...
dsouzai commented 3 months ago
  1. The formatting doesn't make everything consistent. Can we apply custom rules based on the minor things that we find? For example, there is code like this:
    TR::Block* block = NULL;
    TR::Block *firstColdBlock = NULL, *firstColdExtendedBlock = NULL;

where the pointer definitions differ in the placement of the *. There will likely be a number of other inconsistencies that we encounter as we go through the code base.

After looking at the docs, as @ThanHenderson said, using

PointerAlignment: Right

is the easiest solution for this.

dsouzai commented 3 months ago

2. Can anything be done about the Doxygen comments to make them more consistently formatted? The project should pick a style and apply it throughout (e.g., @brief vs \brief, and the formatting of text within the comment)

It doesn't look like clang-format supports anything like that. However, doing

$ find ./compiler -iname '*.h' -o -iname '*.hpp' -o -iname '*.cpp' | xargs perl -ne 'print "$1\n" if /(\\\S+)/g' | grep -v '\\\(n\|t\|x\|"\|/\|0\|_\||\|\\\|#\|@\)' | sort -u | uniq
\TODO:
\a
\attention
\b
\brief
\c
\class
\code
\def
\deprecated
\details
\em
\endcode
\endverbatim
\example
\file
\ingroup
\p
\page
\par
\param
\param[in,out]
\param[in]
\param[out]
\parm
\pre
\ref
\return
\returns
\section
\see
\structure
\varbatim
\verbatim
\warning

and

$ find ./compiler -iname '*.h' -o -iname '*.hpp' -o -iname '*.cpp' | xargs perl -ne 'print "$1\n" if /(@\S+)/g' | grep -v '@\(@\|%d\|%p\|[[:digit:]]\)' | sort -u | uniq
@</tt>
@ASSERT
@ForceInline",
@Stable",
@TODO:
@arm.com/
@brief
@brief:
@c
@class
@code
@ddr_namespace:
@deprecated
@details
@endcode
@endverbatim
@exit
@file
@id
@name
@name:
@note
@note:
@p
@param
@param[In]
@param[Out]
@param[in,out]
@param[in/out]
@param[in]
@param[in]:
@param[inout]
@param[out]
@param[rn]
@parm
@ref
@return
@return:
@returns
@sa
@see
@struct
@todo
@tparam
@tparm
@tt:
@verbatim
@{
@}

shows a handful of doxygen keywords in the compiler dir. It should be fairly straightforward to do a search and replace of either the \<keyword> or @<keyword>

dsouzai commented 3 months ago

3. The formatting of comments in general seems inconsistent (at least in terms of line length). Was that intentional, and can it be changed? If it could be changed, we will have to be a lot more diligent reviewing the changes to be sure that comments that shouldn't wrap don't to keep the formatting correct.

I believe even the original clang-format config file enabled reflowing comments. According to the doc, the reflow respects the columnLimit, which is set to 120. As such, the comments should be consistent at least wrt to the column limit. Now, we could lower the columnLimit to say 80 or something like that, but I was just using what was in https://github.com/eclipse/omr/pull/3390.

Do you have an example of the inconsistency?

dsouzai commented 3 months ago

Similarly, I noticed that parentheses and braces were preserved as is. Existing code is inconsistent in the use of braces around single statements in if/else, and in the cases of switch. I think it would be good if consistency was enforced.

Since clang-format v15, you can specify InsertBraces:

Insert braces after control statements (if, else, for, do, and while) in C++ unless the control statements are inside macro definitions or the braces would enclose preprocessor directives.

However, it comes with the warning

Setting this option to true could lead to incorrect code formatting due to clang-format’s lack of complete semantic information. As such, extra care should be taken to review code changes made by this option.

So it's not something that can be done when applying to a big chunk of code as we're proposing to do.

dsouzai commented 3 months ago

Even after the auto formatting has been applied there appear to be examples where the formatting is inconsistent for different parts of the language syntax or even the comment style. @dsouzai : based on your experience with this tool, do you think it is feasible/reasonable for all of these formatting inconsistencies to be smoothed out in the initial, single commit?

After looking at the docs and online, all the inconsistencies brought up in this issue thus far can't be addressed in one go; it'll have to be something that we clean up as we come across them. Really, only the pointer alignment issue can be addressed.

ThanHenderson commented 3 months ago

However, it comes with the warning Setting this option to true could lead to incorrect code formatting due to clang-format’s lack of complete semantic information. As such, extra care should be taken to review code changes made by this option. So it's not something that can be done when applying to a big chunk of code as we're proposing to do.

InsertBraces was pulled from clang-tidy; we could explore clang-tidy too for things that involve semantics and not just white space.

dylanjtuttle commented 3 weeks ago

Prompted by a discussion with @hzongaro on a PR I opened yesterday, Would it be possible to add a rule which enforces comments documenting conditional directives as outlined in the OMR C and C++ Coding Standards? I know the coding standards are more of a loose guideline throughout many components of the codebase, but these comments do appear frequently in both OMR and OpenJ9, and I personally find them very useful for readability. For example:

#if defined(A) || defined(B)
...
#elif !defined(C) /* defined(A) || defined(B) */
...
#else /* !defined(C) */
...
#endif /* defined(A) || defined(B) */

I'm working on a Python script that does exactly this and so far it appears to be feasible to do programmatically. I was intending to use it to do a one time sweeping change to introduce missing comments or fix incorrect ones, but maintaining these comments would be sort of inconvenient without a formatting tool like the one we're planning on using here. On the other hand, from some cursory research, it doesn't really seem like the kind of thing that clang-format does. Perhaps those who are more familiar with clang-format know whether this sort of thing is possible.

ThanHenderson commented 3 weeks ago

I think it's a good idea in principle. I'm not familiar with a way to do this via clang-format, and I don't imagine there would be one. A nice stack-based approach for adding in the comments may suffice for the initial pass, and taking care to fix anything suspicious.

Once that initial pass is done, for a CI solution we could scan for commentless #elif, #else, and #endif directives and emit a warning if they are bare, or just run the script to fix it up automatically. The problem with the latter I guess is if we want to allow exceptions to this rule (though I can't think of reasons as to why we would want to allow that).

dsouzai commented 3 weeks ago

When I had done my initial research online, it didn't look like clang-format could do anything about ensuring a comment on the #endif et al., which I thought was odd because it can add comments at the end of a namespace; I guess it doesn't process preprocessor directives beyond just expansion.

A script as part of the CI to check that all modified files in a PR do not have a commentless $endif et al. is probably reasonable to do; it'd just be another linter job.