antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

Problems with SQL reformatting #4047

Closed. dpmm99 closed this issue 6 months ago.

dpmm99 commented 7 months ago

https://github.com/antlr/grammars-v4/blob/753536777d827ccc0c9b108531ea67375c2039ac/sql/tsql/TSqlLexer.g4#L1292

It looks like some information was lost because of the automatic formatting. For example, at the end of this file:

    | '\uff00'..'\ufff0'
    // | '\u10000'..'\u1F9FF'  //not support four bytes chars
    // | '\u20000'..'\u2FA1F'
    ;

turned into

    | '\uff00' ..'\ufff0'
    ; // | '\u20000'..'\u2FA1F'

Someone might want to compare that commit with its parent, with all whitespace stripped, to find other cases of lost comments, and then fix the tool so it doesn't remove comments.
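
A minimal sketch of that whitespace-stripped comparison. The helper name `same_ignoring_whitespace` is hypothetical and is not part of the repository or of any formatter; it only reports whether two versions of a file differ in anything other than whitespace, not where they differ.

```shell
# Hypothetical helper: succeeds (exit status 0) when the two files are
# identical once every whitespace character has been removed.
same_ignoring_whitespace() {
    old_text=$(tr -d '[:space:]' < "$1")
    new_text=$(tr -d '[:space:]' < "$2")
    [ "$old_text" = "$new_text" ]
}
```

Assuming the two versions were first extracted with something like `git show "$commit^:sql/tsql/TSqlLexer.g4" > old.g4` and `git show "$commit:sql/tsql/TSqlLexer.g4" > new.g4`, a call such as `same_ignoring_whitespace old.g4 new.g4 || echo "content changed"` would flag the file. Note that this also ignores whitespace inside string literals, so it is only a first-pass filter.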

To begin with, automatic formatting may not have been the best idea, considering that some layouts were deliberate, like this bit from the T-SQL parser grammar:

    dateparts_12
        : dateparts_9
        | DAYOFYEAR | DAYOFYEAR_ABBR
        | MICROSECOND | MICROSECOND_ABBR
        | NANOSECOND | NANOSECOND_ABBR

kaby76 commented 7 months ago

I wrote a Trash script to compare the parse trees of the grammars before and after reformatting (minus all intertoken attributes). There are no grammar differences. So, that much is fine.

#!/usr/bin/env bash
# Compare the grammar parse trees before and after the reformatting commit.
set -x
set -e
before=6e78e1872264ca4b78bb89243ad005db102cf3c9
after=753536777d827ccc0c9b108531ea67375c2039ac
prefix=$(pwd)

# Dump the parse tree of every grammar at the "before" commit.
git checkout "$before"
directories=$(find . -name desc.xml | sed 's#/desc.xml##' | sort -u)
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    # Delete all attributes (the intertoken text) so only the tree structure is compared.
    trparse -t ANTLRv4 *.g4 | trdelete ' //@*' | trtree > before.txt
    popd > /dev/null 2>&1
done

# Do the same at the "after" commit.
git checkout "$after"
directories=$(find . -name desc.xml | sed 's#/desc.xml##' | sort -u)
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    trparse -t ANTLRv4 *.g4 | trdelete ' //@*' | trtree > after.txt
    popd > /dev/null 2>&1
done

# Diff the two dumps per grammar; "|| true" keeps set -e from aborting on a difference.
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    diff before.txt after.txt || true
    popd > /dev/null 2>&1
done

I don't have a script to check comments yet, but it looks like the reformatter was not supposed to reflow comments. That means I can "grep" the comments before and after reformatting and compare the two lists to see what is missing.
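
The grep-and-compare idea can be sketched without any parser, assuming a plain regex is good enough for a first pass. The helper names `line_comments` and `missing_comments` are hypothetical. This is a crude approximation: it only handles `//` comments, and it cannot tell a real comment from `//` inside a string literal.

```shell
# Hypothetical helper: print every "//..." comment in a file, one per line,
# with trailing whitespace (including carriage returns) removed, sorted.
line_comments() {
    grep -o '//.*' "$1" | sed 's/[[:space:]]*$//' | LC_ALL=C sort
}

# Hypothetical helper: print the comments found in the first file
# that no longer appear in the second file.
missing_comments() {
    before_list=$(mktemp)
    after_list=$(mktemp)
    line_comments "$1" > "$before_list"
    line_comments "$2" > "$after_list"
    # comm -23 keeps only the lines unique to the first sorted list.
    LC_ALL=C comm -23 "$before_list" "$after_list"
    rm -f "$before_list" "$after_list"
}
```

Running `missing_comments old.g4 new.g4` on the two versions of a grammar would list the comments dropped by the reformat. A comment that merely moved onto another line is not reported, because only the comment text is compared.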

kaby76 commented 7 months ago

I had to change the Trash parse tool to create attributes named after the token type (https://github.com/kaby76/Domemtech.Trash/issues/434). ANTLR4 grammars have three kinds of comments, so the trxgrep expression looks for DOC_COMMENT, BLOCK_COMMENT, and LINE_COMMENT. After grepping for comments, I removed the lines containing antlr-format, as these were added by the reformatter.

#!/usr/bin/env bash
# Compare the comments in each grammar before and after the reformatting commit.
# set -x
# set -e
before=6e78e1872264ca4b78bb89243ad005db102cf3c9
after=753536777d827ccc0c9b108531ea67375c2039ac
prefix=$(pwd)

# Collect the comments of every grammar at the "before" commit.
git checkout "$before"
directories=$(find . -name desc.xml | sed 's#/desc.xml##' | sort -u)
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    # Select the three comment token types; drop the antlr-format lines added by the reformatter.
    trparse -t ANTLRv4 *.g4 | trxgrep --no-prs ' //(@DOC_COMMENT | @BLOCK_COMMENT | @LINE_COMMENT)' | grep -v antlr-format > before.txt
    dos2unix before.txt
    popd > /dev/null 2>&1
done

# Do the same at the "after" commit.
git checkout "$after"
directories=$(find . -name desc.xml | sed 's#/desc.xml##' | sort -u)
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    trparse -t ANTLRv4 *.g4 | trxgrep --no-prs ' //(@DOC_COMMENT | @BLOCK_COMMENT | @LINE_COMMENT)' | grep -v antlr-format > after.txt
    dos2unix after.txt
    popd > /dev/null 2>&1
done

# Report every grammar whose comments changed.
for g in $directories
do
    echo "$g"
    pushd "$g" > /dev/null 2>&1
    g=$(pwd)
    g=${g##*$prefix/}
    if ! diff before.txt after.txt
    then
        echo "$g has diffs."
    fi
    popd > /dev/null 2>&1
done

Indeed, the comparison shows a number of comment differences introduced by the formatter. These grammars will all need to be fixed:

haskell
sql/derby
sql/tsql