cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

Test failures due to floating point precision issues on ppc64le #72226

Closed amitsadaphule closed 2 years ago

amitsadaphule commented 3 years ago

Describe the problem

I built the cockroachdb v20.2.17 on ppc64le and when I executed the test suite, I found that there were test failures for the following packages:

github.com/cockroachdb/cockroach/pkg/geo
github.com/cockroachdb/cockroach/pkg/geo/geogfn
github.com/cockroachdb/cockroach/pkg/geo/geographiclib
github.com/cockroachdb/cockroach/pkg/geo/geoindex
github.com/cockroachdb/cockroach/pkg/sql/sem/tree
github.com/cockroachdb/cockroach/pkg/sql/sem/tree/eval_test

When I analysed the logs FP-precision-failures.zip, I found that the failing tests were due to floating point precision issues. A sample log of test failure for geogfn package pasted below (Please refer full package test logs in attached zip for more details):

--- FAIL: TestAzimuth/north_east (0.00s)
        azimuth_test.go:70: 
                Error Trace:    azimuth_test.go:70
                Error:          Not equal: 
                                expected: 0.7886800845259658
                                actual  : 0.788680084525966
                Test:           TestAzimuth/north_east

--- FAIL: TestSegmentize/LINESTRING(1.0_1.0,_2.0_2.0,_3.0_3.0),_maximum_segment_length:_100000.000000 (0.00s)
    segmentize_test.go:118: 
            Error Trace:    segmentize_test.go:118
            Error:          Not equal: 
                            expected: geo.Geography{spatialObject:geopb.SpatialObject{Type:1, EWKB:geopb.EWKB{0x1, 0x2, 0x0, 0x0, 0x20, 0xe6, 0x10, 0x0, 0x0, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0xed, 0x4e, 0xa4, 0x2f, 0x88, 0xff, 0xf7, 0x3f, 0x97, 0x93, 0x60, 0xdd, 0x3b, 0x0, 0xf8, 0x3f, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0x9d, 0xc8, 0x4f, 0x1d, 0x9c, 0xff, 0x3, 0x40, 0x91, 0xbb, 0xc5, 0xd8, 0x31, 0x0, 0x4, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0x40}, SRID:4326, ShapeType:2, BoundingBox:(*geopb.BoundingBox)(0xc000270560)}}
                            actual  : geo.Geography{spatialObject:geopb.SpatialObject{Type:1, EWKB:geopb.EWKB{0x1, 0x2, 0x0, 0x0, 0x20, 0xe6, 0x10, 0x0, 0x0, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0xed, 0x4e, 0xa4, 0x2f, 0x88, 0xff, 0xf7, 0x3f, 0x96, 0x93, 0x60, 0xdd, 0x3b, 0x0, 0xf8, 0x3f, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0x9c, 0xc8, 0x4f, 0x1d, 0x9c, 0xff, 0x3, 0x40, 0x91, 0xbb, 0xc5, 0xd8, 0x31, 0x0, 0x4, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0x40}, SRID:4326, ShapeType:2, BoundingBox:(*geopb.BoundingBox)(0xc000270540)}}

                            Diff:
                            --- Expected
                            +++ Actual
                            @@ -6,4 +6,4 @@
                                00000010  00 00 00 f0 3f 00 00 00  00 00 00 f0 3f ed 4e a4  |....?.......?.N.|
                            -   00000020  2f 88 ff f7 3f 97 93 60  dd 3b 00 f8 3f 00 00 00  |/...?..`.;..?...|
                            -   00000030  00 00 00 00 40 00 00 00  00 00 00 00 40 9d c8 4f  |....@.......@..O|
                            +   00000020  2f 88 ff f7 3f 96 93 60  dd 3b 00 f8 3f 00 00 00  |/...?..`.;..?...|
                            +   00000030  00 00 00 00 40 00 00 00  00 00 00 00 40 9c c8 4f  |....@.......@..O|
                                00000040  1d 9c ff 03 40 91 bb c5  d8 31 00 04 40 00 00 00  |....@....1..@...|
            Test:           TestSegmentize/LINESTRING(1.0_1.0,_2.0_2.0,_3.0_3.0),_maximum_segment_length:_100000.000000

To Reproduce

Here are the steps to reproduce the issue:

#!/bin/bash

CWD=`pwd`

# Install all dependencies
dnf -y --disableplugin=subscription-manager install \
        http://mirror.centos.org/centos/8/BaseOS/ppc64le/os/Packages/centos-gpg-keys-8-2.el8.noarch.rpm \
        http://mirror.centos.org/centos/8/BaseOS/ppc64le/os/Packages/centos-linux-repos-8-2.el8.noarch.rpm \
        https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

yum install -y git cmake make gcc-c++ autoconf ncurses-devel.ppc64le wget.ppc64le openssl-devel.ppc64le diffutils procps-ng

cd $HOME

set -eu

# Install nodejs
NODE_VERSION=v12.18.2
DISTRO=linux-ppc64le
wget https://nodejs.org/dist/$NODE_VERSION/node-$NODE_VERSION-$DISTRO.tar.gz
tar -xzf node-$NODE_VERSION-$DISTRO.tar.gz
export PATH=$HOME/node-$NODE_VERSION-$DISTRO/bin:$PATH

npm install yarn --global

cd $HOME

# Setup go environment and install go
GOPATH=$HOME/go
COCKROACH_HOME=$GOPATH/src/github.com/cockroachdb
mkdir -p $COCKROACH_HOME
export GOPATH
curl -O https://dl.google.com/go/go1.13.5.linux-ppc64le.tar.gz
tar -C /usr/local -xzf go1.13.5.linux-ppc64le.tar.gz
rm -rf go1.13.5.linux-ppc64le.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Clone cockroach and build
COCKROACH_VERSION=v20.2.17
cd $COCKROACH_HOME
git clone https://github.com/cockroachdb/cockroach.git
cd cockroach
git checkout $COCKROACH_VERSION
make buildoss

export GOMAXPROCS=4
make test PKG=./pkg/geo TESTFLAGS='-v -count=1'
make test PKG=./pkg/geo/geogfn TESTFLAGS='-v -count=1'
make test PKG=./pkg/geo/geographiclib TESTFLAGS='-v -count=1'
make test PKG=./pkg/geo/geoindex TESTFLAGS='-v -count=1'
make test PKG=./pkg/sql/sem/tree TESTFLAGS='-v -count=1'
make test PKG=./pkg/sql/sem/tree/eval_test TESTFLAGS='-v -count=1'

Environment:

CockroachDB [v20.2.17] Architecture and OS: [ppc64le/RHEL/UBI 8.4] Go: [1.13.5] Node: [12.18.2]

Any pointers to resolve the failures would be of great help.

Jira issue: CRDB-11048

blathers-crl[bot] commented 3 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl[bot] commented 3 years ago

cc @cockroachdb/bulk-io

knz commented 3 years ago

@amitsadaphule this is the result of a different floating-point behavior on PPC.

As explained in #72227, CockroachDB's team does not support alternative build platforms, so we're likely not going to work actively to resolve this particular problem. However, on the other hand we have many users that report that cockroachDB works on their platform, even if not officially supported.

Note that the difference in behavior wrt floating-point computations means that SQL queries are likely to return slightly differnt results than on other platforms. However, from experience the difference is often very small and depending on the application might be negligible.

Also, in the future please do not file common issues for multiple go packages. We would prefer different issues filed for different areas. Thank you.

seth-priya commented 3 years ago

Thanks @knz for your response again and point taken.

Just a quick confirmation/clarification on your statement above "this is the result of a different floating-point behavior on PPC."

I saw probably similar references at a couple of additional places today, basically https://github.com/gohugoio/hugo/issues/6387#issuecomment-538559712 https://github.com/disintegration/gift/issues/20#issuecomment-524551797

which essentially say that The Go compiler now implements a fused multiply and add (FMA) instruction on some architectures, which leads to floating-point rounding differences compared to amd64. is this what you are pointing to as well?

knz commented 3 years ago

I wasn't aware about the FMA behavior in the go compiler, but yet it could contribute to the situation as well.

Note that even without FMA, there are subtle differences in the ordering of FP operations inside the micro-architecture that can cause rounding differences.

prashantkhoje commented 2 years ago

The geo failures are fixed by https://github.com/cockroachdb/cockroach/pull/81734.

The geogfn and geomfn failures are fixed by: https://github.com/cockroachdb/cockroach/pull/81894 https://github.com/cockroachdb/cockroach/pull/82250

One of the geoindex failure is fixed by https://github.com/cockroachdb/cockroach/pull/82456.

Following failures are not reproducible anymore:

prashantkhoje commented 2 years ago

Hello @otan, How are the fixes in vendored code handled? I've a solution for test failures in pkg/geo/geomfn. It'll be submitted to the original source repository. However, it'll take some time to get merged and then be synchronized in vendored repo. Is it allowed to commit changes in vendored repo directly?

Thank you, Prashant

otan commented 2 years ago

if you let me know which change i can let you know further

prashantkhoje commented 2 years ago

I need to force floating-point type conversion in twpayne/go-geom/xy/lineintersector/nonrobust_line_intersector.go to avoid FMA on ppc64le.

prashantkhoje commented 2 years ago

The go-geom v1.4.2 has the required fix in nonrobust_line_intersector.go. The PR #214 resolves geomfn/TestShortestLineString test failure on ppc64le. The PR #213 fixes test failures in go-geom.

otan commented 2 years ago

gotcha, sorry just came back from holiday. updating vendor here: https://github.com/cockroachdb/cockroach/pull/86126