TPC-Council / HammerDB

HammerDB Database Load Testing and Benchmarking Tool
http://www.hammerdb.com
GNU General Public License v3.0
589 stars 119 forks source link

Network Error When Loading Data #618

Closed JoshInnis closed 12 months ago

JoshInnis commented 1 year ago

Guidance Note: The exact of the codebase we are using is the 4.8 RHEL8 release, with the modifications to MSSQL added. However, we currently believe that the bug is independent of those changes, because we are experiencing this issue with Postgres too and on codepaths that do not use the new changes.

Describe the bug There is a network error that is corrupting communication. There error is reporting an issue when talking to localhost, so we do not believe this has any issue with the server itself.

We believe the error is called in modules/comm-4.5.6.tm at line 913.

We are not using the agent

To Reproduce Steps to reproduce the behavior:

  1. Setup HammerDB (ours is setup in an Azure VM with RedHat 8)
  2. Run a load (the larger the load the higher likelihood the error occurs.) We have tested with Postgres and MSSQL Server (both with and without BCP enabled)

Expected behavior The load completes with no errors.

Screenshots image

There are many lines of jumbled characters and then:

image

Scripts Used pg_build.sh

#!/bin/bash
export USERS=32
export SCALE=300
export DB=tpch_300_v32

date >& pg_tproch_build_sql_${SCALE}_${USERS}_${DB}.out
./hammerdbcli auto ./pg_tproch_buildschema.tcl >> pg_tproch_build_sql_${SCALE}_${USERS}_${DB}.out
date >> pg_tproch_build_sql_${_16vuSCALE}_${USERS}_${DB}.out

pg_tproch_buildschema.tcl is the script included with HammerDB, with altered connection information

HammerDB Version (please complete the following information):

HammerDB Interface (please complete the following information):

Operating System (please complete the following information):

Database Server (please complete the following information):

Database Client (please complete the following information):

sm-shaw commented 12 months ago

The first thing we will need to do is to try and reproduce the error. It should be noted that we have not seen this error reported before, so we will need to find a scenario where we can repeat it.

The comm package is used in HammerDB in 2 scenarios.

  1. Remote CPU metrics to send CPU data on a remote server to HammerDB.
  2. Primary/Replica communication where a Primary instance of HammerDB is used to control 1 or more replicas. (Often used for testing clusters with synchronized clients pointing at different clusters nodes)

An example is as follows:

hammerdb>switchmode primary
Switch from Local
to Primary mode?
Enter yes or no: replied yes
Setting Primary Mode at id : 13563, hostname : CRANE
Primary Mode active at id : 13563, hostname : CRANE

hammerdb>puts [ comm channels ]
::comm::comm ::Primary

hammerdb>Received a new replica connection from host 127.0.0.1
New replica joined : {13966 CRANE}

hammerdb>puts [ comm channels ]
::comm::comm ::Primary

hammerdb>switchmode replica 13563 CRANE
Switch from Local
to Replica mode?
Enter yes or no: replied yes
Setting Replica Mode at id : 13966, hostname : CRANE
Replica connecting to CRANE 13563 : Connection succeeded
Primary call back successful

hammerdb>puts [comm channels ]
::comm::comm ::Replica

So when the comm package opens it opens a default channel called "comm::comm" it looks like "something" external is trying to open a connection to this default channel.

it may be possible to add a hook for when this occurs:

 ::comm::comm hook incoming {
        if {[string match 127.0.0.1 $addr]} {
            error "puts had a connection from $addr"
        }
    }
sm-shaw commented 12 months ago

I have managed to reproduce the error and speculate that this would happen without running a workload either build or test and you would see it if you started the CLI and then left it idle for a period of time.

To reproduce I started Primary mode:

./hammerdbcli
HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (110,592 KB)
hammerdb>switchmode primary
Switch from Local
to Primary mode?
Enter yes or no: replied yes
Setting Primary Mode at id : 13684, hostname : CRANE
Primary Mode active at id : 13684, hostname : CRANE

and checked that the ID is our network port:

root@CRANE:~# netstat -p| grep 13684
root@CRANE:~# netstat -a | grep 13684
tcp        0      0 0.0.0.0:13684 0.0.0.0:*               LISTEN
tcp6       0      0 [::]:13684 [::]:*                  LISTEN

I then used telnet to connect to the port and type some garbage:

# telnet 127.0.0.1 13684
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
sdssddsds
Connection closed by foreign host.

and my Primary connection fails with the unknown offered protocol error:

hammerdb>Received a new replica connection from host 127.0.0.1
Unmatched Background Error - Unknown offered protocols "sdssddsds" from 127.0.0.1/40204
    while executing
"error "Unknown offered protocols \"$protoline\" from $addr/$remport""
    (procedure "::comm::commIncoming" line 36)
    invoked from within
"::comm::commIncoming ::Primary sock55dc1d4866a0 127.0.0.1 40204"

So after a restart, checking the default comm port:

root@CRANE:~# netstat -a | grep 0.0.0.0
tcp        0      0 localhost:34575 0.0.0.0:*               LISTEN

and sending it some garbage:

# telnet 127.0.0.1 34575
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
dddsdfdsfdf
Connection closed by foreign host.

We now get exactly the same error that you get:

./hammerdbcli
HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (110,592 KB)
hammerdb>Unmatched Background Error - Unknown offered protocols "dddsdfdsfdf" from 127.0.0.1/51334
    while executing
"error "Unknown offered protocols \"$protoline\" from $addr/$remport""
    (procedure "::comm::commIncoming" line 36)
    invoked from within
"::comm::commIncoming ::comm::comm sock55a92dd130b0 127.0.0.1 51334"

So the error appears to be twofold:

  1. in your environment, you have something that is trying to connect with random ports, so this should be investigated, maybe using the external port numbers above to see what this is and rectify this behaviour.
  2. HammerDB shouldn't error if an unwanted connection is made - this is an issue with the comm module. So I will look at improving this. We are using version 4.5.6 and there is a newer version 4.7 - so I will see if the behaviour has changed with the upgraded comm release and or add a hook to the default port to only report a warning that something tried to connect.
sm-shaw commented 12 months ago

619 turns this error into a warning, tested on GUI and CLI and also tested CPU metrics and primary/replica modes to ensure not impacted by change.

such a connection will now show a warning such as follows and not error: ``

HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (45,056 KB)
hammerdb>Warning: Connection from 127.0.0.1/49436 received on ::comm::comm with unknown protocol "ddffdfdfd"

netstat -p can be used to identify the application trying to make the connection (in my test case telnet)

JoshInnis commented 12 months ago

I took a look at the comm-4.7 library and it seems this module's internal implementation has been modified, do you think the module should be upgraded to the newest one or is modifying the error to a warning sufficient?

sm-shaw commented 12 months ago

We shouldn't change something unless the benefit is clear. I tested 4.7 but the error was the same so https://github.com/TPC-Council/HammerDB/pull/619 provides the best solution.