Closed JoshInnis closed 12 months ago
The first thing we will need to do is to try and reproduce the error. It should be noted that we have not seen this error reported before, so we will need to find a scenario where we can repeat it.
The comm package is used in HammerDB in 2 scenarios.
An example is as follows:
hammerdb>switchmode primary
Switch from Local
to Primary mode?
Enter yes or no: replied yes
Setting Primary Mode at id : 13563, hostname : CRANE
Primary Mode active at id : 13563, hostname : CRANE
hammerdb>puts [ comm channels ]
::comm::comm ::Primary
hammerdb>Received a new replica connection from host 127.0.0.1
New replica joined : {13966 CRANE}
hammerdb>puts [ comm channels ]
::comm::comm ::Primary
hammerdb>switchmode replica 13563 CRANE
Switch from Local
to Replica mode?
Enter yes or no: replied yes
Setting Replica Mode at id : 13966, hostname : CRANE
Replica connecting to CRANE 13563 : Connection succeeded
Primary call back successful
hammerdb>puts [comm channels ]
::comm::comm ::Replica
So when the comm package opens it opens a default channel called "comm::comm" it looks like "something" external is trying to open a connection to this default channel.
it may be possible to add a hook for when this occurs:
::comm::comm hook incoming {
if {[string match 127.0.0.1 $addr]} {
error "puts had a connection from $addr"
}
}
I have managed to reproduce the error and speculate that this would happen without running a workload either build or test and you would see it if you started the CLI and then left it idle for a period of time.
To reproduce I started Primary mode:
./hammerdbcli
HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (110,592 KB)
hammerdb>switchmode primary
Switch from Local
to Primary mode?
Enter yes or no: replied yes
Setting Primary Mode at id : 13684, hostname : CRANE
Primary Mode active at id : 13684, hostname : CRANE
and checked that the ID is our network port:
root@CRANE:~# netstat -p| grep 13684
root@CRANE:~# netstat -a | grep 13684
tcp 0 0 0.0.0.0:13684 0.0.0.0:* LISTEN
tcp6 0 0 [::]:13684 [::]:* LISTEN
I then used telnet to connect to the port and type some garbage:
# telnet 127.0.0.1 13684
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
sdssddsds
Connection closed by foreign host.
and my Primary connection fails with the unknown offered protocol error:
hammerdb>Received a new replica connection from host 127.0.0.1
Unmatched Background Error - Unknown offered protocols "sdssddsds" from 127.0.0.1/40204
while executing
"error "Unknown offered protocols \"$protoline\" from $addr/$remport""
(procedure "::comm::commIncoming" line 36)
invoked from within
"::comm::commIncoming ::Primary sock55dc1d4866a0 127.0.0.1 40204"
So after a restart, checking the default comm port:
root@CRANE:~# netstat -a | grep 0.0.0.0
tcp 0 0 localhost:34575 0.0.0.0:* LISTEN
and sending it some garbage:
# telnet 127.0.0.1 34575
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
dddsdfdsfdf
Connection closed by foreign host.
We now get exactly the same error that you get:
./hammerdbcli
HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (110,592 KB)
hammerdb>Unmatched Background Error - Unknown offered protocols "dddsdfdsfdf" from 127.0.0.1/51334
while executing
"error "Unknown offered protocols \"$protoline\" from $addr/$remport""
(procedure "::comm::commIncoming" line 36)
invoked from within
"::comm::commIncoming ::comm::comm sock55a92dd130b0 127.0.0.1 51334"
So the error appears to be twofold:
such a connection will now show a warning such as follows and not error: ``
HammerDB CLI v4.8
Copyright (C) 2003-2023 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (45,056 KB)
hammerdb>Warning: Connection from 127.0.0.1/49436 received on ::comm::comm with unknown protocol "ddffdfdfd"
netstat -p can be used to identify the application trying to make the connection (in my test case telnet)
I took a look at the comm-4.7 library and it seems this module's internal implementation has been modified, do you think the module should be upgraded to the newest one or is modifying the error to a warning sufficient?
We shouldn't change something unless the benefit is clear. I tested 4.7 but the error was the same so https://github.com/TPC-Council/HammerDB/pull/619 provides the best solution.
Guidance Note: The exact of the codebase we are using is the 4.8 RHEL8 release, with the modifications to MSSQL added. However, we currently believe that the bug is independent of those changes, because we are experiencing this issue with Postgres too and on codepaths that do not use the new changes.
Describe the bug There is a network error that is corrupting communication. There error is reporting an issue when talking to localhost, so we do not believe this has any issue with the server itself.
We believe the error is called in modules/comm-4.5.6.tm at line 913.
We are not using the agent
To Reproduce Steps to reproduce the behavior:
Expected behavior The load completes with no errors.
Screenshots
There are many lines of jumbled characters and then:
Scripts Used pg_build.sh
pg_tproch_buildschema.tcl is the script included with HammerDB, with altered connection information
HammerDB Version (please complete the following information):
HammerDB Interface (please complete the following information):
Operating System (please complete the following information):
Database Server (please complete the following information):
Database Client (please complete the following information):