TPC-Council / HammerDB

HammerDB Database Load Testing and Benchmarking Tool
http://www.hammerdb.com
GNU General Public License v3.0

Thread "tid0x7f533bd15740" does not exist error #697

Closed daga1968 closed 2 months ago

daga1968 commented 2 months ago

OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
Mem: 24 GB
DB2: v11.5.9.0

HammerDB app: v4.8, installed on Ubuntu xxx
Ubuntu is installed on Windows 11 / WSL
No graphical interface, text only
This config has been used successfully on these Db2 servers: DB2 v11.1 AIX, DB2 v11.5.8 Linux

Build step:

dbset db db2

diset connection db2_def_user {db2wasd}
diset connection db2_def_pass {*****}
diset connection db2_def_dbase {HMLWD115}

diset tpcc db2_user {db2wasd}
diset tpcc db2_pass {*****}
diset tpcc db2_dbase {HMLWD115}

diset tpcc db2_count_ware {20}
diset tpcc db2_num_vu {5}

buildschema
waittocomplete
quit

The build step ran successfully. The full log is attached; here are its final lines:

..
Vuser 1:Statistics Complete
Vuser 1:DB2WASD SCHEMA COMPLETE
Vuser 1:FINISHED SUCCESS
ALL VIRTUAL USERS COMPLETE
waittocomplete command has been deprecated and is not required for version v4.8
Shutting down HammerDB CLI
..

Now the test step, which is where it fails:

test script:

dbset db db2
diset connection db2_def_user {db2wasd}
diset connection db2_def_pass {*****}
diset connection db2_def_dbase {HMLWD115}

diset tpcc db2_user {db2wasd}
diset tpcc db2_pass {*****}
diset tpcc db2_dbase {HMLWD115}

diset tpcc db2_count_ware {20}
diset tpcc db2_num_vu {5}
diset tpcc db2_driver {timed}
diset tpcc db2_allwarehouse {true}
diset tpcc db2_timeprofile {true}

diset tpcc db2_rampup {2}
diset tpcc db2_duration {5}

vuset logtotemp 1
vuset unique 1
tcset logtotemp 1
tcset unique 0
tcset timestamps 1
loadscript
vuset vu 100
vuset logtotemp 1
vucreate
tcstart
tcstatus
vurun
runtimer 2000
vudestroy
tcstop
quit

Output: (full output is attached due to size)

runtimer command has been deprecated and is not required for version v4.8
6 Db2 tpm
6 Db2 tpm
Virtual Users remain running in background or shutting down, retry
Transaction Counter thread running with threadid:tid0x7f5185ffb700
Stopping Transaction Counter
Closed Transaction Counter Log
Shutting down HammerDB CLI
Error from thread tid0x7f5185ffb700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 { post_kill_transcount_cleanup }"
    ("eval" body line 1)
    invoked from within
"eval [ subst {thread::send -async $MASTER { post_kill_transcount_cleanup }} ]"
    (procedure "read_more" line 115)
    invoked from within
"read_more tid0x7f533bd15740 db2tcl db2wasd ***** HMLWD115 db2inst1 ibmdb2 tpch 10 0 tce TPC-C "
Error from thread tid0x7f51867fc700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 {::myerrorproc tid0x7f51867fc700 {}}"
    ("eval" body line 1)
    invoked from within
"eval [subst {thread::send -async $MASTER {::myerrorproc [list $ID $result]}}]"
    (procedure "runVuser" line 5)
    invoked from within
"runVuser tid0x7f533bd15740 tid0x7f51867fc700 1 500 {#!/usr/local/bin/tclsh8.6
EDITABLE OPTIONS
set ..."
Error from thread tid0x7f5186ffd700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 {::myerrorproc tid0x7f5186ffd700 {[IBM][CLI Driver] SQL30081N A communication error has been detected. Communic..."
    ("eval" body line 1)
    invoked from within
"eval [subst {thread::send -async $MASTER {::myerrorproc [list $ID $result]}}]"
    (procedure "runVuser" line 5)
    invoked from within
"runVuser tid0x7f533bd15740 tid0x7f5186ffd700 1 500 {#!/usr/local/bin/tclsh8.6
EDITABLE OPTIONS
set ..."
Error from thread tid0x7f51877fe700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 {::myerrorproc tid0x7f51877fe700 {[IBM][CLI Driver] SQL30081N A communication error has been detected. Communic..."
    ("eval" body line 1)
    invoked from within
"eval [subst {thread::send -async $MASTER {::myerrorproc [list $ID $result]}}]"
    (procedure "runVuser" line 5)
    invoked from within
"runVuser tid0x7f533bd15740 tid0x7f51877fe700 1 500 {#!/usr/local/bin/tclsh8.6
EDITABLE OPTIONS
set ..."
Error from thread tid0x7f5339888700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 {::myerrorproc tid0x7f5339888700 {[IBM][CLI Driver] SQL30081N A communication error has been detected. Communic..."
    ("eval" body line 1)
    invoked from within
"eval [subst {thread::send -async $MASTER {::myerrorproc [list $ID $result]}}]"
    (procedure "runVuser" line 5)
    invoked from within
"runVuser tid0x7f533bd15740 tid0x7f5339888700 1 500 {#!/usr/local/bin/tclsh8.6
EDITABLE OPTIONS
set ..."
Error from thread tid0x7f5339087700
thread "tid0x7f533bd15740" does not exist
    while executing
"thread::send -async tid0x7f533bd15740 {::myerrorproc tid0x7f5339087700 {[IBM][CLI Driver] SQL30081N A communication error has been detected. Communic..."
    ("eval" body line 1)
    invoked from within
"eval [subst {thread::send -async $MASTER {::myerrorproc [list $ID $result]}}]"
    (procedure "runVuser" line 5)
    invoked from within
"runVuser tid0x7f533bd15740 tid0x7f5339087700 1 500 {#!/usr/local/bin/tclsh8.6
EDITABLE OPTIONS
set ..."
2024-04-24 08:56:51

launch_linux_DB2WASD_11_5.build.20240423_1720.log launch_linux_DB2WASD_11_5.HMLWD115.run.20240424_0840.log

daga1968 commented 2 months ago

Forgot to specify the Ubuntu Version: 20.04

sm-shaw commented 2 months ago

Many thanks for the issue. Have you been able to replicate it on HammerDB v4.10? One part of the issue is that the transaction counter thread cannot communicate back to the main thread. One concern is that the overall throughput of 6 Db2 tpm looks very low. If it is OK on AIX and Linux, it could be a symptom of the overall configuration running very slowly and delaying the inter-thread communication. The first thing needed is for you to replicate the issue with v4.10, and then for us to do the same, so that it indicates where the solution lies.

daga1968 commented 2 months ago

Hi,

Thanks for the reply.

I will update HammerDB to v4.10 and get back to you this week.

Regards, Dave

daga1968 commented 2 months ago

Hi,

I did the upgrade to Hammer v4.10 and got the same problem (logs attached).

The Linux DB2 server is brand new, thus it has the most recent firewall rules, so I think the problem could be right there.

Could you please tell me which TCP ports HammerDB needs open for the Transaction Counter to communicate properly, and any other ports it may need?

Thanks, Dave

daga1968 commented 2 months ago

launch_linux_DB2WASD_11_5.build.20240425_0709.log launch_linux_DB2WASD_11_5.HMLWD115.run.20240425_0733.log

sm-shaw commented 2 months ago

There are no external ports involved for the transaction counter (ports are used for the CPU agent and the web service). In this case it is the threads sending messages to each other: the transaction counter thread tid0x7f76b1ffb700 is sending a message to the main thread tid0x7f7867e0f740, but the main thread has already closed as part of the shutdown routine, so there is an indication of a CPU scheduling issue. (The fix would be to suppress the error message, however this should not happen in this order.)

The key symptom here is that the performance is exceptionally low: you are showing 0 NOPM. The first thing to do is to resolve this, always starting with a single virtual user; 100 virtual users is overconfigured when there is no throughput. (For example, 100+ VUs can easily give over a million NOPM against all the databases HammerDB supports in a well-configured server environment.) 1 VU should be in the range of tens of thousands of NOPM on any system.

Note that we haven't tested on WSL, so it is difficult to know if this is the cause. However, some of the documentation suggests there may be performance issues, e.g.

https://learn.microsoft.com/en-us/windows/wsl/tutorials/wsl-database
https://learn.microsoft.com/en-us/windows/wsl/compare-versions

Ensure that you are running your Linux distribution in WSL 2 mode. For help switching from WSL 1 to WSL 2, see [Set your distribution version to WSL 1 or WSL 2](https://learn.microsoft.com/en-us/windows/wsl/basic-commands).

As you can tell from the comparison table above, the WSL 2 architecture outperforms WSL 1 in several ways, with the exception of performance across OS file systems, which can be addressed by storing your project files on the same operating system as the tools you are running to work on the project.

Once you have a system that is giving reasonable performance, it is likely that you will not see this issue, so improving these figures is what to look at first.

Vuser 1:100 Active Virtual Users configured
Vuser 1:TEST RESULT : System achieved 0 NOPM from 10 Db2 TPM
Vuser 1:Gathering timing data from Active Virtual Users...
18 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
6 Db2 tpm
18 Db2 tpm
6 Db2 tpm
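
When comparing several runs, the result line shown above can be pulled out of each log with a small filter. This is only a sketch: it assumes the result line keeps the format shown in this thread, and the function name `extract_result` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print "<NOPM> <TPM>" for every TEST RESULT line read from stdin.
# Assumes the line format "System achieved N NOPM from M <db> TPM" seen above.
extract_result() {
    sed -n 's/.*TEST RESULT : System achieved \([0-9][0-9]*\) NOPM from \([0-9][0-9]*\) .* TPM.*/\1 \2/p'
}
```

For example, `extract_result < run.log` (with `run.log` standing in for one of the attached run logs) prints one line per completed test; the per-interval "Db2 tpm" counter lines are ignored.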

Note that we do use VirtualBox on Windows as a test environment, so this could serve as a comparison to see if performance is better.

daga1968 commented 2 months ago

Hi,

Thanks for the prompt response.

As I stated before, I have used this config of HammerDB on Ubuntu in WSL on Windows 11 for testing and it was very good; here is what we have achieved with it (attached Excel file).

I am going to try, per your advice, Warehouse 1 and Virtual User 1 to see what we can get from it, because we are quite new to the virtual CPU model of the new virtual servers...

Regards Dave

daga1968 commented 2 months ago

HammerDB Test AIX LINUX.xlsx

sm-shaw commented 2 months ago

My recommendation is to generate a performance profile across a range of virtual users to identify the peak level of performance:
https://www.hammerdb.com/docs3.3/ch03s06.html
You don't need to change the warehouse count for the number of vusers; only create enough to allow an even enough random distribution of VUs. So, for example, to test 1-10 VUs, 50 warehouses would be reasonable. In a typical server environment I create 1000 warehouses.
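
A profile like this could be scripted roughly as follows. This is a sketch under assumptions rather than a tested recipe: it reuses only CLI commands that appear earlier in this thread, leaves out the connection settings (add the same diset lines as in the build script), assumes that vurun in v4.8 and later returns only once the timed run has finished (as the runtimer deprecation message suggests), and uses made-up function and file names.

```shell
#!/usr/bin/env bash
# Sketch: build one HammerDB CLI run script per virtual-user count.
# Connection settings (db2_user, db2_pass, db2_dbase) are deliberately
# left out; add the same diset lines used in the build script.

# Print a timed-run script for the given number of virtual users.
make_run_script() {
    local vu="$1"
    cat <<EOF
dbset db db2
diset tpcc db2_driver timed
diset tpcc db2_rampup 2
diset tpcc db2_duration 5
loadscript
vuset vu ${vu}
vucreate
vurun
vudestroy
quit
EOF
}

# Run the profile: run_profile <path-to-hammerdbcli> <vu> [<vu> ...]
# The file names run_vuN.tcl and run_vuN.log are hypothetical.
run_profile() {
    local cli="$1"
    shift
    local vu
    for vu in "$@"; do
        make_run_script "$vu" > "run_vu${vu}.tcl"
        "$cli" auto "run_vu${vu}.tcl" > "run_vu${vu}.log" 2>&1
    done
}
```

Called from the HammerDB directory as `run_profile ./hammerdbcli 1 2 4 8`, this would run the same schema at each virtual-user count in turn, leaving one log per count to compare.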

daga1968 commented 2 months ago

Got it.

Before I read your comment I tried W1 Vu1 and got the same problem, so I installed HammerDB v4.10 on the Red Hat 8 DB2 Server where I am running the Test, to avoid the Windows 11 / WSL performance issue.

I will post here once Build and Test steps are done.

Regards, Dave

sm-shaw commented 2 months ago

Also try the CPU test described here: https://www.hammerdb.com/docs/ch04s01.html
So I create a file, e.g. cputest.tcl, with these contents and in the CLI do "source cputest.tcl". The timing that you get from this test gives a general indication of single-threaded performance, which helps to calibrate for any configuration issues.

proc runcalc {} {
    # Ten million fmod/sqrt iterations on a single thread.
    set n 0
    for {set f 1} {$f <= 10000000} {incr f} {
        set n [ expr {[::tcl::mathfunc::fmod $n 999999] + sqrt($f)} ]
    }
    return $n
}
#puts "bytecode:[::tcl::unsupported::disassemble proc runcalc]"
set start [clock milliseconds] ;# wall-clock start in ms
set output [ runcalc ]
set end [ clock milliseconds] ;# wall-clock end in ms
set duration [expr {($end - $start)}]
puts "Res = [ format %.02f $output ]"
puts "Time elapsed : [ format %.03f [ expr {$duration/1000.0} ] ]"

daga1968 commented 2 months ago

Hi,

Finally got results !!!

For W1 Vu 1:

Vuser 1:TEST RESULT : System achieved 29395 NOPM from 129007 Db2 TPM

For W20, Vu 5:

Vuser 1:TEST RESULT : System achieved 48668 NOPM from 213733 Db2 TPM

The main cause of the problem, and why it worked before but not now:

The customer had recently changed the VPN system to a new one.

The new VPN system is of the ACTIVE type, which means it monitors 100% of the TCP/IP traffic, so all incoming and outgoing data is delayed quite a bit, causing problems in all apps, including this one.

Now, working locally on the target server:

Your CPU Test results:

./hammerdbcli auto cpu_test.tcl
HammerDB CLI v4.10
Copyright (C) 2003-2024 Steve Shaw
Type "help" for a list of commands
Initialized Jobs on-disk database /tmp/hammer.DB using existing tables (540,672 KB)
Res = 873729.72
Time elapsed : 3.170

I am attaching the log files of all the build and run tests, which are now all executed locally.

Thanks a lot for your help Steve, very much appreciated.

Regards, Dave

daga1968 commented 2 months ago

launch_linux_DB2WASD_11_5.build.w1v1.20240425_1213.LOCAL.log launch_linux_DB2WASD_11_5.HMLWD115.run.20240425_1250.LOCAL.log launch_linux_DB2WASD_11_5.build.20240425_1230.LOCAL.log launch_linux_DB2WASD_11_5.HMLWD115.run.w1v1.20240425_1220.LOCAL.log

sm-shaw commented 2 months ago

Excellent, I'm glad it is resolved. For now it is not worth catching this error, as it is an edge case resulting from a configuration issue elsewhere. The CPU test result looks about right. However, just for the record, on Linux, Ubuntu often boots in powersave mode (although, depending on your BIOS settings, the hardware may override the OS and perform just as well), which is not ideal for benchmarking performance. To check, you can do the following:

Check the p-state driver is enabled:

dmesg | grep pstate
[   31.482744] intel_pstate: Intel P-state driver initializing
[   31.626162] intel_pstate: HWP enabled

Install the linux-tools package and use cpupower to check the governor:

./cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 3.90 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 800 MHz and 3.90 GHz.
                  The governor "powersave" may decide which speed to use
                   within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 800 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

In this case we were using the powersave governor after boot (this is the default, which makes sense for laptops to save battery power), so we can switch it to performance and also change the energy performance policy, i.e. how quickly frequency boosting kicks in.

./cpupower frequency-set --governor=performance
./x86_energy_perf_policy performance
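
As a cross-check that does not need cpupower, the governor can also be read per CPU from sysfs. The sketch below assumes the kernel exposes cpufreq under /sys/devices/system/cpu (in a virtual machine these files may be missing altogether); the function name `list_non_performance_cpus` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print "<cpu> <governor>" for every CPU whose cpufreq governor is not
# "performance". The sysfs root is a parameter so the function can be
# pointed at another directory; by default it reads the live system.
list_non_performance_cpus() {
    local root="${1:-/sys/devices/system/cpu}"
    local f gov cpu
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        gov="$(cat "$f")"
        cpu="$(basename "$(dirname "$(dirname "$f")")")"
        if [ "$gov" != "performance" ]; then
            echo "$cpu $gov"
        fi
    done
}
```

Running `list_non_performance_cpus` with no arguments after the governor change should print nothing if every CPU picked up the new setting.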

Run the CPU test; it should complete in a few seconds.

hammerdb>source cputest.tcl
Res = 873729.72
Time elapsed : 1.678

If you want it to run for longer, add another 0 to this line, which in this case makes it run for 16 seconds:

for {set f 1} {$f <= 100000000} {incr f} {

You can then use the turbostat utility to check that the actual CPU frequency is in the range that the spec says it should be. You can also use turbostat while HammerDB is running to monitor the frequency with all CPUs active.

Note that virtualization can sometimes impact what the virtualized OS actually sees from these commands.

Just for interest, you can see historical data here for the test this is based on for different systems http://www.juliandyke.com/CPUPerformance/CPUPerformance.php.