TPC-Council / HammerDB

HammerDB Database Load Testing and Benchmarking Tool
http://www.hammerdb.com
GNU General Public License v3.0
545 stars 115 forks source link

Add BCP based load to TPROC-H Schema for SQL Server #593

Closed sm-shaw closed 9 months ago

sm-shaw commented 10 months ago

Pull Requests #587 and #592 have added a BCP based load to the TPROC-C schema that improves loading performance. The same approach could be added to the TPROC-H schema to also improve performance there, using these pull requests as a template for the changes that need to be added.

JoshInnis commented 10 months ago

Hello,

I work on the same team as Krithika Satish and I am using those PRs as templates to alter the TPROC-H loading performance.

The template from those PRs are working well for this load process to with one issue. The address field for the supplier and customer tables is created with this command: set address [ V_STR 25 ]

This command will allow addresses to be created with commas in it. Which causes issues with BCP. The current solution we have is to run the command: regsub -all "," $address " " address Which replaces the commas with spaces. This logic currently only runs when the BCP logic is run.

sm-shaw commented 10 months ago

Hi,

We shouldn't change the data as that is what is derived from the TPC-H specification. However, I would be really surprised if BCP didn't work with commas in it, as I would think this would be a fairly typical requirement. Looking at the documentation:

https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-ver16&redirectedfrom=MSDN https://learn.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-ver16

It looks like the "-t" flag is used to set the delimiter and so far for TPROC-C we are setting a delimiter as a comma i.e. -t ","

exec bcp $tableName IN $filePath -b 500000 -a 16000 -T -S $server -c -t ","

Therefore, for data with commas in it be able to change the delimiter for something else.

As an example of what do see the commit here: https://github.com/TPC-Council/HammerDB/pull/592/commits/a0e96c0d7cedec7a38734d0284aa010cf6e3eb55

In the procedures creating the data you could set a variable called delimiter and then do upvar 2 in the bcp procedure to reference the setting for a particular table being loaded and then this delimiter would be used in the bcp procedure as follows:

exec bcp $tableName IN $filePath -b 500000 -a 16000 -T -S $server -c -t "$delimiter"

To be sure it is set you could also use the following so if delimiter is not set then a comma is used by default.

upvar 2 delimiter delimiter if {![info exists $delimiter]} { set delimiter "," }

For manual bulk loads I have previously used a pipe delimiter "|" https://www.hammerdb.com/docs/ch13s04.html

JoshInnis commented 10 months ago

Would it make sense instead to use pipes '|' for all temporary files in the TPCH load? That would simplify the codebase a little bit.

JoshInnis commented 10 months ago

The bcp utility has some restrictions to date formatting that are causing an issue. The function mk_timein tpchcommon-1.0.tm formats dates with the three-letter abbreviation of the month, i.e.. 2023-SEP-14. However, in testing, BCP requires the numbered syntax: 2023-09-14, otherwise a parsing error is thrown.

Currently I made a copy on the function mk_time called mk_time_bcp that makes the necessary alteration for BCP (TPCC uses datetime in its tables, whereas TPCH uses date, which is why this issue didn't appear earlier).

If this is an adequate solution, does this new function belong in tpchcommon-1.0.tm or in mssqlsolap.tcl as this is the only place in the codebase where this function is needed?

sm-shaw commented 10 months ago

Would it make sense instead to use pipes '|' for all temporary files in the TPCH load? That would simplify the codebase a little bit.

Yes, certainly if this is easier for you, then you can use a pipe delimiter for TPROC-H. A comma was only used as that is what was used for the TPROC-C BCP build.

sm-shaw commented 10 months ago

Currently I made a copy on the function mk_time called mk_time_bcp that makes the necessary alteration for BCP (TPCC uses datetime in its tables, whereas TPCH uses date, which is why this issue didn't appear earlier).

If this is an adequate solution, does this new function belong in tpchcommon-1.0.tm or in mssqlsolap.tcl as this is the only place in the codebase where this function is needed?

Good question and solution. Let's keep it in tpchcommon-1.0.tm so it is next to the original function for reference and maybe just add a very brief comment above why it is needed for a reminder in future.