Teradata / PyTd

A Python Module to make it easy to script powerful interactions with Teradata Database in a DevOps friendly way.
MIT License

SQL Assistant vs PyTd #40

Closed pGit1 closed 8 years ago

pGit1 commented 8 years ago

When executing a query via session.execute(file="\...\file.sql"), I can write the results to a text file using the standard Python IO interface. Here is my code:

result = session.execute(file="\...\file.sql")

text_file = open("\....\outfile.txt", "w")

# Add the first row as headers
col = result.description
for i in col:
    text_file.write(i[0] + '|')
text_file.write('\n')

# Add data to the file one row at a time
# (safe_str is a user-defined helper that converts a value to a string)
for i in result:
    for j in i.values:
        text_file.write(safe_str(j) + '|')
    text_file.write('\n')
text_file.close()

This works great for small datasets but is still VERY (prohibitively) slow for large datasets. From SQL Assistant I can use the export utility to write large text files to a location in reasonable time; with Python it can take three to four times as long.

I thought there was a fix for this but it does not seem to be working for me. Is there a way to remedy this? Please help with example code if possible. Your help is MUCH appreciated!

escheie commented 8 years ago

Using the Python module to export data will never be as fast as native methods. The value of each column of every row is converted to a Python object before being written to the file. A 3x difference due to this overhead sounds about right. Before the previous bug was fixed, you would have seen times much slower than that.

Your code could benefit, though, from setting the number of rows fetched at a time to a value larger than the default of 1. You do this by setting the arraysize attribute on the cursor object, as described in the Python Database API Specification 2.0. Below is an example of increasing the number of rows fetched at a time to 100:

with session.cursor() as cursor:
    cursor.arraysize = 100
    cursor.execute(file="\...\file.sql")
    for row in cursor:
        # Write to file.

When I fetch 100,000 rows with arraysize = 1, it takes about 14.7 seconds. If I fetch 100,000 rows with arraysize = 100, it takes about 9.7 seconds, so there is about a 1.5x performance improvement from increasing this value. Note that setting arraysize too high may actually degrade performance.
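
Roughly, that comparison can be timed with something like the following (a sketch only: it assumes the same session and placeholder query file as above, and actual timings will vary with the system, the network, and the width of the rows):

import time

for size in (1, 100):
    with session.cursor() as cursor:
        cursor.arraysize = size
        start = time.time()
        cursor.execute(file="\...\file.sql")
        rows = sum(1 for row in cursor)  # fetch every row without writing anything
        elapsed = time.time() - start
        print("arraysize=%d: %d rows in %.1f seconds" % (size, rows, elapsed))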

pGit1 commented 8 years ago

Thanks for this code, Eric! I will integrate it into my own.
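
Something along these lines is what I have in mind (a minimal sketch using the same placeholder paths and my safe_str() helper from the code above):

with session.cursor() as cursor:
    cursor.arraysize = 100  # fetch 100 rows per round trip
    cursor.execute(file="\...\file.sql")
    with open("\....\outfile.txt", "w") as text_file:
        # Header row from the cursor metadata (DB API 2.0 description tuples)
        text_file.write('|'.join(col[0] for col in cursor.description) + '\n')
        # Data rows, pipe-delimited, one line per row
        for row in cursor:
            text_file.write('|'.join(safe_str(v) for v in row.values) + '\n')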

pGit1 commented 8 years ago

Thanks for the response and the code, Eric!
