dirkjonker / sqlalchemy-turbodbc

SQLAlchemy dialect for Turbodbc
MIT License

Slower than pyodbc when exporting to SQL (pyodbc: 18 secs, turbodbc: 19, R: 4) #4

Closed garfieldthecat closed 5 years ago

garfieldthecat commented 5 years ago

I created 15 dataframes, each with 10,000 records, 30 columns of floats and 3 columns of text. Exporting all these dataframes to SQL takes:

pyodbc: 18 seconds
turbodbc: 19 seconds
R: 4 seconds

Why is turbodbc slightly slower than pyodbc? I would have hoped for performance closer to R's. Is there anything I can do to speed up the process? Thanks in advance!

I am using:

My code is:

import numpy as np
import pandas as pd
import timeit
from sqlalchemy import create_engine, MetaData, Table, select
import sqlalchemy_turbodbc

ServerName = 'myserver'
Database = 'mydb'
params = '?driver=SQL+Server+Native+Client+11.0'

engine2 = create_engine('mssql+turbodbc://' + ServerName + '/' + Database + params, encoding='latin1')
conn2 = engine2.connect()

sheets = 15
rows = int(10e3)

def create_data(sheets, rows):
    df = {}  # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_turbodbc(df):
    # write each dataframe in the dictionary to its own table
    for d in df:
        df[d].to_sql('Test ' + str(d), conn2, if_exists='replace')

df = create_data(sheets, rows)

t_sql_turbo = timeit.Timer("data_to_turbodbc(df)", globals=globals()).repeat(repeat=1, number=1)
dirkjonker commented 5 years ago

This is a great question, thanks for the elaborate details. Unfortunately, this issue is not related to sqlalchemy-turbodbc, which only helps SQLAlchemy create a connection with Turbodbc.

That said, I'm happy to help; your problem seems to be here:

def data_to_turbodbc(df):
    for d in df:
        df[d].to_sql('Test ' + str(d), conn2, if_exists='replace')

You are looping over your dataframe and executing 1 SQL statement per row. This is really inefficient. Try to execute it in a single statement if possible, e.g.

def data_to_turbodbc(df):
    df.to_sql('Test', conn2, if_exists='replace')

If this does not help, this seems to be a good question to ask on StackOverflow. For performance issues with Turbodbc please consider opening an issue at https://github.com/blue-yonder/turbodbc/
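
For what it's worth, if you want Turbodbc's fastest path you can also skip pandas' to_sql and use Turbodbc's column-wise bulk insert directly. A minimal sketch, assuming an illustrative connection string and a numeric-only target table Test that already exists with matching columns:

import numpy as np
import turbodbc

# Illustrative connection details; substitute your own server and driver.
connection = turbodbc.connect(
    connection_string='Driver={SQL Server Native Client 11.0};'
                      'Server=myserver;Database=mydb;Trusted_Connection=yes'
)
cursor = connection.cursor()

def bulk_insert(df, table):
    # executemanycolumns binds one NumPy array per column in a single
    # round trip, avoiding per-row parameter binding.
    columns = [np.ascontiguousarray(df[col].values) for col in df.columns]
    placeholders = ', '.join('?' for _ in df.columns)
    cursor.executemanycolumns(
        'INSERT INTO ' + table + ' VALUES (' + placeholders + ')',
        columns,
    )
    connection.commit()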

I am closing this issue because it is not related to the sqlalchemy-turbodbc package.

garfieldthecat commented 5 years ago

Thank you for looking into this! However, I believe there has been some confusion. In my code, df is a dictionary containing dataframes. df[d] is a dataframe, so df[d].to_sql() writes an entire dataframe, not a single row, to SQL. In fact, as far as I remember, in pandas to_sql is a method of the DataFrame class only. So I believe my question is still relevant.

Do you happen to have any benchmarks of pyodbc vs turbodbc and maybe vs R or an ETL tool?

The fact that R is much faster at writing the same data to the same SQL server would suggest there is nothing inherently wrong with the SQL server itself - which, by the way, is running on the very same PC I am running these scripts on.

Thanks!

garfieldthecat commented 5 years ago

PS I have re-run the very same script. pyodbc and R take the same time as before; turbodbc now takes 12 seconds. That's an improvement over pyodbc, but still way behind R. I wonder if anything can be done to speed it up even more?

dirkjonker commented 5 years ago

This may be related to how the pandas to_sql method is implemented: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

Try passing method='multi' as a parameter to to_sql.
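
A minimal sketch of how that might slot into the benchmark above; chunksize=50 is an illustrative value chosen to keep each statement under SQL Server's 2100-parameter limit, since method='multi' binds every value in a chunk as a separate parameter (34 columns per row including the index, so 50 rows is about 1700 parameters):

def data_to_turbodbc_multi(df):
    for d in df:
        # method='multi' packs many rows into each INSERT statement;
        # chunksize bounds how many rows go into a single statement.
        df[d].to_sql('Test ' + str(d), conn2, if_exists='replace',
                     method='multi', chunksize=50)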