dropbox / PyHive

Python interface to Hive and Presto. 🐝

PyHive, Presto connector returning wrong resultset #460

Open alextk87 opened 1 year ago

alextk87 commented 1 year ago

I'm using a Presto cluster to process large amounts of data.

To visualize the data I use the connector recommended by the official Superset documentation, PyHive via SQLAlchemy, with the default connection settings.

When I run a very simple query, "SELECT * FROM test_table", through the PyHive Presto connector, the number of rows in the result set is wrong compared with the same query executed in presto-cli, the official client documented by Presto.

I created two simple Python scripts to test the Presto connection: one using PyHive and one using the official JDBC driver (jar).

The PyHive connector returned the wrong number of rows, about 817,000, exactly the same count shown by the Superset chart. The official JDBC driver returned the correct amount of data: 875,000 rows.

It looks like the issue is caused by the PyHive connector. Is it possible to change the connection method from PyHive to the official JDBC driver?
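For reference, Superset reaches Presto through the SQLAlchemy presto:// dialect that PyHive registers, so the same path can be exercised directly from a short script. A minimal sketch, assuming the same hypothetical host, port, user, catalog and schema used in the scripts below:

# Sketch only: run COUNT(*) through the SQLAlchemy presto:// dialect that
# Superset uses, to compare against the row counts seen in presto-cli.
from sqlalchemy import create_engine, text

engine = create_engine("presto://testUser@localhost:30000/pinot/default")
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM test_table")).scalar()
    print("COUNT(*) via SQLAlchemy/PyHive:", count)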

I'm attaching the two Python scripts that I used to reproduce the issue.

# This Python script tests the Presto connection using PyHive.
from pyhive import presto

def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    # Connect with the default settings, plus a generous output-size session property.
    connection = presto.connect(host=host, port=port, username=user, catalog=catalog,
                                schema=schema, session_props={'query_max_output_size': '1TB'})
    cursor = connection.cursor()
    total_rows = 0

    try:
        query = f"SELECT * FROM {table}"
        cursor.execute(query)

        # Fetch in batches until the cursor is exhausted, counting every row.
        while True:
            rows = cursor.fetchmany(max_rows)

            if not rows:
                break

            for row in rows:
                total_rows += 1
                print(row)

    except Exception as e:
        print("Error executing the query:", e)

    finally:
        print(total_rows)
        cursor.close()
        connection.close()

if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUser"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000

    execute_presto_query(host, port, user, catalog, schema, table, max_rows)

# This Python script tests the Presto connection using the official JDBC driver.
import jaydebeapi
import jpype

def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    # Start the JVM with the Presto JDBC driver on the classpath.
    jar_file = '/home/admin1/Downloads/presto-jdbc-0.282.jar'
    jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=" + jar_file)

    connection_url = f'jdbc:presto://{host}:{port}/{catalog}/{schema}'
    conn = jaydebeapi.connect(
        'com.facebook.presto.jdbc.PrestoDriver',
        connection_url,
        {'user': user},
        jar_file
    )
    cursor = conn.cursor()

    try:
        query = f"SELECT * FROM {table}"
        cursor.execute(query)

        # Fetch everything at once and count the rows.
        rows = cursor.fetchall()
        for row in rows:
            print(row)

        print(f"Total rows returned: {len(rows)}")

    except Exception as e:
        print("Error executing the query:", e)

    finally:
        cursor.close()
        conn.close()

    jpype.shutdownJVM()

if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUsername"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000

    execute_presto_query(host, port, user, catalog, schema, table, max_rows)
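
As an additional cross-check, here is a sketch (reusing the same hypothetical connection values as above) that counts rows one at a time with plain fetchone() instead of fetchmany(), to help isolate whether the batched fetch path is where rows go missing:

# Sketch only: count rows via fetchone() and compare against COUNT(*).
from pyhive import presto

connection = presto.connect(host="localhost", port=30000, username="testUser",
                            catalog="pinot", schema="default")
cursor = connection.cursor()

cursor.execute("SELECT COUNT(*) FROM test_table")
print("COUNT(*) reported by Presto:", cursor.fetchone()[0])

cursor.execute("SELECT * FROM test_table")
total = 0
while cursor.fetchone() is not None:
    total += 1
print("Rows seen via fetchone():", total)

cursor.close()
connection.close()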
mdeshmu commented 1 year ago

Which versions of PyHive and SQLAlchemy are you using?

alextk87 commented 1 year ago

Sorry for my late reply, I was on vacation. PyHive is 0.6.2 and SQLAlchemy is 2.0.19. The issue is reproducible with result sets larger than 150,000 records.
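A quick way to confirm which versions are actually importable in the environment Superset runs in (a sketch; the names are the standard PyPI distribution names):

# Sketch: print the installed distribution versions.
from importlib.metadata import version

print("PyHive:", version("PyHive"))
print("SQLAlchemy:", version("SQLAlchemy"))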

mdeshmu commented 1 year ago

Can you try the latest version of PyHive, i.e. 0.7.0, to check whether the issue still exists?