cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0

Impyla + Dask : How to setup the impala URI ? #556

Open frbelotto opened 1 month ago

frbelotto commented 1 month ago

I currently have a working Impala connection using impyla.

import sqlalchemy
import pandas as pd
from impala.dbapi import connect

def conn():
    # host, port, path, user and pwd are defined elsewhere
    return connect(host=host,
                   port=port,
                   database='default',
                   timeout=60,
                   use_ssl=True,
                   auth_mechanism="LDAP",
                   use_http_transport=True,
                   http_path=path,
                   user=user,
                   password=pwd)

engine = sqlalchemy.create_engine('impala://', creator=conn)
query = '''select * from HIVE_ENI.NVG_USU_CNL_DGTL as t1 '''
pd.read_sql(query, engine, index_col='horario')

It works, although it raises a deprecation warning:

SADeprecationWarning: The dbapi() classmethod on dialect classes has been renamed to import_dbapi().  Implement an import_dbapi() classmethod directly on class <class 'impala.sqlalchemy.ImpalaDialect'> to remove this warning; the old .dbapi() classmethod may be maintained for backwards compatibility.
  engine = sqlalchemy.create_engine('impala://', creator=conn)

But as my database is getting bigger, I am trying to move from pandas to Dask. The issue is that Dask requires a connection string instead of an engine:

import dask.dataframe as dd
dd.read_sql(query, engine, index_col='horario')
`TypeError: con must be of type str, not <class 'sqlalchemy.engine.base.Engine'>Note: Dask does not support SQLAlchemy connectables here`

These might be stupid questions, but:

1) How can I get rid of the deprecation warning when creating the engine?
2) How do I build the connection URI for my server, given the connection parameters shown above, so that I can use it with Dask?
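
For the second question, one possible starting point is to assemble the URI by hand. The sketch below only builds the string with the standard library. It assumes, without having verified it here, that the impyla SQLAlchemy dialect forwards the query-string keys to `connect()` and copes with values arriving as text (for example `use_ssl=True` becoming the string `"True"`). The helper name `build_impala_uri` and the host, user, password and `http_path` values are hypothetical placeholders.

```python
from urllib.parse import quote_plus, urlencode


def build_impala_uri(host, port, user, password, database="default", **connect_args):
    """Assemble an impala:// SQLAlchemy-style URI string.

    Assumption: the impyla dialect passes the query-string keys on to
    impala.dbapi.connect(); this is not verified by this sketch.
    """
    # Percent-encode credentials so characters such as '@' or ':' do not break the URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    # Query-string values are always text, so every value is stringified.
    query = urlencode({key: str(value) for key, value in connect_args.items()})
    uri = f"impala://{credentials}@{host}:{port}/{database}"
    return f"{uri}?{query}" if query else uri


# Placeholder values; substitute the real host, port, path and credentials.
uri = build_impala_uri(
    host="impala.example.com",
    port=443,
    user="my_user",
    password="p@ss:word",
    database="default",
    auth_mechanism="LDAP",
    use_ssl=True,
    use_http_transport=True,
    http_path="cliservice",
    timeout=60,
)
print(uri)
```

The resulting string is what would be passed as `con` to Dask. Whether the dialect accepts every one of these keys in the query string would still need to be checked against the installed impyla version.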