JohnOmernik / sqlalchemy-drill

Apache Drill Dialect for SQL Alchemy

Unexpected error module 'sqlalchemy_drill.drilldbapi' has no attribute 'Error' in Superset #42

Closed jwalters-gpsw closed 3 years ago

jwalters-gpsw commented 4 years ago

I'm using the latest preset/superset Docker container in a Kubernetes cluster. When I try to "test" a Drill database connection in Superset, the gunicorn-hosted Superset logs this error:

Unexpected error module 'sqlalchemy_drill.drilldbapi' has no attribute 'Error'
superset
[Jun 15, 2020 3:57:53 PM GMT-7]
ERROR:superset.views.core:Unexpected error module 'sqlalchemy_drill.drilldbapi' has no attribute 'Error'

Unfortunately the stack trace isn't showing up in the logs. It seems to be a mismatch between the Superset code and the driver logic. Any suggestions on how to debug this further?

ewheeler commented 4 years ago

@jwalters-gpsw I saw the same exception, and in my case increasing timeouts kept this error from popping up.

Try:

  1. doubling gunicorn's timeout (e.g., starting gunicorn with a larger timeout, like gunicorn -w 10 --timeout 240 --limit-request-line 0 --limit-request-field_size 0 "superset.app:create_app()")
  2. setting similarly generous timeouts for Superset's SQLLAB_TIMEOUT and SUPERSET_WEBSERVER_TIMEOUT in your superset_config.py (see the sketch below)
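
For reference, here's a minimal sketch of those two settings in superset_config.py; the 240-second value just mirrors the gunicorn --timeout above and is an assumption, not a tested recommendation:

# superset_config.py -- raise Superset's own timeouts to match gunicorn's
SQLLAB_TIMEOUT = 240               # seconds; assumed value mirroring --timeout 240
SUPERSET_WEBSERVER_TIMEOUT = 240   # seconds; assumed value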

I had to make two small changes to this library as well.

The first change is related to timeouts: I added a timeout=None parameter when calling the Drill API with requests here (session.post): https://github.com/JohnOmernik/sqlalchemy-drill/blob/master/sqlalchemy_drill/drilldbapi/_drilldbapi.py#L79

This way requests will not time out on long-running Drill queries. I don't have or need a very performant Drill setup, so this might not be an issue with your Drill installation. I've found that using Superset's query caching with Redis is preferable to spending more money on a beefier Drill installation.
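
To give a sense of the change, here is an illustrative sketch of a requests call against Drill's REST query endpoint with the added parameter; the variable names and surrounding code are stand-ins, not the exact contents of _drilldbapi.py:

import json
import requests

drill_base_url = "http://localhost:8047"  # assumed Drill REST address
sql = "SELECT 1"                          # placeholder query

session = requests.Session()
response = session.post(
    drill_base_url + "/query.json",       # Drill's REST query endpoint
    data=json.dumps({"queryType": "SQL", "query": sql}),
    headers={"Content-Type": "application/json"},
    timeout=None,  # no client-side timeout, so long-running queries can finish
)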

The second change makes adding tables in Superset work for me: I changed this line https://github.com/JohnOmernik/sqlalchemy-drill/blob/master/sqlalchemy_drill/base.py#L156

from:

formatted_schema = plugin + "`.`" + table + "`"

to:

formatted_schema = plugin + "." + schema_parts[1] + ".`" + schema_parts[2] + "`"
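
To illustrate the new formatting (assuming schema_parts comes from splitting a dotted schema name, which is an assumption about the surrounding code in base.py):

# Illustration only: a three-part dotted schema split into parts
schema_parts = ["az", "blob", "folder"]
plugin = schema_parts[0]
formatted_schema = plugin + "." + schema_parts[1] + ".`" + schema_parts[2] + "`"
print(formatted_schema)  # az.blob.`folder`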

I believe this change fixes one of the problems described by @coderfender in https://github.com/JohnOmernik/sqlalchemy-drill/issues/25

I'm new to Drill, so I'm not yet sure whether my fix is specific to my Drill config/use-case (querying parquet files stored in an Azure blob container via Drill's azure-blob-storage-plugin) or whether this change would negatively affect typical usage. If anyone has suggestions for how to determine this, please let me know!

N.B. I also made some changes to work around issues with Superset+Drill handling of datetime and JSON columns. Happy to share those too if they would be useful to folks!

cgivre commented 4 years ago

@ewheeler Can you share your workarounds? Also, can you share how you configured Superset to work with caching? I've had a lot of difficulty getting that to work properly.

jwalters-gpsw commented 4 years ago

I was able to connect (test connection) from Superset with the sqlalchemy-drill driver installed into the venv where I pip-installed Superset (Python 3.7.5). This is against the drill-1.16 container.

I don't know why it's different now, but the test connection works. Yippee. Now I need to figure out how to get the sample data into Superset.

ewheeler commented 4 years ago

@cgivre sure. I've forked and pushed the fixes mentioned above here, and described the superset_config.py parts below.

Regarding the datetime and JSON workarounds: this is the change to sqlalchemy-drill relevant to JSON that prevents Superset from choking on JSON columns. Note that these JSON workarounds only avoid exceptions that can prevent successfully adding a table as a datasource in Superset; JSON columns probably won't be usable in Superset charts unless you add an expression to each JSON column's record via the Superset UI.

That change is complemented by the following addition to superset_config.py.

In summary, this uses SQLAlchemy's event listeners to hook into column changes before they are flushed and committed. It looks for certain expected column names and then sets the column type appropriately. This avoids having to navigate to Superset's column-record edit page after adding a table in order to manually set the is_temporal checkbox and change the column type. Note that I've put this in my custom SupersetSecurityManager, where I've defined some other custom behaviors; it can probably stand on its own (pasting as-is since I haven't tested and confirmed):

import logging
import sqlalchemy as sqla
from sqlalchemy import Table
from superset.security import SupersetSecurityManager

log = logging.getLogger(__name__)

class CustomSecurityManager(SupersetSecurityManager):

    @sqla.event.listens_for(Table, "column_reflect")
    def set_column_types(inspector, table, column_info):
        # Our datetime columns in parquet come through as `UserDefinedType`,
        # so set the type to `DateTime` so Superset knows to treat them as `is temporal`.
        # https://docs.sqlalchemy.org/en/13/core/custom_types.html#working-with-custom-types-and-reflection

        temporal_cols = ["created_on", "sent_on"]
        if column_info["name"] in temporal_cols:
            column_info["type"] = sqla.types.DateTime()
            log.info("changed {0} col to datetime in {1} table".format(column_info["name"], table))

        json_cols = ["groups", "labels"]
        if column_info["name"] in json_cols:
            column_info["type"] = sqla.types.JSON()
            log.info("changed {0} col to JSON in {1} table".format(column_info["name"], table))

CUSTOM_SECURITY_MANAGER = CustomSecurityManager

To enable caching, here are the relevant bits from my superset_config.py:

CACHE_CONFIG = {                                                                
    'CACHE_TYPE': 'redis',                                                      
    'CACHE_DEFAULT_TIMEOUT': 60 * 60 * 24, # 1 day default (in secs)            
    'CACHE_KEY_PREFIX': 'superset_results',                                     
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',                              
}                                                                               

from celery.schedules import crontab                                            
REDIS_HOST = 'localhost'                                                        
REDIS_PORT = '6379'   

from werkzeug.contrib.cache import RedisCache                                   
RESULTS_BACKEND = RedisCache(                                                   
    host='localhost', port=6379, key_prefix='superset_results') 

class CeleryConfig(object):                                                                             
    #   https://stackoverflow.com/a/61750648                                    
    #   $ celery worker --app=superset.tasks.celery_app:app -Ofair -f /var/log/superset/celery_worker.log &&
    #         celery beat --app=superset.tasks.celery_app:app -f /var/log/superset/celery_beat.log                                                                         
    #   $ celery flower --app=superset.tasks.celery_app:app                       

    BROKER_URL = "redis://%s:%s/0" % (REDIS_HOST, REDIS_PORT)                   
    CELERY_IMPORTS = (                                                          
        'superset.sql_lab',                                                     
        'superset.tasks',                                                       
    )                                                                           
    CELERY_RESULT_BACKEND = "redis://%s:%s/1" % (REDIS_HOST, REDIS_PORT)        
    CELERY_ANNOTATIONS = {                                                      
        'sql_lab.get_sql_results': {                                            
            'rate_limit': '100/s',                                              
        },                                                                      
        'email_reports.send': {                                                 
            'rate_limit': '1/s',                                                
            'time_limit': 120,                                                  
            'soft_time_limit': 150,                                             
            'ignore_result': True,                                              
        },                                                                      
    }                                                                           
    CELERY_TASK_PROTOCOL = 1                                                    
    CELERYBEAT_SCHEDULE = {                                                     
        'email_reports.schedule_hourly': {                                      
            'task': 'email_reports.schedule_hourly',                            
            'schedule': crontab(minute='1', hour='*'),                          
        },                                                                      
    }                                                                           
CELERY_CONFIG = CeleryConfig

cgivre commented 4 years ago

@ewheeler Could you create a PR with these changes? I just made some mods to fix this issue as well. There was a previous PR that seemed to break a few things; that has since been fixed.

jpedrick commented 3 years ago

These changes sound like they might help with Superset, but there still seems to be an underlying problem causing the drilldbapi "has no attribute 'Error'" exception. It seems like self.dialect.dbapi.Error is missing when using drill+sadrill:// type connections.

In my case, I found this error is thrown when a query returns zero rows, even though Drill still provides the query schema. Is it possible for sqlalchemy-drill to return an empty result set without throwing an error?
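
In DB-API terms, the behavior being asked for is a cursor that still populates .description from the schema Drill returns but simply yields no rows. A minimal sketch of that behavior (the class and names here are illustrative, not sqlalchemy-drill internals):

class EmptyResultCursor:
    """Illustrative only: a DB-API style cursor for a zero-row result."""

    def __init__(self, column_names):
        # PEP 249 'description': one 7-item sequence per column,
        # with the column name first and the rest optional.
        self.description = [
            (name, None, None, None, None, None, None) for name in column_names
        ]
        self.rowcount = 0

    def fetchall(self):
        return []  # an empty result set, no exception raised

cursor = EmptyResultCursor(["id", "name"])
print(cursor.description)  # the schema is still visible
print(cursor.fetchall())   # []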

CarlosRCDev commented 3 years ago

A quick fix for this error is to append the following line to the file sqlalchemy_drill/drilldbapi/__init__.py:

from .api_exceptions import *

The complete resulting __init__.py:

from ._drilldbapi import *
from .api_exceptions import *

Bash snippet:

echo "from .api_exceptions import *" >> /usr/local/lib/python3.7/site-packages/sqlalchemy_drill/drilldbapi/__init__.py

cgivre commented 3 years ago

@CarlosRCDev Can you create a PR with this fix? We can get it merged quickly.

cgivre commented 3 years ago

I believe this was fixed as part of the recent updates.