druid-io / pydruid

A Python connector for Druid
Other
506 stars 194 forks source link

Lack of description for empty result set causes SQLAlchemy exception #151

Closed john-bodley closed 5 years ago

john-bodley commented 5 years ago

SQLAlchemy throws an AttributeError when fetching from a ResultProxy when the underlying Druid query results in an empty result set, e.g. no rows, as the proxy is prematurely closed, i.e.,

>>> from sqlalchemy.engine import create_engine

>>> engine = create_engine('druid://localhost:8082/druid/v2/sql')
>>> result = engine.execute("SELECT SUM(x) FROM foo WHERE bar IS NULL")
>>> result.fetchall()
...
AttributeError: 'NoneType' object has no attribute 'fetchall'

The reason the ResultProxy is closed is because cursor.description is None. This results in result.cursor being None which is why the exception is thrown.

It seems that other dialects have a description irrespective of whether there are rows and hence don't suffer from the same issue. Note I'm uncertain whether there's even a viable solution as the Druid SQL REST API doesn't provide this information for an empty result set,

> cat query.json
{"query": "SELECT SUM(x) FROM foo WHERE bar IS NULL"}

> curl -XPOST -H 'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d @query.json
[]

Note I know one can circumvent this issue by using a raw connection, however this example illustrates the behavior that Pandas uses for reading SQL.

to: @betodealmeida @mistercrunch

betodealmeida commented 5 years ago

@john-bodley note that cursor.description should be None in this case:

This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet. (https://www.python.org/dev/peps/pep-0249/#description)

betodealmeida commented 5 years ago

Maybe we're missing something in the SQLAlchemy dialect?

betodealmeida commented 5 years ago

@john-bodley note too that .fetchall() should return an exception when no rows are returned:

An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet. (https://www.python.org/dev/peps/pep-0249/#fetchall)

So I think this is behaving as expected.

john-bodley commented 5 years ago

@betodealmeida the fetchall isn’t throwing the exception, but instead the underlying cursor is None because it’s prematurely closed.

betodealmeida commented 5 years ago

@john-bodley but my point is, you should always wrap your .fetchall() in a try/except block, in case zero rows were returned.

john-bodley commented 5 years ago

@betodealmeida note's it's the ResultProxy fetchall method that I'm calling which wraps the underlying DBAPI and has slightly different behavior (per the documentation).

Fetch all rows, just like DB-API cursor.fetchall().

After all rows have been exhausted, the underlying DBAPI cursor resource is released, and the object may be safely discarded.

Subsequent calls to ResultProxy.fetchall() will return an empty list. After the ResultProxy.close() method is called, the method will raise ResourceClosedError.

john-bodley commented 5 years ago

@betodealmeida I created https://github.com/sqlalchemy/sqlalchemy/issues/4523 to discuss how one should handle an empty result set associated with a SELECT statement when no description is available.

It seems pydruid should provide a description even if no rows are returned, the issue is it may be non-trivial to determine the column names to add to the description given the lack of metadata provided by the Druid SQL REST API.

john-bodley commented 5 years ago

@betodealmeida it seems that https://github.com/apache/incubator-druid/pull/6191 will help remedy the solution as the SQL interface provides a mechanism for returning the column names via the X-Druid-Column-Names header which will be included in version 0.13.0 (currently incubating).

Makesh-Gmak commented 5 years ago

@john-bodley : I am also impacted by this issue. I am currently using Druid 0.13.0. But still I am not receiving X-Druid-Column-Names header in my response. Any druid config file need to be updated for this ?

Makesh-Gmak commented 5 years ago

As a part of this bug (https://github.com/apache/incubator-druid/issues/6409,) , it looks like they moved the column info to response body which can be seen by setting "header" : true in the request, which will never return empty response.

The below query will return 1 row with all columns Query

{
  "query":"SELECT delta, page, added, deleted FROM wikipedia   WHERE 1=0  ",
  "resultFormat" : "object",
  "header" : true
}

Result

[
    {
        "delta": null,
        "page": null,
        "added": null,
        "deleted": null
    }
]

More: Section "Response" under http://druid.io/docs/latest/querying/sql.html

john-bodley commented 5 years ago

@Makesh-Gmak I can look into providing a fix for this.

@betodealmeida what are your thoughts on how to best proceed? It seems likes including the header in the request for older versions doesn't break things, however one cannot differentiate between the responses whether the header is present, i.e., it merely inserts a new record at the beginning of the result set, .e.g.,

[
    {
        "delta": null,
        "page": null,
        "added": null,
        "deleted": null
    },
    {
        "delta": 36,
        "page": "Talk:Oswald Tilghman",
        "added": 36,
        "deleted": 0
    }
]

Checking whether the first records has all NULLs as values isn't a robust method as it's definitely plausible to have a result set which includes NULLs. I was thinking the only solutions include:

  1. Adding the /status API call to determine the version of the Druid broker prior to executing each query.
  2. Modify the connection to include the Druid version or whether headers are supported (preferred).

Note (2) is more performant than (1) however it requires additional metadata.

Makesh-Gmak commented 5 years ago

Thanks for the resolution. When will be the next release ?