cldellow / datasette-parquet

Add DuckDB, Parquet, CSV and JSON lines support to Datasette
Apache License 2.0
47 stars 7 forks source link

Error with facet suggestions #21

Open rgieseke opened 1 year ago

rgieseke commented 1 year ago

Absolutely fantastic project, thanks a lot!

As hinted to in the Readme, disabling facet suggestions might be needed.

Without it disabling I was getting an error when clicking on a view of a parquet file:

datasette_parquet.exceptions.DoubleQuoteForLiteraValue: ['"unit" must appear in the GROUP BY clause or must be part of an aggregate function.\nEither add it to the GROUP BY list, or use "ANY_VALUE(unit)" if the exact value of "unit"']

When started with

datasette serve --metadata metadata.json --setting suggest_facets off

there is no problem.

Not sure if there is a way to check or prevent this error without it, happy to look into it if you have any pointers! ("unit" is a column of string values.)

cldellow commented 1 year ago

Hey, glad you were able to get it working.

I think you're touching on two things.

1) How do we warn people that suggest_facets should probably be disabled?

2) Why is that error happening?

Re (1) - The Datasette object exposes a way to query the setting values. Maybe datasette-parquet can look at this in its startup hook and print a warning?

Re (2) - the rationale for this suggestion was performance, not correctness. When running SQLite queries, Datasette can interrupt them if they take longer than, say, a second. So with SQLite, facet suggestions can be slow, but not prevent the system from working. With DuckDB, queries aren't interruptible, so facet suggestions can make the system appear to hang.

But in this case, it looks like DuckDB thinks the query is invalid, and that's throwing an exception.

This might be due to SQLite's loosey-goosey rules on GROUP BY. That is, SQLite permits the query, but DuckDB rejects it.

There is a very real chance that, although that query "works" in SQLite, it does not do what the author intended. Do you have a full stack trace? If we could figure out which facet is generating that SQL, we might want to open a bug against it.

rgieseke commented 1 year ago

Thanks for the quick feedback!

Here is the full stacktrace:

Traceback (most recent call last):
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette_parquet/winging_it.py", line 94, in execute
    rv = self.cursor.execute(sql, parameters)
duckdb.duckdb.BinderException: Binder Error: column "provenance" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(provenance)" if the exact value of "provenance" is not important.
LINE 1: SELECT provenance AS value, COUNT(*) AS n FROM...
               ^

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/app.py", line 1354, in route_path
    response = await view(request, send)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/base.py", line 134, in view
    return await self.dispatch_request(request)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/base.py", line 91, in dispatch_request
    return await handler(request)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/base.py", line 361, in get
    response_or_template_contexts = await self.data(request, **data_kwargs)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/table.py", line 158, in data
    return await self._data_traced(request, default_labels, _next, _size)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/table.py", line 562, in _data_traced
    await gather(execute_facets(), execute_suggested_facets())
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/table.py", line 180, in gather
    results.append(await fn)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/table.py", line 559, in execute_suggested_facets
    for suggest_result in await gather(*facet_suggest_awaitables):
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/views/table.py", line 180, in gather
    results.append(await fn)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/facets.py", line 179, in suggest
    distinct_values = await self.ds.execute(
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/app.py", line 782, in execute
    return await self.databases[db_name].execute(
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/database.py", line 267, in execute
    results = await self.execute_fn(sql_operation_in_thread)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette_parquet/ducky.py", line 88, in execute_fn
    return await asyncio.get_event_loop().run_in_executor(
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette_parquet/ducky.py", line 86, in in_thread
    return fn(self.conn)
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette/database.py", line 237, in sql_operation_in_thread
    cursor.execute(sql, params if params is not None else {})
  File "/home/robert/test/env/lib/python3.10/site-packages/datasette_parquet/winging_it.py", line 98, in execute
    raise exceptions.DoubleQuoteForLiteraValue(matches)
datasette_parquet.exceptions.DoubleQuoteForLiteraValue: ['"provenance" must appear in the GROUP BY clause or must be part of an aggregate function.\nEither add it to the GROUP BY list, or use "ANY_VALUE(provenance)" if the exact value of "provenance"']
INFO:     127.0.0.1:38642 - "GET /testdb/test HTTP/1.1" 500 Internal Server Error
/usr/lib/python3.10/asyncio/base_events.py:1910: RuntimeWarning: coroutine 'ArrayFacet.suggest' was never awaited
  handle = None  # Needed to break cycles when an exception occurs.
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.10/asyncio/base_events.py:1910: RuntimeWarning: coroutine 'DateFacet.suggest' was never awaited
  handle = None  # Needed to break cycles when an exception occurs.
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

GitHub doesn't allow me to upload a parquet file but I used a small test file like this:

image

{
  "plugins": {
    "datasette-parquet": {
      "testdb": {
        "directory": "./testdata"
      }
    }
  }
}

Error happens with

datasette --metadata metadata.json

No error with

datasette --metadata metadata.json --setting suggest_facets off
rgieseke commented 1 year ago

Seems to have to do with the value column. If I remove that there is no problem. So something as value is probably wrong as the query can't differentiate between the two later I assume.

The query seems to be (from commenting out the print statement in winging_it.py )

SELECT provenance AS value, COUNT(*) AS n FROM (SELECT value, provenance, code, category, year FROM test) WHERE NOT value IS NULL GROUP BY value LIMIT 31
rgieseke commented 1 year ago

Here is the query in Datasette: https://github.com/simonw/datasette/blob/452a587e236ef642cbc6ae345b58767ea8420cb5/datasette/facets.py#L168

When I run the test data in a Sqlite db with Datasette this works, so this seems to be something working differently with DuckDB.

rgieseke commented 1 year ago

It's a problem with Datasette's query (when there is a 'value' column), I've created an issue there: https://github.com/simonw/datasette/issues/2208

cldellow commented 1 year ago

Ah, great, thanks for chasing that down!

I'll leave this open as a reminder that it'd be good to add a warning about facet suggestions on startup