apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
360 stars 87 forks source link

adbc.snowflake.statement.* options are not effective ["python"] #2146

Open davlee1972 opened 1 week ago

davlee1972 commented 1 week ago

What happened?

I've tried changing adbc.snowflake.statement options, but none of the settings appear to have any effect with adbc_ingest().

I've tried the following:

"adbc.snowflake.statement.ingest_copy_concurrency": "0", "adbc.snowflake.statement.ingest_copy_concurrency": "1",

"adbc.snowflake.statement.ingest_target_file_size": "100mb", "adbc.snowflake.statement.ingest_target_file_size": "100", "adbc.snowflake.statement.ingest_target_file_size": "100000000",

But the result always ends up with ~120 PUTs with concurrent overlapping COPY INTO(s) running in parallel..

Changing any of the above to int(s) like: "adbc.snowflake.statement.ingest_copy_concurrency": 0, results in an error which is expected, so the parameters are being checked, but they just don't impact anything..

  File "../miniconda3/lib/python3.9/site-packages/adbc_driver_snowflake/__init__.py", line 156, in connect
    return adbc_driver_manager.AdbcDatabase(driver=_driver_path(), **kwargs)
  File "adbc_driver_manager/_lib.pyx", line 482, in adbc_driver_manager._lib.AdbcDatabase.__init__
  File "adbc_driver_manager/_lib.pyx", line 276, in adbc_driver_manager._lib._to_bytes
ValueError: value must be str or bytes

I'm also passing a recordbatchreader into adbc_ingest()

Stack Trace

No response

How can we reproduce the bug?

No response

Environment/Setup

No response

zeroshade commented 1 week ago

@joellubi can you take a look at this when you get a chance? thanks!

joellubi commented 1 week ago

@davlee1972 Can you please share the code you used to pass in those parameters?

Based on the traceback the error is occurring during database initialization so my guess is those parameters are being provided as database kwargs. The ingestion parameters are defined on the statement rather than the database. They can be set as follows:

with adbc_driver_snowflake.dbapi.connect(uri) as conn, conn.cursor() as cur:
    cur.adbc_statement.set_options(**kwargs)
    cur.adbc_ingest(...)
joellubi commented 1 week ago

FWIW I do think the way parameters handling works right now is kind of awkward. I would be for cleaning it up some way if it can help make this more intuitive.

davlee1972 commented 1 week ago

Ok, I'll try that, but it is a kinda awkward. The documentation isn't clear that these are adbc_statement options as part of the cursor since the options are show right after the database kwargs sample.

image

joellubi commented 1 week ago

I think we should update the documentation to make this clearer, and perhaps make the interface more consistent.

@lidavidm Do you know if adding optional parameters to Connection.cursor() would be a problem for the dbapi spec?

I'm thinking for consistency to allow stmt_kwargs to be specified similar to how the db and conn kwargs already are:

with adbc_driver_snowflake.dbapi.connect(uri, db_kwargs=db_kwargs, conn_kwargs=conn_kwargs) as conn:
    with conn.cursor(stmt_kwargs=stmt_kwargs) as cur:
        cur.adbc_ingest(...)

Thoughts on this?

CC: @zeroshade

lidavidm commented 1 week ago

I think it would be OK only as a keyword-only argument like adbc_stmt_kwargs