crate / sqlalchemy-cratedb

SQLAlchemy dialect for CrateDB.
https://cratedb.com/docs/sqlalchemy-cratedb/
Apache License 2.0

SQLAlchemy backlog #74

Open amotl opened 2 years ago

amotl commented 2 years ago

Hi there,

While working on crate/crate-python#391, I have accumulated a few backlog items, which I will gather within this ticket.

Internals

We've identified a few shortcomings in the internal implementation of the CrateDB SQLAlchemy dialect. While it seems to work in general, those spots can be improved to align better with SQLAlchemy's internal API hooks, and with how the CrateDB dialect interacts with them.
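As a point of reference for that alignment work, here is a minimal sketch of how a third-party dialect hooks into SQLAlchemy's machinery via `DefaultDialect` and the dialect registry. The class name, module wiring, and reflection query are illustrative assumptions, not the actual layout of this package:

```python
from sqlalchemy.dialects import registry
from sqlalchemy.engine import default


class MyCrateDialect(default.DefaultDialect):
    """Hypothetical minimal dialect skeleton, for illustration only."""

    name = "crate"

    # SQLAlchemy invokes reflection hooks like this one during inspection;
    # a well-aligned dialect overrides the documented hooks instead of
    # bypassing the upstream machinery.
    def get_table_names(self, connection, schema=None, **kw):
        result = connection.exec_driver_sql(
            "SELECT table_name FROM information_schema.tables"
        )
        return [row[0] for row in result]


# Make the dialect resolvable from URLs such as "crate://...".
registry.register("crate", __name__, "MyCrateDialect")
```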

More

With kind regards, Andreas.

robd003 commented 2 years ago

Getting support for async SQLAlchemy would be super useful.

I have a few queries that can take slightly over one second to execute, and being able to not block would be HUGE.

amotl commented 1 year ago

Dear Robert,

Support for asynchronous communication with SQLAlchemy, based on the asyncpg and psycopg3 drivers, is being evaluated at crate/crate-python#532. Please note that this is experimental, and we currently have no schedule for when or how it will be released.
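For illustration, here is a sketch of what asynchronous usage could look like once such support lands. The `crate+asyncpg` URL scheme is an assumption based on the linked evaluation, not a released API:

```python
import asyncio

from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine


async def main():
    # "crate+asyncpg" is hypothetical; see crate/crate-python#532.
    engine = create_async_engine("crate+asyncpg://localhost:5432/doc")
    async with engine.connect() as conn:
        # The slow query no longer blocks the event loop; other
        # tasks can run while the result is awaited.
        result = await conn.execute(text("SELECT name FROM sys.cluster"))
        print(result.scalar_one())
    await engine.dispose()


asyncio.run(main())
```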

With kind regards, Andreas.

amotl commented 1 year ago

Coming from https://github.com/crate/crate-python/pull/553#issuecomment-1545704082 and https://github.com/crate/crate-python/pull/553#discussion_r1192554584, there are a few additional backlog items for SQLAlchemy/pandas/Dask:

[^1]: Rationale: If you are professionally scheduling cluster workloads, you will probably know the number of cores in advance anyway. Still, inquiring the number of available cores at runtime, and using that figure on demand, makes sense for a program meant to run in different environments, for example in Jupyter notebooks which do not reach out to a cluster and only process fractions of the whole workload on smaller workstations. Those environments should still utilize their resources as well as possible, i.e. not use only 4 cores while 16 are available.
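 As a concrete illustration, the usable core count can be inquired at runtime with standard-library Python alone; a minimal sketch, where the `usable_cores` helper is hypothetical and introduced here only for illustration:

 ```python
 import os


 def usable_cores(default: int = 4) -> int:
     """Number of CPU cores the current process may actually use."""
     try:
         # Respects CPU affinity masks, e.g. inside Linux containers.
         return len(os.sched_getaffinity(0))
     except AttributeError:
         # sched_getaffinity is unavailable on some platforms (e.g. macOS).
         return os.cpu_count() or default


 print(f"Scheduling work across {usable_cores()} cores")
 ```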

[^2]: dask.dataframe.from_pandas() function

 ```python
 def from_pandas(
     data: pd.DataFrame | pd.Series,
     npartitions: int | None = None,
     chunksize: int | None = None,
     sort: bool = True,
     name: str | None = None,
 ) -> DataFrame | Series:
     """
     Construct a Dask DataFrame from a Pandas DataFrame

     This splits an in-memory Pandas dataframe into several parts and constructs
     a dask.dataframe from those parts on which Dask.dataframe can operate in
     parallel.  By default, the input dataframe will be sorted by the index to
     produce cleanly-divided partitions (with known divisions).  To preserve the
     input ordering, make sure the input index is monotonically-increasing. The
     ``sort=False`` option will also avoid reordering, but will not result in
     known divisions.

     Note that, despite parallelism, Dask.dataframe may not always be faster
     than Pandas.  We recommend that you stay with Pandas for as long as
     possible before switching to Dask.dataframe.

     Parameters
     ----------
     npartitions : int, optional
         The number of partitions of the index to create. Note that if there
         are duplicate values or insufficient elements in ``data.index``, the
         output may have fewer partitions than requested.
     chunksize : int, optional
         The desired number of rows per index partition to use. Note that
         depending on the size and index of the dataframe, actual partition
         sizes may vary.
     """
 ```
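 A brief usage sketch tying both footnotes together, assuming an in-memory pandas DataFrame and partitioning by the available core count (data and figures are hypothetical, for illustration only):

 ```python
 import os

 import dask.dataframe as dd
 import pandas as pd

 # Hypothetical in-memory frame standing in for a real workload.
 df = pd.DataFrame({"id": range(1_000_000), "value": 42.0})

 # Partition by the number of available cores, so Dask can operate
 # on all partitions in parallel (assumption: CPU-bound processing).
 ddf = dd.from_pandas(df, npartitions=os.cpu_count() or 4)
 print(ddf.npartitions)
 ```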