danielballan opened this issue 3 months ago
As @skarakuzu noted, we can hard-code that tabular datasets stored in SQL always have 1 partition. Partitioning has value for file-backed storage but not for database-backed storage.
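For instance (a minimal sketch, assuming `TableStructure.from_pandas` in `tiled.structures.table`; the only point is that `npartitions` stays fixed at 1):

```python
import pandas as pd
from tiled.structures.table import TableStructure

df = pd.DataFrame({"motor": [1.0, 2.0], "detector": [10.5, 11.2]})
structure = TableStructure.from_pandas(df)
# For SQL-backed storage we would hard-code this, rather than deriving
# a partition count from files on disk.
assert structure.npartitions == 1
```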
Notes from conversation:

- Skip the `generate_data_sources` method, as that is used in a file registration use case (`tiled register ...` or `tiled serve directory ...`) which is not applicable here.
- `init_storage` should return a list with a single `Asset` with just a `data_uri` that comes from the `init_storage` argument.

Notes for later:

- `DataSource`
  - mimetype: `application/x-sql-table`
  - parameters: `{"table_name": "schema_hash_0fxdkfjsd...", "_id": "..."}`
- `Assets`
  - data_uri: `postgresql://...` or `sqlite://...`
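In code, the `init_storage` note might translate to something like this (a sketch assuming Tiled's `Asset` dataclass in `tiled.structures.data_source`; exact signatures may differ):

```python
from tiled.structures.data_source import Asset

def init_storage(data_uri, structure):
    # data_uri is the writable database, e.g. "postgresql://..." or
    # "sqlite://...", supplied as the init_storage argument.
    # No files are created here; the database itself is the one Asset.
    return [Asset(data_uri=data_uri, is_directory=False, parameter="data_uri")]
```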
We recently added support for appendable tabular data storage, in the `CSVAdapter`. This was done as a prototype in support of the flyscanning project. It is fundamentally not a sound approach, because it relies on an ordinary file on disk as the appendable data store. If a client attempts to read while another client is writing, it is possible for the read to see the file in an inconsistent state (e.g. with a partially-written row). If a client attempts to write while another client is writing (particularly on networked storage), file corruption can result.

Fundamentally, given these two requirements:

1. appendable tabular data, and
2. safe concurrent reading and writing,

we inevitably need a proper database. (We could maybe get by with fancy file-locking logic, but that is tantamount to inventing your own database.)
Therefore, I think we need to remove append support from `CSVAdapter`, as it is not robust, and add in its place a new adapter that is backed by a SQL database. This would be a separate SQL database from the others we already have:

- `tiled.authn_database`, which holds (hashed) API keys and other authentication state
- `tiled.catalog` (there can be multiple catalog databases for a given Tiled server, specified in the config file)

Currently, when data is written to Tiled it is always written into ordinary files, and the path for writing is configured thus:
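Roughly, in the server config (abbreviated; `writable_storage` is the key that points at the writable directory):

```yaml
trees:
  - path: /
    tree: catalog
    args:
      uri: sqlite:///catalog.db
      writable_storage: /path/to/writable/directory  # new files are placed here
```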
We will need to extend the configuration to provide not only a writable portion of the filesystem for placing files but a writable database as well, where this new Adapter can create, read, and append to tables.
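For example, the configuration might grow to accept a database URI alongside the directory (a hypothetical shape; whether this is a list under the same key or a new key is an open design question):

```yaml
trees:
  - path: /
    tree: catalog
    args:
      uri: sqlite:///catalog.db
      writable_storage:
        - /path/to/writable/directory       # ordinary files, as before
        - postgresql://localhost/tiled_data  # hypothetical: writable SQL storage
```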
The work can begin by defining a self-contained Adapter class and testing it. Then integration with Tiled, including this configuration file, can follow. The Adapter will look like:
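A rough skeleton (the constructor arguments here are assumptions, not a settled design):

```python
import pandas as pd

class SQLAdapter:
    """Tabular adapter backed by a SQL database (illustrative skeleton)."""

    structure_family = "table"

    def __init__(self, data_uri, structure, table_name, dataset_id):
        # data_uri: "postgresql://..." or "sqlite://..."
        # table_name, dataset_id: see the DataSource parameters above
        self._data_uri = data_uri
        self._structure = structure  # always reports npartitions == 1
        self._table_name = table_name
        self._dataset_id = dataset_id

    def structure(self):
        return self._structure

    def read(self, fields=None) -> pd.DataFrame:
        ...  # SELECT the rows belonging to this dataset

    def read_partition(self, partition, fields=None) -> pd.DataFrame:
        ...  # partition is always 0

    def write(self, data: pd.DataFrame) -> None:
        ...  # INSERT, replacing any existing rows for this dataset

    def write_partition(self, data: pd.DataFrame, partition: int) -> None:
        ...

    def append(self, data: pd.DataFrame) -> None:
        ...  # INSERT new rows for this dataset
```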
The job of this class is to present the same interface (methods and attributes) as the other tabular adapters in Tiled, as defined by https://github.com/bluesky/tiled/blob/7f7329de1b4ab39f502075656102585cdcc35f7c/tiled/adapters/protocols.py#L104-L127
It will return pandas DataFrames in `read()` and `read_partition(...)`. It will consume pandas DataFrames in `write(...)`, `write_partition(...)`, and `append(...)` (or whatever). Internally, it will use SQL queries to fetch and write data.

SQLAlchemy could be used for this. It is "Pythonic" and it supports the SQL backends we care about (SQLite for small deployments and PostgreSQL for scaled deployments). However, SQLAlchemy operates row-wise on the data: we would need to decompose the DataFrame from memory-efficient columnar structures into Python tuples, one per row. This is not efficient. ADBC enables us to operate directly on pyarrow objects, which we can translate to and from pandas DataFrame objects without memory copies.

ADBC does not currently support variable-length lists, which is a rare but important case that we need to cover. It seems that support should be possible and may be added upstream soon. We can proceed to use ADBC for most data and (1) hope that support for `LIST` is added in time or (2) fall back to using SQLAlchemy for this edge case if it is not ready in time.
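To make the ADBC point concrete, a rough sketch (assuming the `adbc-driver-sqlite` package; `adbc_ingest` and `fetch_arrow_table` are the DBAPI extensions that keep the data columnar):

```python
import pandas as pd
import pyarrow
from adbc_driver_sqlite import dbapi

df = pd.DataFrame({"motor": [1.0, 2.0], "detector": [10.5, 11.2]})

with dbapi.connect(":memory:") as conn:
    with conn.cursor() as cur:
        # Ingest the Arrow table directly -- no row-by-row Python tuples.
        cur.adbc_ingest("my_table", pyarrow.Table.from_pandas(df), mode="create")
        conn.commit()
        cur.execute("SELECT * FROM my_table")
        result = cur.fetch_arrow_table().to_pandas()  # columnar all the way back
```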
With either ADBC or SQLAlchemy, we have to decide how to organize the data in tables. The option labeled (3) here is to create one SQL table per unique schema (e.g. `motor FLOAT, detector FLOAT, sample_position INT`), with the table name derived from a hash of the schema, as in the `schema_hash_...` parameter above.

Let us evaluate the trade-offs: a user may declare a column as `FLOAT` and later realize that it should be `INT`, or vice versa. Under this system, a given column can only have one data type, and so there is no clean way to recover from this.

If we go with (3), this is how it might work:
There should be a specially-named column that identifies which logical dataset in Tiled a given row belongs to, so that when data is read we can do (in ADBC or SQLAlchemy) `SELECT * FROM {hash_table_name} WHERE special_column={dataset_id}`.
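Sketched with ADBC (the `special_column` name and `schema_hash_` prefix are placeholders from the notes above, and the hashing scheme is illustrative only):

```python
import hashlib
import pandas as pd
import pyarrow
from adbc_driver_sqlite import dbapi

def table_name_for(df: pd.DataFrame) -> str:
    # Hash the schema (column names + dtypes) so that all datasets
    # sharing a schema land in the same table.
    schema = ",".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    return "schema_hash_" + hashlib.md5(schema.encode()).hexdigest()

def append(conn, df: pd.DataFrame, dataset_id: str) -> None:
    # Tag each row with the logical dataset it belongs to.
    tagged = df.assign(special_column=dataset_id)
    with conn.cursor() as cur:
        cur.adbc_ingest(table_name_for(df), pyarrow.Table.from_pandas(tagged),
                        mode="create_append")
    conn.commit()

def read(conn, df_schema: pd.DataFrame, dataset_id: str) -> pd.DataFrame:
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {table_name_for(df_schema)} WHERE special_column = ?",
            (dataset_id,),
        )
        return cur.fetch_arrow_table().to_pandas()
```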
In the example of a Bluesky experiment, the data flow is:
Bluesky RunEngine -> Tiled client -> Tiled server -> SQLAdapter -> SQL
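For completeness, a hypothetical client-side view of that flow (assuming Tiled's `write_dataframe` client method; an append-style client method is part of what this issue proposes, so the name below is a guess):

```python
import pandas as pd
from tiled.client import from_uri

client = from_uri("http://localhost:8000", api_key="secret")

# First batch of rows from the RunEngine creates the logical dataset...
node = client.write_dataframe(
    pd.DataFrame({"motor": [1.0], "detector": [10.5]}), key="fly_scan"
)
# ...and subsequent batches append rows to the same SQL table on the server.
node.append_partition(pd.DataFrame({"motor": [2.0], "detector": [11.2]}), 0)
```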