arunramachandran15 / duckdb-sheets


Requirement clarifications #1

Open arunramachandran15 opened 2 months ago

arunramachandran15 commented 2 months ago

Initial problem stated by Arun:

  1. Implement a Google Sheets-like frontend interface using Svelte.
  2. Send real-time updates from the sheets through a data loader to DuckDB.
  3. Receive the real-time updates using a web server and socket implementation.
  4. Another user using the sheets should be able to see the updates in real time, and so should anybody using DuckDB for their own processing.
  5. Using Docker and Kubernetes, implement DuckDB as a StatefulSet in K8s (for persistent storage of the DB file).

UI / APIs

  1. List-all-sheets interface (integrate with the GET /sheets API).
  2. Sheet interface (integrate with the GET /sheets/sheet-id API) to fetch the initial sheet data and reconcile it with real-time updates (see the sketch below).
  3. WebSocket client to send and receive real-time updates from the server.
  4. Mount the DuckDB file as a persistent volume in K8s.
  5. Since DuckDB is OLAP with columnar storage, I am thinking of using MySQL for transactional data storage. I need to do a bit more research on this.
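A minimal sketch of the reconciliation idea above, assuming a hypothetical version number on both the snapshot payload and each update message (this field is not part of any agreed API; the grid helpers are placeholders):

```python
# Sketch: reconcile the initial GET /sheets/sheet-id payload with real-time
# updates. Updates that arrive while the snapshot is still loading are
# buffered, then replayed only if they are newer than the snapshot.

buffered = []            # updates received before the snapshot landed
snapshot_version = None  # version of the last full snapshot, if any

def apply_to_grid(msg):      # placeholder for the real grid update
    print("apply", msg)

def render_grid(cells):      # placeholder for the real initial render
    print("render", cells)

def on_update(msg):
    """WebSocket handler for incremental updates."""
    if snapshot_version is None:
        buffered.append(msg)                 # snapshot not here yet: buffer
    elif msg["version"] > snapshot_version:
        apply_to_grid(msg)

def on_snapshot(payload):
    """Handler for the initial GET /sheets/sheet-id response."""
    global snapshot_version
    snapshot_version = payload["version"]
    render_grid(payload["cells"])
    for msg in buffered:                     # replay only newer updates
        if msg["version"] > snapshot_version:
            apply_to_grid(msg)
    buffered.clear()
```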

Email comment by Vish

Your problem statement is correct. One point to add: you have "N" users connected to the same DuckDB database file that resides in a K8s pod, and any user can read and write to that DuckDB file. Note that DuckDB is a single-process, memory-mapped, columnar database ==> this means you need a singleton DuckDB manager to do all reads and writes to the DuckDB tables.

Clarification on APIs: you would have to use WebSockets, which means you don't have the GET, PUT, POST, PATCH, DELETE REST API verbs; all traffic goes across the WebSocket.
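For illustration, a hedged sketch of what the WebSocket traffic could look like once the REST verbs are gone: every operation becomes a typed message. The op names and fields below are assumptions, not an agreed protocol:

```python
# Sketch: each former REST call becomes a JSON message sent on the socket.
import json

list_sheets = json.dumps({"op": "list_sheets"})
get_sheet   = json.dumps({"op": "get_sheet", "sheet_id": "abc123"})
update_cell = json.dumps({"op": "update_cell", "sheet_id": "abc123",
                          "row_idx": 4, "col_idx": 2, "value": "42"})
delete_row  = json.dumps({"op": "delete_row", "sheet_id": "abc123", "row_idx": 4})
```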

TSQL:

If you did consider PostgreSQL, which is multi-user --> but we are only using DuckDB for this experiment. Don't think about MySQL; think PostgreSQL. DuckDB is a PostgreSQL-compliant database, and PostgreSQL has a DuckDB connector ==> https://github.com/duckdb/pg_duckdb
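As a side note, a minimal example of talking to DuckDB from Python with its PostgreSQL-flavoured SQL; the "sheets.duckdb" file name and the cells table are made up for illustration:

```python
# Sketch: one read-write connection, PostgreSQL-style DDL/DML.
import duckdb

con = duckdb.connect("sheets.duckdb")   # single read-write connection
con.execute("""
    CREATE TABLE IF NOT EXISTS cells (
        sheet_id VARCHAR, row_idx INTEGER, col_idx INTEGER, value VARCHAR,
        PRIMARY KEY (sheet_id, row_idx, col_idx)
    )
""")
con.execute(
    "INSERT OR REPLACE INTO cells VALUES (?, ?, ?, ?)",
    ["abc123", 4, 2, "42"],
)
print(con.execute("SELECT * FROM cells WHERE sheet_id = ?", ["abc123"]).fetchall())
```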

arunramachandran15 commented 2 months ago

I have a doubt here.

By "n users" do you mean only the "n Svelte application users accessing the sheets interface"? Or also different users accessing from a Python script, who can directly access the DuckDB file for data analysis?

vcanaran commented 1 month ago

Can you give me access to your README and design docs?

vcanaran commented 1 month ago

> I have a doubt here.
>
> By "n users" do you mean only the "n Svelte application users accessing the sheets interface"? Or also different users accessing from a Python script, who can directly access the DuckDB file for data analysis?


  1. You have one DuckDB file on a Kubernetes pod.
  2. The server is Python. Let's not call it a Python script; it is Python code written as a server.
  3. When we say "n" users, we mean there are many (n) client web applications connecting via WebSockets to the same pod.
    • So many users are interacting with the same DuckDB file.
    • A user on the browser can view the DuckDB file in a spreadsheet interface (data grid).
    • A user on the browser can update the DuckDB file via the spreadsheet interface (data grid).

      Key point: DuckDB is a single-process database.

    • So only a single DuckDB Python manager can interact with the DuckDB file, i.e. a single READ-WRITE connection.

      Possible architecture: the Svelte JavaScript application connects via WebSockets.

    • Use the Tornado Python library to tunnel through the proxy to the Kubernetes pod (sure, you can use Node.js with SvelteKit, in which case you would use socket.io instead of Tornado).
    • The Svelte app is a data table that lists a set of DuckDB files. You tap on a DuckDB file and you see the tables of that DuckDB file as tabs in a spreadsheet. You tap on any tab and you see a data grid for one of the tables in the DuckDB database.
    • Each pod is dedicated to a single DuckDB file; a different DuckDB file means you connect to a different pod.

      The Tornado library tunnels you to the right pod. On that pod there is a WebSocket client manager that listens to the WebSockets and drops messages into a queue (yes, you can use ValKey, which is just an open-source implementation of Redis).

    • The DuckDB database manager listens to the queue and processes messages in the queue one at a time (a singleton to manage DuckDB; see the sketch below).
    • Only the cells that change are sent back on the return WebSocket connection to all users who are listening for changes to that specific dataset (DuckDB file), so it makes sense to have a broadcast socket connection for all users listening to the same DuckDB file.

      Best to create a mermaid sequence diagram which I can review and say go....

      Thanks Arun
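A rough sketch of the singleton-manager idea, assuming an in-process queue for brevity where the real design would use ValKey/Redis, and reusing the made-up cells table from earlier:

```python
# Sketch: one thread holds the only read-write connection and drains a
# queue, so all writes to the DuckDB file are serialized.
import json
import queue
import threading
import duckdb

work_queue: "queue.Queue[str]" = queue.Queue()

def duckdb_manager(db_path: str) -> None:
    con = duckdb.connect(db_path)             # the single READ-WRITE connection
    con.execute("""
        CREATE TABLE IF NOT EXISTS cells (
            sheet_id VARCHAR, row_idx INTEGER, col_idx INTEGER, value VARCHAR,
            PRIMARY KEY (sheet_id, row_idx, col_idx)
        )
    """)
    while True:
        msg = json.loads(work_queue.get())    # one message at a time
        if msg["op"] == "update_cell":
            con.execute(
                "INSERT OR REPLACE INTO cells VALUES (?, ?, ?, ?)",
                [msg["sheet_id"], msg["row_idx"], msg["col_idx"], msg["value"]],
            )
        work_queue.task_done()

threading.Thread(target=duckdb_manager, args=("sheets.duckdb",), daemon=True).start()
work_queue.put(json.dumps({"op": "update_cell", "sheet_id": "abc123",
                           "row_idx": 4, "col_idx": 2, "value": "42"}))
work_queue.join()                             # wait for the write to be applied
```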

vcanaran commented 1 month ago

If you want, you can use SvelteKit and Node.js for the socket server, using socket.io.

Client Svelte (WebSocket) ----> Server SvelteKit Node.js (socket.io)

SERVER SIDE: Server SvelteKit Node.js (socket.io) ----> ValKey (open-source Redis) queue and cache for UI data tables

Python DuckDB Manager (listens to the queue) ----> single connection ----> DuckDB file for the pod (persistent volume on the K8s node)

DuckDB ----> Python DuckDB Manager (knows your changes)

Python DuckDB Manager ----> updates ----> Redis cache for tables with changes

Redis cache (via JavaScript) ----> sends changes to the client via a broadcast socket for users listening to the same DuckDB file (socket room) ----> socket.io ----> Svelte app (WebSocket)


Note: many users have their own socket connection to submit changes; all users listen to the same broadcast socket from the server, which sends back changes to all users connected to the same DuckDB file.


I think this is your sequence design in text format.
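To make the broadcast leg concrete, a hedged sketch using redis-py pub/sub: the DuckDB manager publishes only the changed cells to a per-file channel, and the socket server fans them out to its room. The channel naming and the broadcast_to_room helper are assumptions:

```python
# Sketch: per-duckdb-file change channel, fanned out to a socket room.
import json
import redis

r = redis.Redis()

def broadcast_to_room(db_file, data):    # placeholder for socket.io room emit
    print("broadcast", db_file, data)

def publish_changes(db_file: str, changed_cells: list) -> None:
    """Called by the DuckDB manager after a write commits."""
    r.publish(f"changes:{db_file}", json.dumps(changed_cells))

def listen_and_broadcast(db_file: str) -> None:
    """Run by the socket server: forward channel messages to all clients."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"changes:{db_file}")
    for message in pubsub.listen():
        if message["type"] == "message":
            broadcast_to_room(db_file, message["data"])
```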

arunramachandran15 commented 1 month ago
[rough design diagram image]

Please check the rough design diagram, @vcanaran. I am skipping the Redis part for simplicity and going with just the Python solution instead of using a Node.js wrapper.

arunramachandran15 commented 1 month ago

After reviewing the above design, I feel the Redis queue part is mandatory to ensure reads/writes are serialized. So I am updating the architecture to the one below.

[updated architecture diagram image]

vcanaran commented 1 month ago

Correct. You need a pub/sub to serialize the macros/actions/work into the singleton DuckDB manager, which is the only process servicing the DuckDB database file:

  1. User 1 updates a cell via user 1's websocket::command channel.
  2. User 2 then asks to read the next page of data via user 2's ws::command channel.

For now they are serial; however, if you implement a mutex you could serialize updates but interleave reads, since reads "could" happen concurrently via the same read-write connection.
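A sketch of that mutex idea, assuming DuckDB's per-thread cursors on the same connection are safe for concurrent reads (an assumption worth verifying against DuckDB's threading guarantees), again using the made-up cells table:

```python
# Sketch: writes take an exclusive lock; reads run on per-thread cursors.
import threading
import duckdb

con = duckdb.connect("sheets.duckdb")    # the single read-write connection
con.execute("""
    CREATE TABLE IF NOT EXISTS cells (
        sheet_id VARCHAR, row_idx INTEGER, col_idx INTEGER, value VARCHAR,
        PRIMARY KEY (sheet_id, row_idx, col_idx)
    )
""")
write_lock = threading.Lock()

def write_cell(sheet_id: str, row_idx: int, col_idx: int, value: str) -> None:
    with write_lock:                      # updates stay serialized
        con.execute(
            "INSERT OR REPLACE INTO cells VALUES (?, ?, ?, ?)",
            [sheet_id, row_idx, col_idx, value],
        )

def read_page(sheet_id: str, offset: int, limit: int):
    cur = con.cursor()                    # per-thread cursor; reads can interleave
    return cur.execute(
        "SELECT row_idx, col_idx, value FROM cells WHERE sheet_id = ? "
        "ORDER BY row_idx, col_idx LIMIT ? OFFSET ?",
        [sheet_id, limit, offset],
    ).fetchall()
```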