databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

add support for Fabric Onelake #13421

Closed djouallah closed 1 year ago

djouallah commented 1 year ago

Trying to write to OneLake storage, I get errors:

!pip install databend                    > /dev/null 2>&1
!pip install duckdb                      > /dev/null 2>&1
import duckdb
import pathlib
import os

os.environ["DATABEND_DATA_PATH"] = "/lakehouse/default/Files"

# Generate TPC-H data with DuckDB and write it out as parquet files.
sf = 1
for x in range(0, sf):
    con = duckdb.connect()
    con.sql('PRAGMA disable_progress_bar; SET preserve_insertion_order=false')
    con.sql(f"CALL dbgen(sf={sf}, children={sf}, step={x})")
    for tbl in ['nation', 'region', 'customer', 'supplier', 'lineitem', 'orders', 'partsupp', 'part']:
        pathlib.Path(f'{sf}/{tbl}').mkdir(parents=True, exist_ok=True)
        con.sql(f"COPY (SELECT * FROM {tbl}) TO '{sf}/{tbl}/{x:02d}.parquet' ")
    con.close()

# Load the parquet files into Databend tables.
from databend import SessionContext
ctx_bend = SessionContext("TPCH")
pwd = os.getcwd()
for tbl in ['nation', 'region', 'customer', 'supplier', 'lineitem', 'orders', 'partsupp', 'part']:
    ctx_bend.sql(f"""drop table IF EXISTS {tbl}""").collect()
    print(f" create {tbl}")
    ctx_bend.sql(f"""CREATE TABLE IF NOT EXISTS {tbl} AS SELECT * FROM 'fs://{pwd}/{sf}/{tbl}/' (pattern => '.*.parquet')""").collect()
    ctx_bend.sql(f"""analyze table {tbl}""").collect()
    print(ctx_bend.sql(f"""select count(1) from {tbl}""").collect())
BohuTANG commented 1 year ago
!pip install databend                    > /dev/null 2>&1
import os
os.environ["DATABEND_DATA_PATH"] = "/lakehouse/default/Files"

from databend import SessionContext
ctx_bend = SessionContext("TPCH")
pwd = os.getcwd()
ctx_bend.sql(f"""create table if not exists t1(a int)""").collect()
ctx_bend.sql(f"""insert into t1 values(1)""").collect()
print(ctx_bend.sql(f"""select * from t1""").collect())

Error:

thread '<unnamed>' panicked at src/meta/sled-store/src/db.rs:47:34:
open global sled::Db: Io(NotFound, "io error")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[8], line 9
      5 import os
      6 os.environ["DATABEND_DATA_PATH"] = "/lakehouse/default/Files"
----> 9 from databend import SessionContext
     10 ctx_bend = SessionContext("TPCH")
     11 pwd = os.getcwd()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/databend/__init__.py:1
----> 1 from .databend import *
      3 __doc__ = databend.__doc__
      4 if hasattr(databend, "__all__"):

PanicException: open global sled::Db: Io(NotFound, "io error")

Does the Databend Python binding not support writes? @sundy-li

djouallah commented 1 year ago

The Python binding does write, but here I am trying to write to remote storage; the Fabric notebook uses blobfuse.
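Since the panic comes from sled failing to open its files under the fuse mount, one cheap guard is to verify the mount is actually present before pointing DATABEND_DATA_PATH at it. A hypothetical sketch — `set_data_path` is not part of Databend:

```python
import os

# Hypothetical guard (not a Databend API): fail fast if the blobfuse mount
# is missing, instead of letting sled panic with "io error" later.
def set_data_path(path: str) -> bool:
    if os.path.isdir(path):
        os.environ["DATABEND_DATA_PATH"] = path
        return True
    return False

# On Fabric, the lakehouse files are fuse-mounted at /lakehouse/default/Files.
```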

BohuTANG commented 1 year ago

The latest version Databend now works in Fabric with the code:

!pip install databend                    > /dev/null 2>&1
import os
os.environ["DATABEND_DATA_PATH"] = "/lakehouse/default/Files"

from databend import SessionContext
ctx_bend = SessionContext("TPCH")
pwd = os.getcwd()
ctx_bend.sql(f"""create table if not exists t1(a int)""").collect()
ctx_bend.sql(f"""insert into t1 values(1)""").collect()
print(ctx_bend.sql(f"""select * from t1""").collect())

Does this address this issue?

djouallah commented 1 year ago

this is perfect !!!

djouallah commented 1 year ago

@BohuTANG it seems os.environ["CACHE_DATA_CACHE_STORAGE"] = "disk" does not work, I thought disk cache should be supported for native storage ?

BohuTANG commented 1 year ago

The disk cache should live on local disk.
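One way to read that advice — a sketch under my own assumptions, not documented Databend behavior: keep DATABEND_DATA_PATH on the fuse mount, but make sure the working directory, and therefore the default cache location ./.databend/_cache seen later in this thread, sits on genuinely local storage. Otherwise the "disk" cache is itself remote and buys nothing.

```python
import os

# Assumption: the cache directory is resolved relative to the process
# working directory, so a cwd under the fuse mount makes the cache remote.
def cache_dir_is_local(workdir: str, fuse_mount: str = "/lakehouse") -> bool:
    cache_dir = os.path.join(workdir, ".databend", "_cache")
    return not cache_dir.startswith(fuse_mount)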

djouallah commented 1 year ago

Yes, but it does not seem to be working? A second run doesn't make much difference.
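A quick way to quantify "doesn't make much difference" is to time the same query on a cold and a warm run. A generic timing helper, not Databend-specific, with `run_query` standing in for `ctx_bend.sql(...).collect()`:

```python
import time

# Generic helper: returns elapsed wall-clock seconds for a single call.
def timed(run_query) -> float:
    t0 = time.perf_counter()
    run_query()
    return time.perf_counter() - t0

# Usage idea (assuming ctx_bend from the snippets above):
#   cold = timed(lambda: ctx_bend.sql("select count(1) from t1").collect())
#   warm = timed(lambda: ctx_bend.sql("select count(1) from t1").collect())
# With a working disk cache, the warm run should be clearly faster.
```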

BohuTANG commented 1 year ago

I think we do not export the cache ENVs for the Python binding. cc @sundy-li

BohuTANG commented 1 year ago

This works for me:

!pip install databend                    > /dev/null 2>&1

import os
from os import listdir
os.environ["DATABEND_DATA_PATH"] = "/lakehouse/default/"
os.environ["CACHE_DATA_CACHE_STORAGE"] = "disk"

from databend import SessionContext
ctx_bend = SessionContext("TPCH")
pwd = os.getcwd()
ctx_bend.sql(f"""create table if not exists t1(a int)""").collect()
ctx_bend.sql(f"""insert into t1 values(1)""").collect()
print(ctx_bend.sql(f"""select * from t1""").collect())
print(ctx_bend.sql(f"""select * from system.configs where name like '%cache%'""").collect())
listdir('./.databend/_cache')

Restart the Fabric session and run:

[screenshot of the output]
BohuTANG commented 1 year ago

@BohuTANG it seems os.environ["CACHE_DATA_CACHE_STORAGE"] = "disk" does not work, I thought disk cache should be supported for native storage ?

The local disk cache now works, but only for FUSE native tables.
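Given that, a simple way to check whether the cache is engaging for a given workload is to inspect the cache directory (./.databend/_cache in the snippets above) before and after a query. A hedged helper, not a Databend API:

```python
import os

# Hedged sketch: list the cache directory's entries. An empty result after
# running a query suggests the workload bypassed the disk cache (e.g. when
# reading external parquet via fs:// rather than a FUSE native table).
def cache_entries(cache_dir: str = "./.databend/_cache") -> list:
    if not os.path.isdir(cache_dir):
        return []
    return sorted(os.listdir(cache_dir))
```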