dbt-sqlite

A SQLite adapter plugin for dbt (data build tool)

Licensed under the Apache License 2.0.

Please read these docs carefully and use at your own risk. Issues and PRs welcome!

The Use Case

SQLite is an embedded SQL database. It comes included with most Python distributions and requires no installation or configuration, which can make it a good fit for certain dbt projects.

SQLite can be surprisingly fast, despite the query optimizer not being as sophisticated as other databases and data warehouse platforms. Tip: materialize your models as tables and create indexes in post-hooks to speed up filtering and joins.
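For example, a model that materializes as a table and builds an index in a post-hook might look like the following sketch (the model, column, and ref names are illustrative, not part of this repo):

```sql
-- models/customer_orders.sql  (hypothetical model)
{{
    config(
        materialized='table',
        post_hook=[
            -- {{ this }} resolves to the relation just built
            "create index if not exists idx_customer_orders_customer on {{ this }} (customer_id)"
        ]
    )
}}

select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id
```

Because the index is created in a post-hook, it is rebuilt every time the table is, so downstream models that filter or join on `customer_id` benefit without any manual steps.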

How to Use This

Use the right version. Starting with the release of dbt-core 1.0.0, versions of dbt-sqlite are aligned to the same major+minor version of dbt-core.

Install this package:

# run this to install the latest version
pip install dbt-sqlite

# OR run this to install a specific version
pip install dbt-sqlite==1.0.0

Create an entry in your ~/.dbt/profiles.yml file with the following configuration:

dbt_sqlite:

  target: dev
  outputs:
    dev:
      type: sqlite

      # sqlite locks the whole db on writes so anything > 1 won't help
      threads: 1

      # value is arbitrary
      database: "database"

      # value of 'schema' must be defined in schemas_and_paths below. in most
      # cases, this should be 'main'
      schema: 'main'

      # connect schemas to paths: at least one of these must be 'main'
      schemas_and_paths:
        main: '/my_project/data/etl.db'
        dataset: '/my_project/data/dataset_v1.db'

      # directory where all *.db files are attached as schemas, using the base
      # filename as the schema name, and where new schemas are created. this can
      # overlap with the directories of files in schemas_and_paths as long as
      # there are no conflicts.
      schema_directory: '/my_project/data'

      # optional: list of file paths of SQLite extensions to load. see README for more details.
      extensions:
        - "/path/to/sqlean/crypto.so"
        - "/path/to/sqlean/math.so"
        - "/path/to/sqlean/text.so"

Set profile: 'dbt_sqlite' in your project's dbt_project.yml file.
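In dbt_project.yml that looks like the following (the project name is illustrative):

```yaml
# dbt_project.yml
name: 'my_project'
profile: 'dbt_sqlite'  # must match the profile name in ~/.dbt/profiles.yml
```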

Notes

SQLite Extensions

Certain functionality depends on SQL functions that stock SQLite lacks; the SQLean modules referenced in the profile example above (crypto, math, text) provide them.

Precompiled binaries are available for download from the SQLean GitHub repository page. You can also compile them yourself if you want. Note that some modules depend on other libraries (math, for example, depends on GLIBC); if an extension fails to load, you may want to try building it yourself.

Point to these module files in your profile config as shown in the example above.

macOS seems to ship with SQLite libraries that have extension-loading support compiled out, so this won't work "out of the box." Accordingly, snapshots won't work. If you need snapshot functionality, you'll need to compile SQLite/Python with extension-loading support, or find a Python distribution for macOS that includes it.
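One quick way to check whether your Python build can load SQLite extensions at all, before debugging dbt itself: CPython omits the `enable_load_extension` method from `sqlite3.Connection` entirely when the interpreter was built without extension support, so its presence is a reasonable proxy. A minimal sketch:

```python
import sqlite3


def supports_extension_loading() -> bool:
    """Return True if this Python's sqlite3 module can load run-time extensions.

    CPython leaves Connection.enable_load_extension undefined when it was
    compiled with --disable-loadable-sqlite-extensions (common for the
    system Python shipped on macOS).
    """
    return hasattr(sqlite3.Connection, "enable_load_extension")


if __name__ == "__main__":
    print(supports_extension_loading())
```

If this prints False, pointing the profile's `extensions` list at SQLean modules will fail regardless of how those modules were built.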

Development Notes / TODOs

...

Publishing a release to PyPI

Because I forget...

# assumes ~/.pypirc is already set up

workon dbt-sqlite-devel

vi dbt/adapters/sqlite/__version__.py # update version
vi setup.py # update dbt-core dependency if appropriate

# start clean
rm -rf dist/ build/ *.egg-info

# make sure tools are up to date
python -m pip install --upgrade setuptools wheel twine

# build
python setup.py sdist bdist_wheel

# upload to PyPI
python -m twine upload dist/*

git commit
git tag vXXX
git push --tags

# go to github and "Draft a new release"

Running Tests

This runs the test suite and cleans up after itself:

./run_tests_docker.sh

To run tests interactively and be able to examine test artifacts:

docker build . -t dbt-sqlite

docker run --rm -it dbt-sqlite bash

# see output for the locations of artifacts
run_tests.sh -s

Credits

Inspired by this initial work by stephen1000: https://github.com/stephen1000/dbt_sqlite

https://github.com/jwills/dbt-duckdb/ - useful for ideas on working with another embedded database

https://github.com/fishtown-analytics/dbt-spark/ - spark also has two-part relation names (no 'database')