evidence-dev / evidence

Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
https://evidence.dev
MIT License

Auto Materialization of source tables #1692

Closed archiewood closed 1 month ago

archiewood commented 7 months ago

Feature Description

Ability to pull tables directly from the source, more concisely, for SQL-based databases

connection.yaml

```yaml
name: prod
type: snowflake
options:
  ...

tables:
  - orders
  - reviews
```

or even

```yaml
name: prod
type: snowflake
options:
  ...

tables:
  - "*" # quoted, since a bare * is not valid YAML
```

Goal of Feature

Current Solution / Workarounds

sources/orders.sql

```sql
select * from orders
```

sources/reviews.sql

```sql
select * from reviews
```

Considerations

andrejohansson commented 7 months ago

Should likely limit the content too, or this has the potential to blow up on huge tables, or cost a lot. E.g., imagine the cost of running this on a BigQuery database with billions of rows. The queries could be quite expensive.

If you use DataGrip or another editor, they usually limit results to some fixed size (e.g. 500 rows) automatically behind the scenes.

You could have a default limit that is overridable:

```yaml
tables:
  - orders:
      limit: 500
  - reviews:
      limit: 0 # unlimited rows
```
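If a per-table limit like this were supported, it could plausibly translate into a trailing `LIMIT` clause on the generated source query. A sketch, assuming the feature simply generates one `select *` per listed table (the exact generated SQL is an assumption, not an agreed design):

```sql
-- hypothetical query generated for `orders` with limit: 500
select *
from orders
limit 500
```
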
archiewood commented 7 months ago

Agree that this could be dangerous and we should think about how not to destroy people's DBs

However, automatic (unspecified) limits can also be confusing and lead to errors in analysis

(Easy to miss a dropped row if there were 505 rows in the table, and you excluded the last 5)

An alternative would be to refuse to run tables above x rows (e.g. x = 100,000) unless the user specifies limit=0 (or rows=all or similar), and throw a warning in this case?
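Such a guard might amount to a pre-flight row count compared against the threshold before materializing. A sketch only; the threshold and the refuse/warn behaviour here are illustrative:

```sql
-- hypothetical pre-flight check before materializing `orders`:
-- if this returns true, refuse to materialize unless the user
-- has explicitly opted in (e.g. limit: 0), and emit a warning
select count(*) > 100000 as exceeds_threshold
from orders
```
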

archiewood commented 7 months ago

Although, now that I think about it, that would require a count(*) operation on the table, which (depending on the DBMS) is sometimes an expensive operation
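Depending on the warehouse, an approximate row count is often available from metadata without scanning the table. In Snowflake, for example, `information_schema.tables` exposes a `row_count` column (other databases have similar but non-portable equivalents, so this would need per-connector handling):

```sql
-- Snowflake: metadata-based row count, no table scan required
select row_count
from information_schema.tables
where table_schema = current_schema()
  and table_name = 'ORDERS'
```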