CivicSpleen / ambry

A comprehensive data package manager
BSD 2-Clause "Simplified" License

Add SQL support to Sources #140

Closed ericbusboom closed 8 years ago

ericbusboom commented 8 years ago

This issue depends on #139.

Add a new file to the bundle, using the same ambry.bundles.file facility as the bundle.py, bundle.yaml, etc. files, but storing SQL. Name it 'bundle.sql'.

This file will store a set of SQL statements, which may include CREATE VIEW, CREATE MATERIALIZED VIEW, and INDEX statements.

The SQL code is parsed and converted to ensure that it can execute on either SQLite or Postgres, and to create virtual tables or FDW servers for the partitions referenced in the SQL.

Use Case

The typical use case is that the data wrangler will create a bundle.sql file with a set of statements, and then reference a single view from the SQL file in the sources.csv file (reftype='sql', ref=<name of table or view>). Then, the build process will create a source that can get rows from the view or query. At the start of the build, the system will load the SQL file and create all of the tables, so they can be referenced during the build.

One SQL file can contain many statements that create tables or views, and there can be multiple lines in the sources.csv file that reference these tables and views.

SQL Parsing

The SQL file is loaded and parsed with the sqlparse module. First, use sqlparse to break the SQL file into statements, and process each statement separately.
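A minimal sketch of that first step, assuming the sqlparse library (the per-statement processing would hang off this loop):

import sqlparse

def load_statements(sql_text):
    # Split the bundle.sql text into individual statements so each
    # can be processed separately; skip empty fragments.
    for raw in sqlparse.split(sql_text):
        statement = raw.strip()
        if statement:
            yield statement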

Here is a typical form of CREATE VIEW:

CREATE VIEW footable AS
SELECT * FROM ingest.example.com-basic-simple AS simple
LEFT JOIN ingest.example.com-stages-stage1 AS stage1 on simple.foo = stage1.bar
LEFT JOIN build.example.com-generators-demo AS demo on simple.foo = demo.bar;

In parsing a statement like this, the most important thing is to identify the partition names associated with the FROM and JOIN clauses and create virtual tables or FDW tables for them. In all cases, these names will be converted to vids, and the FDW table will be named for the vid of the partition. (Use library.partition() to look up these names.)

The AS clause, which renames the tables, is required. The processing will not interpolate the name of the table anywhere other than in the FROM or JOIN clauses.
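A minimal sketch of that rewriting step, assuming a simple regex over the FROM/JOIN ... AS pattern (a real implementation would walk the parsed tokens instead) and the library.partition() lookup mentioned above; the .vid attribute is an assumption here:

import re

# Matches "FROM <ref> AS <alias>" or "... JOIN <ref> AS <alias>". The AS
# clause is required, so the pattern can rely on it. Partition names may
# contain dots and dashes, e.g. ingest.example.com-basic-simple.
REF_RE = re.compile(r'\b(FROM|JOIN)\s+([\w][\w.\-]*)\s+AS\s+(\w+)', re.IGNORECASE)

def rewrite_partition_refs(sql, library):
    def replace(match):
        keyword, name, alias = match.groups()
        partition = library.partition(name)  # accepts a vid, id, vname or name
        # Quote the vid so the name survives Postgres case folding.
        return '%s "%s" AS %s' % (keyword, partition.vid, alias)
    return REF_RE.sub(replace, sql)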

Dialect-specific processing

This feature must rewrite the SQL to ensure that it is suitable for either SQLite or Postgres. In particular, Postgres folds unquoted names to lower case, so the mixed-case vids must be quoted to preserve their spelling. The system should quote all names.

SQLite does not have materialized views, so on SQLite, CREATE MATERIALIZED VIEW is implemented with CREATE TABLE.
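For example, a sketch of that rewrite (operating on raw text for brevity; the real version would work on the parsed statement):

def adapt_for_sqlite(statement):
    # SQLite has no materialized views, so materialize into a real
    # table at creation time instead.
    return statement.replace('CREATE MATERIALIZED VIEW', 'CREATE TABLE')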

Indexing

The INDEX statement will create an index, but it has a different form than standard SQL, where it is CREATE INDEX. Here it is just INDEX.

The form is:

INDEX <tableref> (<col1>, <col2>, ...)

where <tableref> is a vid, id, vname, or name of a partition or table. So:

INDEX pERJQxWUVb00a001 (uuid,foo,bar)

or:

INDEX build.example.com-generators-demo (uuid,foo,bar)

INDEX statements get translated to CREATE INDEX with a name based on the tableref and columns.
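A sketch of that translation, with an illustrative naming scheme (the exact name format is an assumption):

def translate_index(vid, columns):
    # e.g. INDEX pERJQxWUVb00a001 (uuid,foo,bar) becomes a CREATE INDEX
    # named from the resolved vid and the column list.
    index_name = '%s_%s_i' % (vid, '_'.join(columns))
    quoted = ', '.join('"%s"' % c for c in columns)
    return 'CREATE INDEX IF NOT EXISTS "%s" ON "%s" (%s);' % (index_name, vid, quoted)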

Notes

The SQL parsing facility will be used by the warehouse feature (#139), so it should be reusable.

ericbusboom commented 8 years ago

> The typical use case is that the data wrangler will create a bundle.sql file with a set of statements, and then reference a single view from the SQL file in the sources.csv file (reftype='sql', ref=<name of table or view>). Then, the build process will create a source that can get rows from the view or query. At the start of the build, the system will load the SQL file and create all of the tables, so they can be referenced during the build.

> Is it possible that way? I mean, bundle.sql is processed in the build phase, while sources.csv is processed in the ingest phase. But the ingest phase occurs before build. How can we ingest sources if we didn't run build?

Hmm … you may be right … for some cases.

First, I should mention that I've changed the build phase so it runs against the original source file, not the ingested MPR file. The build phase used to ingest first, then use the ingested file to build; now it just uses the original source. The ingest phase still exists, but it is only for use while the bundle is being constructed. The data wrangler will ingest in order to automatically create source and destination schemas, and to do other analysis.

So, if a SQL view is being used as a source, it would get run in the Ingest phase in order to get the schema for the query, then again to build the partition.

Most of the time, the .sql file would reference partitions built in other, external, bundles, so they already exist and can be referenced at any phase.

The problem would be when the query references a partition built in an earlier stage. Stages are a bit tricky. Stages are handled automatically during build, but not during ingest. Here is the loop for stages during build:

https://github.com/CivicKnowledge/ambry/blob/master/ambry/bundle/bundle.py#L1818

But, for ingest, the data wrangler needs to ingest and build schemas in the correct order.

So, suppose you have two sources, each in its own stage: "stage1" and "stage2". "stage2" uses a SQL query that references stage1. The wrangler would have to:

  1. Ingest stage1
  2. Create schemas for stage1
  3. Build stage1
  4. Ingest stage2
  5. Create schemas for stage2

Then, in the final build, stage1 would be built first, then stage2.

The problem is that if there is only one SQL file, the queries for stage2 would have to get loaded with stage1, and since the stage1 partitions don't exist, it should raise an error. Let me address that in another comment, since this one's a bit long.

ericbusboom commented 8 years ago

To solve the staging problem, I note that you're already doing fairly low-level parsing of the SQL, so we may want to add a bit more.

Suppose you added an "IN STAGE" command, which would prefix the view commands:

IN STAGE 1 CREATE VIEW ... 

Then, when the SQL file is loaded, the caller can specify a stage. For stage X, the only SQL statements executed are the ones that specify stage X or have no stage specified.
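A sketch of that selection rule, assuming the file has already been split into statements (names here are illustrative):

import re

STAGE_RE = re.compile(r'^\s*IN\s+STAGE\s+(\d+)\s+', re.IGNORECASE)

def statements_for_stage(statements, stage):
    for stmt in statements:
        match = STAGE_RE.match(stmt)
        if match is None:
            yield stmt                  # no stage given: runs in every stage
        elif int(match.group(1)) == stage:
            yield stmt[match.end():]    # strip the prefix before executing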

Also, since the file involves special processing, it would be most consistent to create a new bundle.files.BuildSourceFile for SQL files. I'd suggest the storage should be message pack, like bundle.files.RowBuildSourceFile, but with one SQL statement or comment per line. Maybe do a little pre-processing, so each line is a record or dict that includes: type (comment, index, create, etc.), stage, sql, comment.
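For instance, each record might look like this (field names follow the suggestion above; the exact layout is up to the implementer):

import msgpack

# One record per SQL statement or comment in the file.
record = {
    'type': 'create',   # one of 'comment', 'index', 'create', etc.
    'stage': 1,         # None when no IN STAGE prefix is present
    'sql': 'CREATE VIEW footable AS SELECT * FROM ...',
    'comment': None,
}
packed = msgpack.packb(record)  # one packed record per line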

nmb10 commented 8 years ago

Should a bundle have access to sources created by its own bundle.sql? Here is an example:

Suppose we have exactly one bundle, with simple/bundle.sql:

CREATE TABLE table1 (
    col1 INTEGER,
    col2 INTEGER);

simple/sources.csv:

..., description, reftype, ref
..., Data from table defined in bundle.sql, sql, table1

ericbusboom commented 8 years ago

No, a table declared in the sources file does not need to be able to access itself in SQL.

nmb10 commented 8 years ago

Done. Example of the SQL (I used "asql" (ambry sql) in the source code): https://github.com/CivicKnowledge/ambry/blob/develop/test/bundles/example.com/simple_with_sql/bundle.sql

Example of the sources.csv that uses stuff created in the bundle.sql: https://github.com/CivicKnowledge/ambry/blob/develop/test/bundles/example.com/simple_with_sql/sources.csv

Note: I used "create table / insert rows" for the SQLite view because it simplifies the implementation a lot (it reduces the number of pysqlite/apsw connections). If this is unacceptable, please comment and reopen.