CivicSpleen / ambry

A comprehensive data package manager
BSD 2-Clause "Simplified" License
4 stars 5 forks source link

Postgres FDW / Sqlite Virtual Table for PartitionMsgpackDataFile #99

Closed ericbusboom closed 9 years ago

ericbusboom commented 9 years ago

Write a Foreign Data Wrapper for that can read and query PartitionMsgpackDataFile format files. The goal is to be able to attach these files to a warehouse without actually loading them in. It should be possible to write queries with them, and combine them together into views.

There should also be code that will generate the FDW creation DDL (CREATE FOREIGN TABLE), including the proper schema for the table for the specific data file.

Also implement the Qual object to do efficient filtering on the returned values.

Pay particular attention to the lifesystem of the FDW, and the creation of server processes associated with them. The system must be able to handles hundred to thousands of wrappers simultaneously.

Make sure it works with Sqalchemy: http://multicorn.org/sqlalchemy_fdw/. Use the sqalchemy FDW support for creating the FDW in the warehouse.

Create the FDW servers in a schema in the library database, parallel to the ambrylib schema, in a schema named 'partitions'

Similarly, there virtual tables can be developed for Sqlite, using http://apidoc.apsw.googlecode.com/hg/vtable.html

nmb10 commented 9 years ago
Q. How can I find schema of the partition? PartitionMsgpackDataFile has data only (header and rows).

A. The schema is the partitions table and columns. For partition p, p.table and p.table.columns. The table schema is available after running the meta phase.

Q. What is Qual object?

A. It’s referenced in the documentation for Multicorn. http://multicorn.org/implementing-an-fdw/

Q. I can't find any *.msg files. Where will those files be located?

A. After building a bundle,

        partitions = list(b.partitions)
        p0 = partitions[0]
        datafile  = p0.datafile
        system_path = datafile.syspath

p0.datafile will only work if the partition is fetched from the bundle, and specifically, if bundle.wrap_partition() is called on the partition.

Q. What is the typical use case of the feature?

A. The typical case will be user creates a VIEW in the warehouse that uses table data, and rather than installing the data in the warehouse as was done pre 1.0, in this case, the view uses the partition data from a FDW. A lot of the time, we’ll create a MATERIALIZED VIEW that references the foreign table, and users will use the materialized view, to allow faster access, and ( I hope ) indexes.

Q. Where should I put new code?

A. Delete everything under ambry/warehouse and put it as a module there.

nmb10 commented 9 years ago

Q. How can FDW recognize compressed file? Should it read compressed or raw or both? A. For now, just assume they are compressed

nmb10 commented 9 years ago

Both implementations moved to https://github.com/CivicKnowledge/ambry_sources . See CivicKnowledge/ambry_sources#1 for further progress.