Support Big Data formats in CREATE TABLE EXTERNAL FILE [CORE5663]

firebird-automations commented 7 years ago

Submitted by: Juarez Rudsatz (juarezr)

Is duplicated by CORE5818

Votes: 2

With little effort, firebird could be extended for covering many big data processing cenarios.

Basically Big Data processing is done in two ways: - Batch: a program using a big data batch framework reads data from structured storage sources, converts for a programing format like object/struct (properties) or dataset/dataframe (rows/cols), makes several transformations like map, reduce, join, group by, filter and writes the output to a new structured storage. - Streaming: a program using a streaming framework reads data from realtime ou buffered sources and writes to other realtime/buffered destinations or to a structured storage.

Batch frameworks commonly used are Hadoop, Spark, Pig and several others. Streaming frameworks commonly used are Spark streaming, Kafka, Amazon Kinesis, Amazon Firehose, etc... Structured sources can be database data acessed by jdbc or files accessed from network drives, Hadoop Hdfs filesystems, AWS S3 filesystems or Azure Storage filesystems.

Usually the processed data is consumed by: a) directly exporting to a spreadsheet (csv) in a ad-hoc manner b) uploaded to a database or datawarehouse/BI infrastructure c) stored in a pre-summarized format in a structured source for further processing or analysis

Tools used for analysis in cenario c), besides Batch frameworks are: Apache Hive, Amazon Athena, Amazon Spectrum. They basically provides a mecanism to to query files stored in structured sources like Amazon S3, using plain SQL or PIG languages.

Firebird could take a slice of this market just adding some basic support for this workflow.

For performing well in this cenario, firebird should: 1) have a very fast data injection/bulk insert like Amazon Redshift COPY command (Postgresql columnar clone) 2) support the file formats commonly used in big data like: CSV/TSV, Avro, Parquet, ORC, Grok, RCFile, RegexSerDe, SequenceFile 3) extend EXTERNAL FILE for reading these formats from remote structured sources like the cited above.

These can be done by specifying a FORMAT to the CREATE TABLE EXTERNAL FILE existing command. Most of these formats and filesystems have libraries which can be used for speeding the development. Same way, one could start with the most used formats (CSV/TSV, Parquet, Avro) and most used filesystems (AWS S3, Azure Storage).

firebird-automations commented 7 years ago

Commented by: @asfernandes

An external selectable procedure can read the data in the way it wants.

firebird-automations commented 7 years ago

Commented by: Sean Leyne (seanleyne)

@Adriano,

I believe the integration that Juarez is looking for is one where any query (vs. fixed SP logic) with PLAN/Optimization could be supported/integrated in Firebird.

firebird-automations commented 7 years ago

Commented by: @asfernandes

A single external (code) procedure can deal with multiple procedures (metadata).

See FB/Java example:

create or alter procedure employee_pgsql returns ( id type of column http://employee.id, name type of column http://employee.name ) external name 'org.firebirdsql.fbjava.examples.fbjava_example.FbJdbc.executeQuery() !jdbc:postgresql:employee|postgres|postgres' engine java as 'select * from employee';

You want to read departments? Just create new metadata:

create or alter procedure department_pgsql returns ( id integer, department_name varchar(30), lead_employee_id integer ) external name 'org.firebirdsql.fbjava.examples.fbjava_example.FbJdbc.executeQuery() !jdbc:postgresql:employee|postgres|postgres' engine java as 'select * from departments';

External code can read return columns and body of the declared metadata.

So one can use:

select * from employee_pgsql; select * from department_pgsql;

With a single generic external procedure.

firebird-automations commented 7 years ago

Commented by: Sean Leyne (seanleyne)

@Adriano,

I realize that External SP could do that, but when dealing with "big data" sets which have multi-million row/terabyte datasets, to externally query and return all columns/rows to the Firebird engine for filtering would be ... sub-optimal to say the least.

{Separately, I see Juarez request as somewhat simplistic view of the problem for some of the file formats requested. ORC files, for example, are not a single file but rather a collection of files which are intended to be processed by "engines" using query semantics, not read in a linear fashion. So, a SQL query in Firebird would need to be decompose into parts which would be sent to appropriate "file format engines" to be evaluated/optimized/performed with the results and coordinated/merged by the Firebird engine.}

firebird-automations commented 7 years ago

Commented by: @asfernandes

Firebird cannot and will not process and optimize foreign queries. Then it's job to external procedures anyway to load and create or pass queries to foreign engines.

firebird-automations commented 7 years ago

Commented by: Juarez Rudsatz (juarezr)

Bom dia, Adriano,

Thank you for pointing to this interesting solution. External (code) procedure could match several of the use cases referred in this tickets. And several of them came to my mind. Maybe we could develop some external procedures and share or somebody could point to some already working. There is any repository tracking open source external procedures?

But in some cases, the solution will hit the wall of performance, especially in big datasets. Consider as example, an IOT scenario where we have 12 TB of sensor data every day spanning 6 years of collected data. And this is growing each month by the number of devices monitored and by the quantity of information/fields stored in each record.

As Sean pointed, the ticket description really is a simplistic view of the problem scope. One thing not mentioned, for example, is that commonly the data is partitioned in folders by some pattern like time historical or regional distribution. And just importing from a single file, would make the feature not viable. One should expect also reading/writing records in a parallel fashion, for decreasing load/stored times.

Summing all of needs is not viable repeat every solution just as case by case basis. Using only external procedure as feature, should not make Firebird a viable option for big data processing, as people should consider other options with this built-in functionality.

Sean also correctly pointed at the difference between file formats. Parquet format, for example, is columnar, not row based. But CSV/TSV/delimited, fixed width and AVRO formats are row based and should be a first low hanging fruit.

Of course, I understand that this is not a tiny/simple enhancement and it needs sponsors, a cost/benefit analysis plan, and the usual measures of a successful project. But I hope this discussion could be a starting point for a growing niche.

firebird-automations commented 6 years ago

Modified by: Sean Leyne (seanleyne)

Link: This issue is duplicated by [CORE5818](https://github.com/FirebirdSQL/firebird/issues?q=CORE5818+in%3Atitle) \[ [CORE5818](https://github.com/FirebirdSQL/firebird/issues?q=CORE5818+in%3Atitle) \]

firebird-automations commented 6 years ago

Commented by: @livius2

Can this get higher priority? This i common operation and should be supported ASAP. Maybe this is only my POV but i suppose not.

And please change the title of the report as it is to restricly connected to external table i see this also something like in CORE5818.

FirebirdSQL / firebird

Support Big Data formats in CREATE TABLE EXTERNAL FILE [CORE5663] #5929