Closed FastLee closed 6 months ago
Spark DataSources supported by SYNC: Delta, Parquet, CSV, JSON, ORC, TEXT, AVRO
Spark DataSources not supported by UC and SYNC: BINARYFILE, JDBC, LIBSVM, custom implementations of org.apache.spark.sql.sources.DataSourceRegister
Hive SerDe tables are not supported by SYNC at all.
table.provider | Hive SerDe (row format) and file format | Migration strategy |
---|---|---|
BINARYFILE | NA | 1. By default, CTAS to Delta. 2. Ask whether the user wants to keep the original file format instead of writing the binary content into Parquet files; if so, do not migrate. |
JDBC | NA | 1. Do not migrate for now. 2. In the future, migrate to Lakehouse Federation; if there is no supported federation connector, consider a view-based solution. |
LIBSVM | NA | Do not migrate. |
Custom implementation of DataSourceRegister | NA | Do not migrate. |
HIVE | inputFormat=OrcInputFormat outputFormat=OrcOutputFormat serde=OrcSerde | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using ORC ... location ...`. |
HIVE | inputFormat=MapredParquetInputFormat outputFormat=MapredParquetOutputFormat serde=ParquetHiveSerDe | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using PARQUET ... location ...`. |
HIVE | inputFormat=AvroContainerInputFormat outputFormat=AvroContainerOutputFormat serde=AvroSerDe | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using AVRO ... location ...`. |
HIVE | inputFormat=SequenceFileInputFormat outputFormat=SequenceFileOutputFormat serde=LazySimpleSerDe | CTAS to Delta. |
HIVE | inputFormat=RCFileInputFormat outputFormat=RCFileOutputFormat serde=LazyBinaryColumnarSerDe | CTAS to Delta. |
HIVE | inputFormat=TextInputFormat outputFormat=HiveIgnoreKeyTextOutputFormat serde=LazySimpleSerDe | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using CSV ... location ...`. The field and line delimiters must be read from the HMS table metadata and set accordingly on the UC CSV table, and quoting must be disabled. If the HMS table storage properties contain any of escape.delim, mapkey.delim, colelction.delim, or serialization.format, which are unsupported, fall back to CTAS to Delta. |
HIVE | inputFormat=TextInputFormat outputFormat=HiveIgnoreKeyTextOutputFormat serde=RegexSerDe | CTAS to Delta. |
HIVE | inputFormat=TextInputFormat outputFormat=HiveIgnoreKeyTextOutputFormat serde=JsonSerDe | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using JSON ... location ...` (needs testing). |
HIVE | inputFormat=TextInputFormat outputFormat=HiveIgnoreKeyTextOutputFormat serde=OpenCSVSerde | 1. By default, CTAS to Delta. 2. If the user prefers an in-place upgrade and confirmed this during installation, migrate with `create table ... using CSV ... location ...`. |
HIVE | All other non-native SerDes | CTAS to Delta; if that fails, skip the table. |
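The decision table above could be encoded roughly as follows. This is only a sketch, not existing UCX code: `TableInfo`, `migration_strategy`, the `prefer_in_place` flag, and the returned strategy labels are hypothetical names introduced for illustration. The BINARYFILE prompt and the "if that fails, skip the table" fallback are not modeled.

```python
from dataclasses import dataclass, field

# Providers that SYNC already handles, per the list at the top of this issue.
SYNC_PROVIDERS = {"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"}

# SerDe short name -> format usable for an in-place upgrade (from the table above).
IN_PLACE_FORMATS = {
    "OrcSerde": "ORC",
    "ParquetHiveSerDe": "PARQUET",
    "AvroSerDe": "AVRO",
    "JsonSerDe": "JSON",
    "OpenCSVSerde": "CSV",
    "LazySimpleSerDe": "CSV",  # only when stored as plain text, see below
}

# Storage properties that rule out an in-place CSV upgrade
# (property names spelled exactly as in the table above).
UNSUPPORTED_TEXT_PROPS = {
    "escape.delim", "mapkey.delim", "colelction.delim", "serialization.format",
}


@dataclass
class TableInfo:  # hypothetical container, not a UCX class
    provider: str
    serde: str = ""
    input_format: str = ""
    storage_properties: dict = field(default_factory=dict)


def migration_strategy(table: TableInfo, prefer_in_place: bool = False) -> str:
    """Return a hypothetical strategy label for one table."""
    provider = table.provider.upper()
    if provider in SYNC_PROVIDERS:
        return "SYNC"
    if provider == "BINARYFILE":
        return "CTAS_TO_DELTA"  # default; the user prompt is not modeled here
    if provider != "HIVE":
        return "SKIP"  # JDBC, LIBSVM, custom DataSourceRegister

    # Compare on the class short name so fully qualified names also match.
    serde = table.serde.rsplit(".", 1)[-1]
    input_format = table.input_format.rsplit(".", 1)[-1]
    target = IN_PLACE_FORMATS.get(serde)
    if serde == "LazySimpleSerDe":
        if input_format != "TextInputFormat":
            target = None  # e.g. SequenceFile: always CTAS
        elif UNSUPPORTED_TEXT_PROPS.intersection(table.storage_properties):
            target = None  # delimiters that a UC CSV table cannot express
    if prefer_in_place and target:
        return f"IN_PLACE_{target}"
    return "CTAS_TO_DELTA"
```

With this shape, CTAS to Delta stays the default for every Hive SerDe table, and the in-place path is taken only when the user opted in and the SerDe has a native equivalent.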
orc, parquet, avro can be found here

@qziyuan isn't table format already there?
It looks like we have to pre-empt this decision making into create_table_mapping CSV
> @qziyuan isn't table format already there?

@nfx For Hive SerDe tables, the current table format, which is derived from table.provider, is always "HIVE". So we need extra information (SerDe, input format, and output format) to differentiate them.

> It looks like we have to pre-empt this decision making into create_table_mapping CSV
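One way to collect that extra information is to read the detailed section of `DESCRIBE TABLE EXTENDED` output. The sketch below is an assumption-laden illustration: `storage_details` is a hypothetical helper, and the row labels (`Provider`, `Serde Library`, `InputFormat`, `OutputFormat`, and the `# Detailed Table Information` marker) are assumed to match what Spark prints; verify them against the runtime in use.

```python
def storage_details(describe_rows):
    """Extract provider/SerDe/format info from DESCRIBE TABLE EXTENDED rows.

    describe_rows is assumed to be an iterable of
    (col_name, data_type, comment) tuples.
    """
    wanted = {
        "Provider": "provider",
        "Serde Library": "serde",
        "InputFormat": "input_format",
        "OutputFormat": "output_format",
    }
    details = {}
    in_detail_section = False
    for col_name, data_type, *_ in describe_rows:
        label = (col_name or "").strip()
        if label == "# Detailed Table Information":
            # Rows before this marker are regular columns; ignore them so a
            # column that happens to be named "Provider" is not picked up.
            in_detail_section = True
            continue
        if in_detail_section and label in wanted:
            details[wanted[label]] = (data_type or "").strip()
    return details
```

The resulting dictionary carries enough detail to tell an ORC-backed Hive table from a text-backed one even though both report "HIVE" as the provider.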
We could either
Is there an existing issue for this?
Problem statement
Tables whose format is not supported by the SYNC command are currently not migrated to UC.
Fine-grained:
Related issues:
Proposed Solution
Allow users to migrate tables of unsupported types by converting them to Delta.
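A minimal sketch of what the conversion statement might look like, assuming a plain CTAS from `hive_metastore` into a UC catalog is sufficient. `ctas_to_delta` and `escape` are hypothetical helpers, and partitioning, table properties, and grants are deliberately left out.

```python
def escape(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def ctas_to_delta(src_schema: str, src_table: str,
                  dst_catalog: str, dst_schema: str, dst_table: str) -> str:
    """Build a CTAS statement that copies an HMS table into a UC Delta table."""
    src = f"hive_metastore.{escape(src_schema)}.{escape(src_table)}"
    dst = f"{escape(dst_catalog)}.{escape(dst_schema)}.{escape(dst_table)}"
    return f"CREATE TABLE IF NOT EXISTS {dst} USING DELTA AS SELECT * FROM {src}"
```

Unlike an in-place upgrade, this copies the data, so the source files keep their original format and location.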
Additional Context
No response