databrickslabs / ucx

Automated migrations to Unity Catalog

[FEATURE]: Migrate external tables not supported by the "sync" command #889

Closed FastLee closed 6 months ago

FastLee commented 9 months ago

Problem statement

Tables whose format is not one of the formats supported by the `SYNC` command are currently not migrated to UC.

Proposed Solution

Allow users to migrate tables of unsupported types by converting them to Delta.

Additional Context

No response

qziyuan commented 7 months ago

Migration strategy:

| table.provider | Hive SerDe (row format) and file format | Migration strategy |
|---|---|---|
| BINARYFILE | N/A | 1. By default, CTAS to Delta.<br>2. Prompt whether the user wants to keep the original file format instead of writing the binary content into Parquet files; if so, do not migrate. |
| JDBC | N/A | 1. Do not migrate right now.<br>2. In the future, migrate to Lakehouse Federation; if no supported federation connector exists, consider a view-based solution. |
| LIBSVM | N/A | Do not migrate. |
| Custom implementation of `DataSourceRegister` | N/A | Do not migrate. |
| HIVE | inputFormat=`OrcInputFormat`<br>outputFormat=`OrcOutputFormat`<br>serde=`OrcSerde` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING ORC ... LOCATION ...` |
| HIVE | inputFormat=`MapredParquetInputFormat`<br>outputFormat=`MapredParquetOutputFormat`<br>serde=`ParquetHiveSerDe` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING PARQUET ... LOCATION ...` |
| HIVE | inputFormat=`AvroContainerInputFormat`<br>outputFormat=`AvroContainerOutputFormat`<br>serde=`AvroSerDe` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING AVRO ... LOCATION ...` |
| HIVE | inputFormat=`SequenceFileInputFormat`<br>outputFormat=`SequenceFileOutputFormat`<br>serde=`LazySimpleSerDe` | CTAS to Delta. |
| HIVE | inputFormat=`RCFileInputFormat`<br>outputFormat=`RCFileOutputFormat`<br>serde=`LazyBinaryColumnarSerDe` | CTAS to Delta. |
| HIVE | inputFormat=`TextInputFormat`<br>outputFormat=`HiveIgnoreKeyTextOutputFormat`<br>serde=`LazySimpleSerDe` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING CSV ... LOCATION ...`; read the field and line delimiters from the HMS table metadata, set them accordingly on the UC CSV table, and disable quoting. If the HMS table storage properties contain `escape.delim`, `mapkey.delim`, `colelction.delim`, or `serialization.format`, which are unsupported, do CTAS to Delta instead. |
| HIVE | inputFormat=`TextInputFormat`<br>outputFormat=`HiveIgnoreKeyTextOutputFormat`<br>serde=`RegexSerDe` | CTAS to Delta. |
| HIVE | inputFormat=`TextInputFormat`<br>outputFormat=`HiveIgnoreKeyTextOutputFormat`<br>serde=`JsonSerDe` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING JSON ... LOCATION ...` (needs testing). |
| HIVE | inputFormat=`TextInputFormat`<br>outputFormat=`HiveIgnoreKeyTextOutputFormat`<br>serde=`OpenCSVSerde` | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirms during installation, migrate with `CREATE TABLE ... USING CSV ... LOCATION ...` |
| HIVE | All other non-native SerDes | CTAS to Delta; if it fails, skip the table. |
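The strategy selection above can be sketched as a lookup from the SerDe to the in-place target format, falling back to CTAS to Delta. This is an illustrative sketch only; the function and dictionary names are hypothetical, not UCX's actual API:

```python
# Hypothetical sketch of the strategy table above; names are illustrative.
# SerDes that support an in-place `CREATE TABLE ... USING <fmt> ... LOCATION ...`
IN_PLACE_FORMATS = {
    "OrcSerde": "ORC",
    "ParquetHiveSerDe": "PARQUET",
    "AvroSerDe": "AVRO",
    "JsonSerDe": "JSON",
    "OpenCSVSerde": "CSV",
}


def migration_sql(catalog: str, schema: str, table: str, serde: str,
                  location: str, prefer_in_place: bool = False) -> str:
    """Return the DDL used to migrate one Hive SerDe table to UC."""
    fmt = IN_PLACE_FORMATS.get(serde)
    if prefer_in_place and fmt:
        # In-place upgrade: register the existing files under the new format.
        return (f"CREATE TABLE {catalog}.{schema}.{table} "
                f"USING {fmt} LOCATION '{location}'")
    # Default: CTAS to Delta (rewrites the data).
    return (f"CREATE TABLE {catalog}.{schema}.{table} "
            f"AS SELECT * FROM hive_metastore.{schema}.{table}")


print(migration_sql("main", "sales", "orders", "OrcSerde",
                    "s3://bucket/orders", prefer_in_place=True))
# -> CREATE TABLE main.sales.orders USING ORC LOCATION 's3://bucket/orders'
```

A SerDe with no in-place mapping (e.g. `RegexSerDe`) always falls through to the CTAS branch, matching the table's "CTAS to Delta" rows.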

Changes required

  1. The table crawler needs to crawl Hive SerDe and file-format info.
  2. The `Table` class should store Hive SerDe and file-format info.
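A minimal sketch of change 2, assuming a dataclass-based `Table` record; the field names here are illustrative and may not match what UCX ends up using:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical extension of the crawler's Table record with the extra
# SerDe/file-format fields proposed above; field names are illustrative.
@dataclass
class Table:
    catalog: str
    database: str
    name: str
    table_format: str                    # e.g. "DELTA", "PARQUET", "HIVE"
    # New fields so HIVE tables can be told apart (change 2 above):
    input_format: Optional[str] = None   # e.g. "...OrcInputFormat"
    output_format: Optional[str] = None  # e.g. "...OrcOutputFormat"
    serde: Optional[str] = None          # e.g. "...OrcSerde"

    @property
    def is_hive_serde(self) -> bool:
        # Only HIVE-provider tables need the per-SerDe strategy table.
        return self.table_format.upper() == "HIVE"
```

With defaults of `None`, existing non-HIVE rows crawled before this change still deserialize cleanly.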

nfx commented 7 months ago

@qziyuan isn't table format already there?

nfx commented 7 months ago

It looks like we have to pre-empt this decision making into create_table_mapping CSV

qziyuan commented 7 months ago

> @qziyuan isn't table format already there?

@nfx For Hive SerDe tables, the current table format, derived from `table.provider`, will always be "HIVE". So we need extra info on the SerDe and input/output formats to differentiate them.
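To illustrate the point: `DESCRIBE TABLE EXTENDED` exposes the SerDe details that `table.provider` alone hides. A small sketch, assuming the output is available as key/value rows (the exact row layout of real crawler output may differ):

```python
# Sketch: pull SerDe details out of DESCRIBE TABLE EXTENDED-style key/value
# rows, since table.provider is just "HIVE" for every Hive SerDe table.
def serde_info(describe_rows: list[tuple[str, str]]) -> dict[str, str]:
    wanted = {"InputFormat", "OutputFormat", "Serde Library"}
    return {key: value for key, value in describe_rows if key in wanted}


# Example rows for an ORC-backed Hive SerDe table (illustrative values).
rows = [
    ("Provider", "HIVE"),
    ("InputFormat", "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"),
    ("OutputFormat", "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"),
    ("Serde Library", "org.apache.hadoop.hive.ql.io.orc.OrcSerde"),
]
info = serde_info(rows)
```

Here `info` carries exactly the three fields the crawler would need to persist to pick a row in the strategy table.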

> It looks like we have to pre-empt this decision making into create_table_mapping CSV

We could either