apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Support plaintext Data (CSV, TSV, etc.) in Iceberg Tables #118

Closed mccheah closed 7 months ago

mccheah commented 5 years ago

In addition to ORC and Parquet, we propose adding support for plaintext data in Iceberg. There are always a lot of questions to ask in this space - hopefully we can leverage some existing work in Spark and other places.

rdblue commented 5 years ago

Some of the guarantees made by Iceberg can't be satisfied by plain-text formats like CSV and JSON. The intent was for Iceberg tables to work the same way no matter what format is used to store the data. Here are a few problems:

Basically, these formats aren't suitable for tables that make guarantees about schema evolution or have features like splittability.
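As a rough illustration of the splittability point, here is a minimal sketch using only Python's standard `csv` module. The sample data and the helper names `parse_whole` and `parse_naive_splits` are made up for this example and have nothing to do with Iceberg's or Spark's actual readers. It shows that cutting a CSV file at a newline boundary, which is what a line-oriented split would do, can land inside a quoted field and change the parsed records.

```python
import csv
import io

# Hypothetical sample: record 1 has a quoted field containing a newline.
DATA = 'id,comment\n1,"line one\nline two"\n2,plain\n'


def parse_whole(text):
    """Parse the text as one unit."""
    return list(csv.reader(io.StringIO(text)))


def parse_naive_splits(text, offset):
    """Cut at the first newline at or after `offset`, then parse each part
    independently, as a line-oriented split reader would."""
    cut = text.index("\n", offset) + 1
    rows = []
    for part in (text[:cut], text[cut:]):
        rows.extend(csv.reader(io.StringIO(part)))
    return rows


if __name__ == "__main__":
    print(parse_whole(DATA))             # three well-formed records
    print(parse_naive_splits(DATA, 12))  # cut lands inside the quoted field
```

A split is only safe here if the reader already knows whether it starts inside a quoted field, which in general means scanning from the beginning of the file. Parquet and ORC avoid this because their row groups and stripes have explicit boundaries recorded in file metadata.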

We could add a different mode to support some of these, but I don't see enough value in it. I think the right path is to use a Spark or Hive table for CSV data and load it into an Iceberg table for long-term storage to get reliability and performance.

mccheah commented 5 years ago

Is there a place for Iceberg to support CSV datasets if the metadata is frozen? For example, no schema evolution, and metadata such as delimiters has to remain static?
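To make the "frozen metadata" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `FrozenCsvSpec` and `read_rows` are illustrative names, not an Iceberg API or a proposed design. The assumption is simply that the delimiter, header flag, and column list are fixed once and that every file must match them exactly.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class FrozenCsvSpec:
    """Hypothetical immutable description of a CSV dataset (not an Iceberg API)."""
    delimiter: str
    has_header: bool
    # Ordered (column name, converter) pairs; columns are matched by position.
    columns: Tuple[Tuple[str, Callable[[str], object]], ...]


def read_rows(text, spec):
    """Yield one dict per record, rejecting records that do not match the spec."""
    reader = csv.reader(io.StringIO(text), delimiter=spec.delimiter)
    if spec.has_header:
        next(reader, None)  # skip the header line; names come from the spec
    for record in reader:
        if len(record) != len(spec.columns):
            raise ValueError(
                f"expected {len(spec.columns)} fields, got {len(record)}"
            )
        yield {
            name: convert(value)
            for (name, convert), value in zip(spec.columns, record)
        }


if __name__ == "__main__":
    spec = FrozenCsvSpec("\t", True, (("id", int), ("name", str)))
    for row in read_rows("id\tname\n1\ta\n2\tb\n", spec):
        print(row)
```

Because columns are matched by position only, adding, dropping, or reordering a column would silently change the meaning of existing files, which is why the spec in this sketch cannot be modified after creation.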

mccheah commented 5 years ago

Mainly because it would be nice not to branch how we read our data based on file format: a Spark text file source for text, and Iceberg temp tables for Parquet / ORC. We could do it, probably by choosing the implementation at the DSv2 TableCatalog level. But it would be ideal if we could unify everything.

rdblue commented 5 years ago

Some people have suggested adding a mode for Iceberg tables that have no schema, for storing unstructured data. I think that would help here. We could start by treating the data as unstructured.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 7 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.