apache / iceberg

[Proposal] An iceberg-unstructured module #859

Closed rdsr closed 8 months ago

rdsr commented 4 years ago

I think an Iceberg unstructured module would be great to have. It would be useful for datasets which do not have proper schema evolution rules (e.g. CSV) or are completely binary (e.g. ML models), but can still use Iceberg features like snapshot isolation and partitioning.

This, of course, cannot be supported by any compute engine; it would be more of a standalone API, like the Iceberg generics module.

I'm unsure whether we need data reading capabilities. As a first version, I'd be happy if we only support scan on the read side, which can return the files as an Iterable; on the write side we can possibly support appending files to the table.
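
To make the first-version scope concrete, here is a minimal sketch of a table that only tracks opaque files: appends create new snapshots, and a scan returns the file paths of one pinned snapshot. This is a toy in-memory model, not the Iceberg API; `UnstructuredTable`, `Snapshot`, `append_files`, and `scan` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of file paths visible at one point in time."""
    snapshot_id: int
    files: Tuple[str, ...]


@dataclass
class UnstructuredTable:
    """Hypothetical stand-in for a table that tracks opaque files."""
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append_files(self, paths: Iterable[str]) -> Snapshot:
        # Each commit creates a new snapshot; older snapshots are never mutated.
        base = self.current()
        new = Snapshot(base.snapshot_id + 1, base.files + tuple(paths))
        self.snapshots.append(new)
        return new

    def scan(self, snapshot_id: Optional[int] = None) -> Iterator[str]:
        # Snapshot ids double as list indexes in this toy model.
        # Readers pin a snapshot, so later appends do not affect them.
        snap = self.current() if snapshot_id is None else self.snapshots[snapshot_id]
        return iter(snap.files)


table = UnstructuredTable()
first = table.append_files(["models/v1.bin"])
table.append_files(["models/v2.bin"])
print(list(table.scan(first.snapshot_id)))  # ['models/v1.bin']
print(list(table.scan()))                   # ['models/v1.bin', 'models/v2.bin']
```

A real module would persist snapshots in metadata files and commit them atomically; the sketch only shows the read and write surface being proposed.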

What do people think, should we build something like this?

jerryshao commented 4 years ago

I like this idea. Features like snapshot isolation are quite useful in different areas.

XiaokunDing commented 4 years ago

Awesome. In the ML area, users could save multiple versions of embedded data to Iceberg.

aokolnychyi commented 4 years ago

There was a similar proposal described in #118.

I think it is reasonable to support unstructured data in Iceberg with some limitations (e.g. schema evolution). If we do so, I believe it should be a complete implementation so that people using Spark CSV/JSON sources can migrate to it.

rdblue commented 4 years ago

Sounds like a good idea to me to extract the parts that can be used to track non-table datasets with atomic changes.

I think #118 is slightly different from my discussions with Matt. The idea there is to define an interface for file formats so you can plug in your own. Those interfaces will define capabilities that Iceberg would enforce. So if you use CSV or TSV, then Iceberg would only allow appending and renaming columns.
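
As a rough illustration of the capability idea, the sketch below has each format declare which schema changes it supports, and a check that rejects anything else. All names here (`Capability`, `FORMAT_CAPABILITIES`, `check_schema_change`) are hypothetical, and the capability set listed for Parquet is an assumption added only for contrast; the CSV entry follows the append-and-rename rule mentioned above.

```python
from enum import Flag, auto


class Capability(Flag):
    """Schema changes a file format may declare support for."""
    APPEND_COLUMN = auto()
    RENAME_COLUMN = auto()
    DROP_COLUMN = auto()
    REORDER_COLUMNS = auto()


# Hypothetical capability declarations, keyed by format name.
FORMAT_CAPABILITIES = {
    # Per the discussion: CSV only allows appending and renaming columns.
    "csv": Capability.APPEND_COLUMN | Capability.RENAME_COLUMN,
    # Assumed for illustration only.
    "parquet": (
        Capability.APPEND_COLUMN
        | Capability.RENAME_COLUMN
        | Capability.DROP_COLUMN
        | Capability.REORDER_COLUMNS
    ),
}


def check_schema_change(file_format: str, change: Capability) -> None:
    """Raise ValueError if the format has not declared support for the change."""
    allowed = FORMAT_CAPABILITIES[file_format]
    if change not in allowed:
        raise ValueError(f"{file_format} does not support {change.name}")


check_schema_change("csv", Capability.RENAME_COLUMN)  # accepted
try:
    check_schema_change("csv", Capability.DROP_COLUMN)
except ValueError as err:
    print(err)  # csv does not support DROP_COLUMN
```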

SinghAsDev commented 3 years ago

I am interested in this as well. Is there any work being done in this space?

rdblue commented 3 years ago

We talked about this recently in Slack. An alternative that I like is adding a BLOB type to the spec that is backed by a file. Then each unstructured file is represented by a row in a table that can contain metadata, and isn't just a file reference in an Iceberg manifest file.

I think using a BLOB type is a simpler path to managing unstructured data collections and would end up delivering more features.
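
A minimal sketch of what "one row per unstructured file" could look like, assuming the BLOB value is just a pointer to a file and the metadata is a plain string-to-string dictionary. Neither choice comes from the spec; `BlobRef`, `BlobRow`, and `find_rows` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical BLOB value: a pointer to a file rather than inline bytes."""
    location: str
    size_bytes: int


@dataclass
class BlobRow:
    """One table row per unstructured file, carrying queryable metadata."""
    blob: BlobRef
    metadata: Dict[str, str] = field(default_factory=dict)


def find_rows(rows: List[BlobRow], key: str, value: str) -> List[BlobRow]:
    # Because files are rows, they can be filtered on metadata like any table data.
    return [row for row in rows if row.metadata.get(key) == value]


rows = [
    BlobRow(BlobRef("s3://bucket/img/cat.png", 2048), {"label": "cat"}),
    BlobRow(BlobRef("s3://bucket/img/dog.png", 4096), {"label": "dog"}),
]
print([row.blob.location for row in find_rows(rows, "label", "cat")])
# ['s3://bucket/img/cat.png']
```

The point of the row-based design is that filtering, partitioning, and snapshots apply to the metadata columns, which a bare file reference in a manifest would not give you.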

SinghAsDev commented 3 years ago

I think that makes sense. The readers will then return just the file names? Just curious, what type of metadata would one want to store for each file? Would metadata be a specific type or like a dictionary?

Is the discussion on Apache Slack? I could not find the thread with a few keyword searches like blob and unstructured data 😀.

RussellSpitzer commented 3 years ago

I remember this. I forgot to note that I talked with the internal group that wanted this support. They had basically built very similar tooling, although they were building compressed binaries of all the files in a given addition to the table. So you could build a dataset by indicating which compressed binaries had the files you wanted, along with the read offsets.
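
One way such a layout could work, sketched under my own assumptions since the details of that tooling are not described here: compress each file separately, concatenate the results into one binary, and keep an index of (offset, length) per file so a reader can pull out a single member. `pack` and `read_member` are hypothetical helpers.

```python
import zlib
from typing import Dict, List, Tuple

Index = Dict[str, Tuple[int, int]]  # file name -> (offset, compressed length)


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Compress each file and concatenate; return the blob and an offset index."""
    chunks: List[bytes] = []
    index: Index = {}
    offset = 0
    for name, data in files.items():
        compressed = zlib.compress(data)
        index[name] = (offset, len(compressed))
        chunks.append(compressed)
        offset += len(compressed)
    return b"".join(chunks), index


def read_member(blob: bytes, index: Index, name: str) -> bytes:
    """Read one file back using only its recorded offset and length."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


blob, index = pack({"a.txt": b"hello", "b.txt": b"world" * 100})
print(read_member(blob, index, "b.txt") == b"world" * 100)  # True
```

A dataset would then be a list of (binary, offset, length) entries, which is the kind of per-file reference a table row or manifest entry could store.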

RussellSpitzer commented 3 years ago

The discussion was in the non-Apache Slack:

https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1628242403033700

SinghAsDev commented 3 years ago

Ah, interesting. I just joined the Slack channel and found the thread; I will follow up there.

LLsion commented 1 year ago

I want to know whether it is possible to store unstructured data such as pictures or videos on HDFS with Iceberg. It is also hard for me to find the relevant API to export the data managed in Iceberg. I hope you can answer my questions. Thank you.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 8 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'