delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.17k stars 1.62k forks source link

[Storage System] Support for IBM Cloud Object Storage #302

Closed guykhazma closed 3 years ago

guykhazma commented 4 years ago

This PR adds support for IBM Cloud Object Storage (IBM COS) by creating COSLogStore which extends the HadoopFileSystemLogStore and relies on IBM COS ability to handle atomic writes using Etags.

The support for IBM COS relies on the following properties:

  1. Write on COS is all-or-nothing, whether overwrite or not.
  2. List-after-write is consistent.
  3. Write is atomic when using the Stocator - Storage Connector for Apache Spark (v1.1.1+) and setting the configuration fs.cos.atomic.write to true.

In addition I propose the following documentation to be added to the Storage Configuration page.

databricks-cla-assistant commented 4 years ago

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

guykhazma commented 4 years ago

The build fails because the license comment in the new class doesn't contain the copyright line. Since the project was donated to Linux Foundation I wasn't sure which copyright to use.

zsxwing commented 4 years ago

@guykhazma Thanks a lot for your contribution. As we don't have an environment to test it, could you clarify what tests you have done in a real environment?

guykhazma commented 4 years ago

@zsxwing thanks for reviewing. I have tested the LogStore implementation directly to verify it achieves the mutual exclusion requirement by running multiple writes to the same location in threads and verifying that only one succeeds. The two other requirements are properties of the storage system so they should be OK as well.

guykhazma commented 4 years ago

Hi @zsxwing, can you please review the PR.

guykhazma commented 4 years ago

@zsxwing gentle ping, @tdas @mukulmurthy I'll appreciate your review as well.

guykhazma commented 4 years ago

Hi @zsxwing, @tdas, @rahulsmahadev, I have updated the PR to include the new copyright title and the build is successful now. Can you please review this PR. Thanks!

guykhazma commented 4 years ago

@zsxwing, @tdas, @rahulsmahadev can you please review the PR

zsxwing commented 4 years ago

Sorry for the delay. We are working on adding a contrib directory and a new project to put all future LogStore implementations like this. We would like to keep the core project small because special LogStore implementations are not needed by everyone. Will report back when the project is ready.

guykhazma commented 3 years ago

Hi @zsxwing, @tdas, @rahulsmahadev, IBM COS supports now conditional headers on multipart upload requests as well (see here) so I have updated the PR to remove the checks that made sure the write is not using multipart upload.

Now that version 0.7.0 was released can you please take a look and see if this can get in. Thanks

robbyki commented 3 years ago

Any timeline on when this might get into master?

tdas commented 3 years ago

Hey @guykhazma

Our apologies for sitting on the PR for so long. We wanted to figure out what is the best we integrate slightly less battle-tested features into our repo, and we finally figured it out. We are currently in the process of converting our repo to a multi-module SBT project so that we can add other sub-modules (mapping to other maven release artifacts beyond delta-core). Then we can add a contribs module and then accept such new and experimental features.

Are you still interested in contributing to this feature?

guykhazma commented 3 years ago

Hi @tdas ,

Thanks for getting back to me. Yes I would love for this to get in. I think it will be great to enable Delta Lake using IBM Cloud Object Storage. Let me know when it's ready and I'll do the necessary refactor to the PR.

tdas commented 3 years ago

Awsome! thank you! I will ping you when this refactor is merged.

tdas commented 3 years ago

All right! The multi-module refactor has been merged. Let's discuss the next steps forward to get this ready for merging.

  1. Please define a new module using the same pattern as the core module in the build.sbt file.

    • Does the IBM support need additional maven dependencies? I am guessing not. In that case, we can make the module name generic like "delta-contribs" (that is, not specific to IBM). Other such zero-dependency contributions can go in that project as well. Otherwise, lets put it in a specifically named module like "delta-contribs-ibm"
  2. Move the new log store code inside the <new-module-dir>/src/main/<package>/... and the test code in a separate file in <new-module-dir>/src/test/<package>/....

Let me know if you need help to get it working.

guykhazma commented 3 years ago

Thanks for the update @tdas. I will update the PR and will tag you when it's ready.

guykhazma commented 3 years ago

@tdas I have made the changes but didn't change the documentation yet. I guess we want to add to the README the new multi-module so I can suggest something but was not sure if you already something in mind.

guykhazma commented 3 years ago

Hi @tdas ,

Any updates?

Thanks!

tdas commented 3 years ago

My apologies for the delay, we have been trying to line up a few more features like a new, more stable LogStore API that LogStore builders like you folks can use. However, there are a few more pieces that we still need to get in for the public API to be easily usable. I dont want you folks to block on those. So we can continue working using the existing LogStore API, which we will continue supporting for at least the 1.x releases. Later we can rewrite the IBMLogStore to use the new API, we are not going to worry about the 1.0 release we are targeting in a couple of weeks.

Let me leave some quick comments

guykhazma commented 3 years ago

@tdas thanks for the update and review. I made the needed changes.

Also, let me know when you want to have the change to use the new API and I'll open a PR for that.

Thanks

tdas commented 3 years ago

@guykhazma thank you for the speedy response. regarding the new LogStore APIs, I think we have to first convert the HadoopLogStore base class to use the new APIs, only then can the subclasses extending HadoopLogStore also use the new APIs easily.

It can be done after this PR. If you are interested, you can take a crack at it if you want!

guykhazma commented 3 years ago

@tdas thanks for the review! I can try and help with converting the HadoopLogStore base class to use the new APIs sometime in the coming weeks.

tdas commented 3 years ago

awesome. this looks good now. i will start the process of merging this! thank you for the super quick response!

tdas commented 3 years ago

@guykhazma question for you! what is the git email address that you want to use for the commits? Linux foundation guideline requires having an email address tied to each contribution. and your github profile does not have any email on it, and email on the commits are "33684427+guykhazma@users.noreply.github.com" which is not a public email for contacting.