apache / polaris

Apache Polaris, the interoperable, open source catalog for Apache Iceberg
https://polaris.apache.org/
Apache License 2.0

Spark Jupyter getting started docker compose #295

Closed: kevinjqliu closed this 3 weeks ago

kevinjqliu commented 1 month ago

Description

This PR moves the docker-compose-jupyter.yml file (and the notebooks/ directory), formerly in the top-level directory, into the getting-started/spark/ folder.

The purpose is to unify the "getting started" guides into the same directory.

Fixes #110

How Has This Been Tested?

docker-compose -f getting-started/spark/docker-compose.yml up

Open the SparkPolaris.ipynb Jupyter notebook. Grab the root principal credentials from the Polaris service and paste them into the notebook cell. Run all cells in the notebook.
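The credentials step above can be scripted rather than copied by hand. The following is a minimal sketch, not part of the PR: `extract_root_credentials` is a hypothetical helper, and it assumes the Polaris service logs a line containing `root principal credentials: <client-id>:<client-secret>`. Check the actual log output before relying on it.

```shell
#!/bin/sh
# Sketch only: read log text on stdin and print the first
# "<client-id>:<client-secret>" pair that follows the marker text.
# ASSUMPTION: the log line contains "root principal credentials: <id>:<secret>".
extract_root_credentials() {
  sed -n 's/.*root principal credentials: \([^[:space:]]*\).*/\1/p' | head -n 1
}
```

One possible usage, assuming the compose service is named `polaris` (the service name is also an assumption): `docker-compose -f getting-started/spark/docker-compose.yml logs polaris | extract_root_credentials`.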

kevinjqliu commented 1 month ago

I want to make sure this is something we want to do before proceeding to add more to the PR

cc @collado-mike / @flyrain

flyrain commented 1 month ago

Makes sense to me. Thanks @kevinjqliu! Do we have any docs for its usage? We should add some if not.

kevinjqliu commented 1 month ago

@flyrain yep, I'll add a README in here, similar to the Trino one

flyrain commented 1 month ago

Sounds good. We will need these docs to be on the Polaris doc site, like this: https://polaris.apache.org/docs/overview/. I couldn't find Trino's doc there, so this may involve publishing the doc and linking to it. cc @jbonofre

kevinjqliu commented 1 month ago

I see, this is the README for Trino. I'll add a similar README for Spark.

As a follow-up, we can change the Polaris doc to refer to these guides https://polaris.apache.org/docs/quickstart

collado-mike commented 1 month ago

This looks good to me. We should change the name of the compose file to just docker-compose.yml so we don't have to specify the filename in the command line :)
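To illustrate the rename suggestion: when no `-f` flag is given, `docker-compose` looks for a file with a default name in the working directory. The sketch below mimics that lookup for the two classic names only. `find_default_compose_file` is a hypothetical helper written for illustration, and the lookup order shown is an assumption, not the tool's own code.

```shell
#!/bin/sh
# Sketch only: print the compose file that would be picked up without -f,
# or return non-zero if the directory has no file with a default name.
# ASSUMPTION: only docker-compose.yml and docker-compose.yaml are checked here.
find_default_compose_file() {
  dir=${1:-.}
  for name in docker-compose.yml docker-compose.yaml; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}
```

With the file named docker-compose-jupyter.yml the helper finds nothing, which is why the command needs `-f`. After the rename, running `docker-compose up` from inside getting-started/spark/ would be enough.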

kevinjqliu commented 1 month ago

@collado-mike makes sense, will do.

I posted a question on Slack about being unable to assume the role arn:aws:iam::631484165566:role/datalake-storage-integration-role in the notebook. Do you mind taking a look?

kevinjqliu commented 1 month ago

r? @flyrain @RussellSpitzer @collado-mike

Also opened #319 to update the Polaris doc site once this is merged.

kevinjqliu commented 1 month ago

md check intermittently shows https://redocly.com/docs/cli/installation as 400 error, weird

flyrain commented 1 month ago

> md check intermittently shows https://redocly.com/docs/cli/installation as 400 error, weird

It's OK to remove the link for now since we’re transitioning to Hugo.

kevinjqliu commented 1 month ago

@flyrain just had to run the CI a few times, it's unrelated to this change

kevinjqliu commented 3 weeks ago

Thanks for the review @flyrain, addressed your comments

flyrain commented 3 weeks ago

We cannot merge any PR until #374 is merged.

kevinjqliu commented 3 weeks ago

Thanks for the heads up, I'll rebase once that PR's merged

kevinjqliu commented 3 weeks ago

@flyrain I took your advice and moved the getting-started/spark/create-polaris-catalog.sh logic into the Jupyter notebook. I also rebased off the latest main. I think this PR is good to go. Please take a look!

flyrain commented 3 weeks ago

Thanks a lot for working on it, @kevinjqliu! Thanks all for the review.