NVIDIA / spark-rapids-examples

A repo for all spark examples using Rapids Accelerator including ETL, ML/DL, etc.
Apache License 2.0

Automatic conversion of DBFS paths and CSP detection in Databricks Notebook for Tools #405

Closed parthosa closed 1 month ago

parthosa commented 2 months ago

Fixes #404 and #407.

Currently, the Databricks notebooks for the tools support only the File API format for event logs stored in DBFS (i.e., /dbfs/path/to/eventlog). They also require users to select the CSP (cloud service provider) manually from a dropdown widget.

Changes

This PR adds the following functionality:

  1. Automatically convert event log paths passed in the Spark API format (i.e., dbfs:/path/to/log) to the File API format (/dbfs/path/to/log).
    • This lets the tool process event logs given in either format, so users no longer have to rewrite paths by hand.
  2. Automatically detect the CSP from the Spark configs (using the property "spark.databricks.clusterUsageTags.cloudProvider").
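
The two behaviors above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper names `to_file_api_path` and `detect_csp` are hypothetical, the config is passed in as a plain mapping so the sketch runs outside Databricks, and lowercasing the detected value is an assumption about how the result might be normalized.

```python
from typing import Mapping, Optional

DBFS_URI_PREFIX = "dbfs:/"    # Spark API format prefix
DBFS_FILE_PREFIX = "/dbfs/"   # File API format prefix
# Property name taken from the PR description.
CSP_CONF_KEY = "spark.databricks.clusterUsageTags.cloudProvider"


def to_file_api_path(path: str) -> str:
    """Convert a dbfs:/ path to its /dbfs/ equivalent (hypothetical helper).

    Paths that do not start with dbfs:/ are returned unchanged (after
    trimming surrounding whitespace).
    """
    path = path.strip()
    if path.startswith(DBFS_URI_PREFIX):
        # Drop the scheme and any extra leading slashes (e.g. dbfs:///x).
        remainder = path[len(DBFS_URI_PREFIX):].lstrip("/")
        return DBFS_FILE_PREFIX + remainder
    return path


def detect_csp(conf: Mapping[str, str],
               default: Optional[str] = None) -> Optional[str]:
    """Read the cloud provider from a config mapping (hypothetical helper).

    Returns the lowercased value, or `default` when the property is
    missing or empty. Lowercasing is an assumption made for this sketch.
    """
    value = conf.get(CSP_CONF_KEY)
    if not value:
        return default
    return value.strip().lower()
```

In a notebook, the mapping could presumably be built from the live session, for example `{CSP_CONF_KEY: spark.conf.get(CSP_CONF_KEY, "")}`, with the dropdown widget kept as a fallback when detection returns nothing.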

Minor Changes:

How to evaluate

nvliyuan commented 2 months ago

Same dead-link issue:

ERROR: 1 dead links found!
[✖] https://gust.dev/r/xgboost-agaricus → Status: 404

FILE: ./docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md
[✓] https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar → Status: 200
[/] /docs/get-started/xgboost-examples/building-sample-apps/python.md → Status: 0
[/] /docs/get-started/xgboost-examples/dataset/mortgage.md → Status: 0
[✓] https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page → Status: 200
[✖] https://gust.dev/r/xgboost-agaricus → Status: 404

5 links checked.

ERROR: 1 dead links found!
[✖] https://gust.dev/r/xgboost-agaricus → Status: 404