NVIDIA / spark-rapids-examples

A repo for all spark examples using Rapids Accelerator including ETL, ML/DL, etc.
Apache License 2.0

Some python and notebook versions of examples have diverged #357

Open eordentlich opened 8 months ago

eordentlich commented 8 months ago

Describe the bug I'm not sure whether this applies to all examples, but the mortgage ETL + XGBoost example has some non-trivial discrepancies. For example, the Python script uses UDFs: https://github.com/NVIDIA/spark-rapids-examples/blob/main/examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/etl.py#L22-L23 while the notebook(s) implement these directly in Spark SQL: https://github.com/NVIDIA/spark-rapids-examples/blob/main/examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb?short_path=2af22cf#L454-L478 There are other differences as well. It looks like the scripts may be lagging behind the notebooks.

Steps/Code to reproduce bug N/A

Expected behavior The notebook and Python script versions should ideally be aligned (or, at the very least, the reasons for any differences should be documented).
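One way to spot this kind of drift is to diff a script against the code cells of its notebook. The following is only a rough sketch, not something that exists in the repo: the helper names (`notebook_code_lines`, `normalize`, `divergence`) are hypothetical, and it assumes the notebook is standard `.ipynb` JSON with a top-level `cells` list. It compares text line by line after stripping whitespace, so it flags textual differences, not semantic ones (a UDF and an equivalent Spark SQL expression would still show up as a difference, which is the point here).

```python
import difflib
import json


def notebook_code_lines(notebook_json):
    """Return the source lines of all code cells in a notebook, in order."""
    nb = json.loads(notebook_json)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        # The notebook format allows "source" to be a string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        lines.extend(text.splitlines())
    return lines


def normalize(lines):
    """Strip whitespace and drop blank lines, comments, and notebook magics."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "%", "!")):
            continue
        kept.append(stripped)
    return kept


def divergence(script_text, notebook_json):
    """Return a unified diff (list of lines) between script and notebook code."""
    script = normalize(script_text.splitlines())
    notebook = normalize(notebook_code_lines(notebook_json))
    return list(
        difflib.unified_diff(
            script, notebook, fromfile="script", tofile="notebook", lineterm=""
        )
    )
```

To use it, read the `.py` file and the `.ipynb` file as text and pass both to `divergence`; an empty list means the normalized code lines match. Because indentation is discarded, treat the output as a hint for manual review, not as proof of equivalence.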

Environment details (please complete the following information) N/A

GaryShen2008 commented 8 months ago

@nvliyuan Do you remember who wrote these examples? I can't recall the reason for the difference, but there should be one.

nvliyuan commented 7 months ago

Yes, different implementations of the same example should keep the same logic. I will draft a PR to fix it.