NVIDIA / spark-rapids-ml

Spark RAPIDS MLlib – accelerate Apache Spark MLlib with GPUs
https://nvidia.github.io/spark-rapids-ml/
Apache License 2.0

There is hardcoding in the logistic regression test case when comparing model coefficients #402

Open johnnyzhon opened 1 year ago

johnnyzhon commented 1 year ago

https://github.com/NVIDIA/spark-rapids-ml/blob/branch-23.08/python/tests/test_logistic_regression.py#L306C9-L306C68

assert array_equal(coefficients, [-2.42377087, 2.42377087])

Two concerns:

  1. Where did this baseline array [-2.42377087, 2.42377087] come from?
  2. Does it make the test case difficult to maintain?
lijinf2 commented 1 year ago

  1. It is the output of the cuML single-GPU class. We copy it here to ensure our code (the multi-GPU class) does not break the output.
  2. It seems fine for now, but I would like to learn about potential risks or whether there is a better way.
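One way to reduce the maintenance risk of an exact hardcoded comparison is a tolerance-based check, so the baseline survives small numerical differences across versions or GPU counts. A minimal sketch (the `array_equal` name mirrors the helper used in the test file; the tolerance value here is an assumption, not the project's actual setting):

```python
import numpy as np

def array_equal(actual, expected, tol=1e-6):
    # Compare coefficient arrays element-wise within a tolerance, since
    # floating-point baselines are only reproducible up to precision.
    return np.allclose(np.asarray(actual), np.asarray(expected), atol=tol)

# Baseline copied from the single-GPU cuML output (as in the test):
baseline = [-2.42377087, 2.42377087]

# A hypothetical multi-GPU result that differs only in the last digits
# would still pass the check:
assert array_equal([-2.4237709, 2.4237708], baseline)
```

With this shape of check, the hardcoded value acts as a regression guard rather than an exact-bit contract.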
johnnyzhon commented 1 year ago

Thanks to @lijinf2 for the information. Shall we compare the coefficients between the GPU (spark_rapids_ml) model and the CPU (PySpark) model? Is there a way to achieve that?

lijinf2 commented 1 year ago

Yeah, there is a way. In fact, that is already done in test_compat, which compares the coefficients between the GPU model and the CPU model on the VectorUDT input type. Currently, every algorithm has a similar "test_compat" test case that ensures compatibility with the CPU implementation on the VectorUDT type. I may be wrong, but it seems the CPU algorithm accepts the VectorUDT input type only.

test_toy_example uses the array input type, because the GPU implementation is not restricted to VectorUDT. Spark RAPIDS ML supports three input types: VectorUDT, array, and multiple columns. Relevant information is in test_classifier.

test_toy_example demonstrates a use case for the array input type.
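To make the input-type distinction concrete, here is a hedged sketch of how the same toy data could be laid out for the supported feature formats. The row values are illustrative, not necessarily the ones used in test_toy_example:

```python
# Three equivalent layouts for a 2-feature binary classification toy set.
# spark-rapids-ml accepts all three; pyspark.ml's CPU LogisticRegression
# accepts only the VectorUDT layout.

# 1) Array column: each row is (features_array, label).
array_rows = [
    ([1.0, 2.0], 1.0),
    ([1.0, 3.0], 1.0),
    ([2.0, 1.0], 0.0),
    ([3.0, 1.0], 0.0),
]

# 2) Multi-column: one numeric column per feature, plus the label.
multicol_rows = [(f1, f2, label) for (f1, f2), label in array_rows]

# 3) VectorUDT: in a real test this would wrap each feature list in
#    pyspark.ml.linalg.Vectors.dense(...) before building the DataFrame.

assert multicol_rows[0] == (1.0, 2.0, 1.0)
```

Since only the VectorUDT layout is shared with the CPU path, it makes sense that the GPU-vs-CPU comparison lives in test_compat while test_toy_example exercises the array layout.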

lijinf2 commented 1 year ago

Perhaps we should move test_compat up to be the first test case in test_logistic_regression.py. Let me know if this looks better from an end user's point of view.