NVIDIA / spark-rapids-ml

Spark RAPIDS MLlib – accelerate Apache Spark MLlib with GPUs
https://nvidia.github.io/spark-rapids-ml/
Apache License 2.0

There is hardcoding in the logistic regression test case when comparing model coefficients #402

Open johnnyzhon opened 1 year ago

johnnyzhon commented 1 year ago

https://github.com/NVIDIA/spark-rapids-ml/blob/branch-23.08/python/tests/test_logistic_regression.py#L306C9-L306C68

assert array_equal(coefficients, [-2.42377087, 2.42377087])

Two concerns:

  1. Where did this baseline array [-2.42377087, 2.42377087] come from?
  2. Does it make the test case difficult to maintain?
lijinf2 commented 1 year ago

  1. It is the output of the cuML single-GPU class. We copy it here to ensure our code (the multi-GPU class) does not break the output.
  2. It seems fine for now, but I would like to learn about potential risks or whether there is a better way.
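One way to reduce the maintenance risk of an exact hardcoded comparison is a tolerance-based check, so the baseline survives small numerical differences across versions or GPU counts. A minimal sketch (the `array_equal` name mirrors the helper used in the test file; the tolerance value here is an assumption, not the project's actual setting):

```python
import numpy as np

def array_equal(actual, expected, tol=1e-6):
    # Compare coefficient arrays element-wise within a tolerance, since
    # floating-point baselines are only reproducible up to precision.
    return np.allclose(np.asarray(actual), np.asarray(expected), atol=tol)

# Baseline copied from the single-GPU cuML output (as in the test):
baseline = [-2.42377087, 2.42377087]

# A hypothetical multi-GPU result that differs only in the last digits
# would still pass the check:
assert array_equal([-2.4237709, 2.4237708], baseline)
```

With this shape of check, the hardcoded value acts as a regression guard rather than an exact-bit contract.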
johnnyzhon commented 1 year ago

Thanks to @lijinf2 for the information. Shall we compare the coefficients between the GPU (spark_rapids_ml) model and the CPU (PySpark) model? Is there a way to achieve that?

lijinf2 commented 1 year ago

Yeah, there is a way. In fact, that is already done in test_compat, which compares the coefficients between the GPU model and the CPU model on the VectorUDT input type. Currently, every algorithm has a similar "test_compat" test case that ensures compatibility with the CPU implementation on the VectorUDT type. I may be wrong, but it seems the CPU algorithm accepts the VectorUDT input type only.

test_toy_example uses the array input type, because the GPU implementation is not restricted to VectorUDT. Spark RAPIDS ML supports three input types: VectorUDT, array, and multiple columns. Relevant information is in test_classifier.

test_toy_example demonstrates a use case for the array input type.
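To make the input-type distinction concrete, here is a hedged sketch of how the same toy data could be laid out for the supported feature formats. The row values are illustrative, not necessarily the ones used in test_toy_example:

```python
# Three equivalent layouts for a 2-feature binary classification toy set.
# spark-rapids-ml accepts all three; pyspark.ml's CPU LogisticRegression
# accepts only the VectorUDT layout.

# 1) Array column: each row is (features_array, label).
array_rows = [
    ([1.0, 2.0], 1.0),
    ([1.0, 3.0], 1.0),
    ([2.0, 1.0], 0.0),
    ([3.0, 1.0], 0.0),
]

# 2) Multi-column: one numeric column per feature, plus the label.
multicol_rows = [(f1, f2, label) for (f1, f2), label in array_rows]

# 3) VectorUDT: in a real test this would wrap each feature list in
#    pyspark.ml.linalg.Vectors.dense(...) before building the DataFrame.

assert multicol_rows[0] == (1.0, 2.0, 1.0)
```

Since only the VectorUDT layout is shared with the CPU path, it makes sense that the GPU-vs-CPU comparison lives in test_compat while test_toy_example exercises the array layout.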

lijinf2 commented 1 year ago

Perhaps we should move test_compat up to be the first test case in test_logistic_regression.py. Let me know if this looks better from an end user's point of view.