awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Fix Breeze dependency conflict in Anomaly Detection Spark 3.4+ #545

Closed zeotuan closed 2 months ago

zeotuan commented 4 months ago

Update the Breeze version to 2.1 to match the Breeze dependency of spark-mllib 3.4 and spark-mllib 3.5. This lets people migrating to Spark 3.4+ use anomaly detection without the dependency conflict reported in https://github.com/awslabs/deequ/issues/336, https://github.com/awslabs/deequ/issues/393, and https://github.com/awslabs/deequ/issues/428. Breeze 0.13.2 also has several security vulnerabilities that were resolved in Breeze 2.1.0.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zeotuan commented 4 months ago

Hi @rdsharma26, what do you think about updating the Breeze version? Are there other workarounds to make Anomaly Detection work on more recent versions of Spark?

rdsharma26 commented 4 months ago

The change looks good. Let me get back to you after understanding how this change affects our internal Spark 3.3 / 3.1 branches.

zeotuan commented 3 months ago

Hi @rdsharma26, I just want to check the status of this. Is there anything I can help with (testing 3.3, 3.1, etc.)?

rdsharma26 commented 3 months ago

@zeotuan Apologies for the delayed response. Would it be possible for you to check how this change works against the 2.0.0-spark-3.1-minor and spark-3.3 branches? Does mvn clean install work when you cherry-pick these changes onto those branches?

zeotuan commented 2 months ago

Hi @rdsharma26, Breeze 2.1.0 is not compatible with the spark-3.3 and 2.0.0-spark-3.1-minor branches: Spark 3.3 relies on Breeze 1.2, and Spark 3.1 relies on Breeze 1.0. Updating Breeze to those versions on the respective branches works. Fixing the anomaly detection issue for those Spark versions would probably require separate PRs.
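
The version pairs mentioned in this thread can be written down as a small lookup, which may help when deciding which Breeze version a given branch needs. This is only a sketch: the table holds just the pairs reported here (Spark 3.1, 3.3, 3.4, 3.5), and the names BREEZE_BY_SPARK and breeze_for_spark are hypothetical helpers, not part of deequ or Spark.

```python
# Spark minor version -> Breeze version, as reported in this thread.
# Only the pairs stated in the discussion are listed; other Spark
# versions are deliberately left out rather than guessed.
BREEZE_BY_SPARK = {
    "3.1": "1.0",
    "3.3": "1.2",
    "3.4": "2.1",
    "3.5": "2.1",
}


def breeze_for_spark(spark_version: str) -> str:
    """Return the Breeze version reported for a Spark version such as '3.4.1'."""
    # Reduce '3.4.1' to its minor line, '3.4'.
    minor = ".".join(spark_version.split(".")[:2])
    try:
        return BREEZE_BY_SPARK[minor]
    except KeyError:
        raise ValueError(f"No Breeze version recorded for Spark {minor}") from None


print(breeze_for_spark("3.4.1"))  # prints 2.1
```

Raising on an unknown Spark line, rather than falling back to a default, keeps the lookup from silently suggesting a Breeze version that nobody in the thread has verified.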

chenliu0831 commented 2 months ago

I think this would fix PyDeequ's upgrade to PySpark 3.4 as well; see the Breeze-related errors here: https://github.com/awslabs/python-deequ/actions/runs/8886301683/job/24399475419?pr=203

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o238.run.
E                   : java.lang.NoSuchMethodError: 'breeze.generic.UFunc$UImpl2 breeze.linalg.DenseVector$.canSubD()'
E                       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.diff(BaseChangeStrategy.scala:65)
E                       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.diff$(BaseChangeStrategy.scala:58)
E                       at com.amazon.deequ.anomalydetection.AbsoluteChangeStrategy.diff(AbsoluteChangeStrategy.scala:33)
E                       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.detect(BaseChangeStrategy.scala:90)
E                       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.detect$(BaseChangeStrategy.scala:80)
E                       at com.amazon.deequ.anomalydetection.AbsoluteChangeStrategy.detect(AbsoluteChangeStrategy.scala:33)
E                       at com.amazon.deequ.anomalydetection.AnomalyDetector.detectAnomaliesInHistory(AnomalyDetector.scala:98)
E                       at com.amazon.deequ.anomalydetection.AnomalyDetector.isNewPointAnomalous(AnomalyDetector.scala:60)
E                       at com.amazon.deequ.checks.Check$.isNewestPointNonAnomalous(Check.scala:1354)
E                       at com.amazon.deequ.checks.Check.$anonfun$isNewestPointNonAnomalous$1(Check.scala:583)
E                       at scala.runtime.java8.JFunction1$mcZD$sp.apply(JFunction1$mcZD$sp.java:23)
E                       at com.amazon.deequ.constraints.AnalysisBasedConstraint.runAssertion(AnalysisBasedConstraint.scala:108)
E                       at com.amazon.deequ.constraints.AnalysisBasedConstraint.pickValueAndAssert(AnalysisBasedConstraint.scala:74)
E                       at com.amazon.deequ.constraints.AnalysisBasedConstraint.$anonfun$evaluate$2(AnalysisBasedConstraint.scala:60)
E                       at scala.Option.map(Option.scala:230)
E                       at com.amazon.deequ.constraints.AnalysisBasedConstraint.evaluate(AnalysisBasedConstraint.scala:60)
E                       at com.amazon.deequ.constraints.ConstraintDecorator.evaluate(Constraint.scala:60)
E                       at com.amazon.deequ.checks.Check.$anonfun$evaluate$1(Check.scala:1246)
E                       at scala.collection.immutable.List.map(List.scala:293)
E                       at com.amazon.deequ.checks.Check.evaluate(Check.scala:1246)
E                       at com.amazon.deequ.VerificationSuite.$anonfun$evaluate$1(VerificationSuite.scala:269)
E                       at scala.collection.immutable.List.map(List.scala:293)
E                       at com.amazon.deequ.VerificationSuite.evaluate(VerificationSuite.scala:269)
E                       at com.amazon.deequ.VerificationSuite.doVerificationRun(VerificationSuite.scala:132)
E                       at com.amazon.deequ.VerificationRunBuilder.run(VerificationRunBuilder.scala:172)
E                       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
E                       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
E                       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
E                       at py4j.Gateway.invoke(Gateway.java:282)
E                       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                       at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                       at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E                       at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E                       at java.base/java.lang.Thread.run(Thread.java:840)
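
The two useful facts in a trace like the one above are the missing symbol (a Breeze method) and the first deequ frame that called it. A minimal sketch of pulling both out of the text follows; summarize_trace is a hypothetical helper written for illustration, and the regular expressions assume the JVM's usual "at package.Class.method(File.scala:NN)" frame format.

```python
import re


def summarize_trace(trace: str):
    """Return (missing_symbol, first_deequ_frame) from a py4j/JVM stack trace."""
    # The JVM quotes the unresolved method signature after NoSuchMethodError.
    missing = re.search(r"NoSuchMethodError: '?([^'\n]+)'?", trace)
    # The first com.amazon.deequ frame is the deequ call site that hit the error.
    frame = re.search(r"at (com\.amazon\.deequ\.[\w.$]+)\(([\w.]+):(\d+)\)", trace)
    return (
        missing.group(1).strip() if missing else None,
        f"{frame.group(1)} ({frame.group(2)}:{frame.group(3)})" if frame else None,
    )


# First lines of the trace above, copied verbatim.
sample = """E   : java.lang.NoSuchMethodError: 'breeze.generic.UFunc$UImpl2 breeze.linalg.DenseVector$.canSubD()'
E       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.diff(BaseChangeStrategy.scala:65)
E       at com.amazon.deequ.anomalydetection.BaseChangeStrategy.diff$(BaseChangeStrategy.scala:58)
"""

symbol, call_site = summarize_trace(sample)
print(symbol)
print(call_site)
```

For this trace, the missing symbol lives in breeze.linalg.DenseVector and the call site is BaseChangeStrategy.diff, which is consistent with deequ having been compiled against a different Breeze version than the one on the runtime classpath.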