apache / incubator-wayang

Apache Wayang (incubating) is the first cross-platform data processing system.
https://wayang.incubator.apache.org/
Apache License 2.0

CardinalityRepository is unusable #411

Status: Closed (closed by juripetersen 7 months ago)

juripetersen commented 7 months ago

As of now, the code in the CardinalityRepository is commented out. However, sampling and storing measured cardinalities can be an essential step, especially for ML use cases.

I think we should try to fix the code to provide this functionality to users.

juripetersen commented 7 months ago

The desired output could look like this, with one JSON object per line for each measured operator:

{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":206,"upperBound":206,"confidence":1.0}],"operator":{"class":"org.apache.wayang.spark.operators.SparkFlatMapOperator"},"output":{"name":"out","index":0,"cardinality":1759}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":206,"upperBound":206,"confidence":1.0}],"operator":{"class":"org.apache.wayang.basic.operators.FlatMapOperator"},"output":{"name":"out","index":0,"cardinality":1759}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":206,"upperBound":206,"confidence":1.0}],"operator":{"class":"org.apache.wayang.java.operators.JavaFlatMapOperator"},"output":{"name":"out","index":0,"cardinality":1759}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1759,"upperBound":1759,"confidence":1.0}],"operator":{"class":"org.apache.wayang.basic.operators.FilterOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1759,"upperBound":1759,"confidence":1.0}],"operator":{"class":"org.apache.wayang.spark.operators.SparkFilterOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1759,"upperBound":1759,"confidence":1.0}],"operator":{"class":"org.apache.wayang.java.operators.JavaFilterOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.basic.operators.MapOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.spark.operators.SparkMapOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.java.operators.JavaMapOperator"},"output":{"name":"out","index":0,"cardinality":1611}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.basic.operators.ReduceByOperator"},"output":{"name":"out","index":0,"cardinality":493}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.spark.operators.SparkReduceByOperator"},"output":{"name":"out","index":0,"cardinality":493}}
{"inputs":[{"name":"in","index":0,"isBroadcast":false,"lowerBound":1611,"upperBound":1611,"confidence":1.0}],"operator":{"class":"org.apache.wayang.java.operators.JavaReduceByOperator"},"output":{"name":"out","index":0,"cardinality":493}}

juripetersen commented 7 months ago

Logging the cardinalities requires a configuration like this:

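// Assumes filePath is a String holding the path of the file the cardinalities are written to.
Configuration config = new Configuration();
// Enable logging and set where the measured cardinalities are stored.
config.setProperty("wayang.core.log.enabled", "true");
config.setProperty("wayang.core.log.cardinalities", filePath);
// Use full instrumentation so that cardinalities can be measured for the operators in the plan.
config.setProperty("wayang.core.optimizer.instrumentation", "org.apache.wayang.core.profiling.FullInstrumentationStrategy");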
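
For the ML use case, the logged file then has to be read back in. The following is only a minimal sketch of how that might look, assuming the file contains one JSON object per line in the shape of the sample output above. The class `CardinalityLogReader`, its `Sample` type, and the `selectivity()` helper are hypothetical names for illustration and are not part of Wayang. To stay dependency-free it extracts fields with a regular expression and only looks at the first input of each record; a real consumer would more likely use a proper JSON parser and handle multi-input operators.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Wayang: reads logged cardinality records.
public class CardinalityLogReader {

    // One record: operator class, bounds of the first input, measured output cardinality.
    public static final class Sample {
        public final String operatorClass;
        public final long inputLowerBound;
        public final long inputUpperBound;
        public final long outputCardinality;

        Sample(String operatorClass, long lower, long upper, long output) {
            this.operatorClass = operatorClass;
            this.inputLowerBound = lower;
            this.inputUpperBound = upper;
            this.outputCardinality = output;
        }

        // Ratio of output to input cardinality; NaN if the input was empty.
        public double selectivity() {
            return inputUpperBound == 0
                    ? Double.NaN
                    : (double) outputCardinality / inputUpperBound;
        }
    }

    // Assumes the field order shown in the sample output (inputs, operator, output).
    private static final Pattern RECORD = Pattern.compile(
            "\"lowerBound\":(\\d+),\"upperBound\":(\\d+).*?"
                    + "\"class\":\"([^\"]+)\".*?"
                    + "\"cardinality\":(\\d+)");

    // Returns the parsed record, or null if the line does not match the assumed shape.
    public static Sample parseLine(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new Sample(
                m.group(3),
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(4)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CardinalityLogReader <cardinality-log-file>");
            return;
        }
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Sample s = parseLine(line);
            if (s != null) {
                System.out.printf("%s in=%d out=%d selectivity=%.3f%n",
                        s.operatorClass, s.inputUpperBound,
                        s.outputCardinality, s.selectivity());
            }
        }
    }
}
```

Run against the file passed as `wayang.core.log.cardinalities`, this prints one line per record with the operator class, input and output cardinality, and their ratio, which could serve as a starting point for building training data.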