databricks/spark-perf

Performance tests for Apache Spark
Apache License 2.0

Python mllib tests failing #46

Closed: tsailiming closed this 9 years ago

tsailiming commented 9 years ago

All tests are failing because the generated random seed falls outside the range NumPy accepts.

Number of failed tests: 8, failed tests: python-glm-classification,python-glm-classification,python-glm-regression,python-naive-bayes,python-als,python-kmeans,python-pearson,python-spearman

15/01/23 23:17:32 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, numaq1-1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/net/home/ltsai/a/spark-perf.new/pyspark-tests/mllib_data.py", line 21, in gen
    rng = numpy.random.RandomState(hash(str(seed ^ index)))
  File "mtrand.pyx", line 613, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7402)
  File "mtrand.pyx", line 649, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7702)
ValueError: Seed must be between 0 and 4294967295

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
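
For context, the failure is reproducible outside Spark. On 64-bit CPython 2, hash() of a string can be any signed 64-bit value, so the number handed to numpy.random.RandomState may be negative or exceed 2**32 - 1. A minimal sketch (the seed value here is illustrative, not taken from the test harness):

    import numpy

    seed = 1234567890  # illustrative; the harness derives its own seed
    for index in range(16):
        h = hash(str(seed ^ index))  # any signed 64-bit value on 64-bit CPython 2
        try:
            numpy.random.RandomState(h)
        except (ValueError, OverflowError) as e:
            # NumPy 1.9 raises ValueError for seeds outside [0, 2**32 - 1];
            # NumPy 1.8 raises OverflowError for negative seeds.
            print("index %d: seed %d rejected (%s)" % (index, h, e))
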
JoshRosen commented 9 years ago

This is due to an interaction between a PySpark bug and NumPy 1.9; see https://github.com/thunder-project/thunder/issues/41 for another report.

It looks like this was fixed in Spark 1.2.0, but not in other branches: https://issues.apache.org/jira/browse/SPARK-3995

Since I think we initialize the seed ourselves, can we fix this in spark-perf by adding a modulus where we set the seed?
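
A sketch of that change against line 21 of pyspark-tests/mllib_data.py (the helper name is hypothetical; only the trailing modulus is the proposed fix):

    import numpy

    def seeded_rng(seed, index):
        # Reduce the hashed value modulo 2**32 so it always lands in
        # RandomState's accepted range; Python's % yields a non-negative
        # result here even when hash() returns a negative number.
        return numpy.random.RandomState(hash(str(seed ^ index)) % 4294967296)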

tsailiming commented 9 years ago

I'm using NumPy 1.9.1 installed via pip.

tsailiming commented 9 years ago

I'm using Spark 1.2.0 too.

tsailiming commented 9 years ago

Another error appears after downgrading to NumPy 1.8.2:

15/01/26 11:26:59 INFO scheduler.TaskSetManager: Lost task 200.3 in stage 0.0 (TID 261) on executor numaq1-4: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/net/home/ltsai/a/spark-perf.new/pyspark-tests/mllib_data.py", line 21, in gen
    rng = numpy.random.RandomState(hash(str(seed ^ index)))
  File "mtrand.pyx", line 574, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:5495)
  File "mtrand.pyx", line 606, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:5712)
OverflowError: can't convert negative value to unsigned long
mengxr commented 9 years ago

#60 should fix this with the 0xffffffff mask.
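
In code, presumably something along these lines (seed and index stand in for the test parameters; & 0xFFFFFFFF is equivalent to % 2**32 for this purpose):

    import numpy

    seed, index = 42, 11  # stand-ins for the harness parameters
    # The mask keeps the seed in [0, 2**32 - 1], avoiding both NumPy 1.9's
    # ValueError and NumPy 1.8's OverflowError on negative values.
    rng = numpy.random.RandomState(hash(str(seed ^ index)) & 0xFFFFFFFF)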