dsp-uga / andromeda

This repository contains a Naive Bayes classifier for document classification, built for CSCI 8360, Data Science Practicum at the University of Georgia, Spring 2018.
MIT License

[Windows][PyCharm] Error: too many values to unpack #17

Closed nihalsoans91 closed 6 years ago

nihalsoans91 commented 6 years ago

```
C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\python.exe C:/Users/nihal/PycharmProjects/p1/p1.py C:\Users\nihal\Desktop\Data\train\X_train_vsmall.txt C:\Users\nihal\Desktop\Data\train\y_train_vsmall.txt C:\Users\nihal\Desktop\Data\test\X_test_vsmall.txt
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/01/25 11:13:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
[Stage 2:> (0 + 2) / 2]
C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
[Stage 3:> (0 + 2) / 2]
C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
[Stage 4:> (0 + 2) / 2]
18/01/25 11:13:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 346, in func
    return f(iterator)
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 1842, in combineLocally
    merger.mergeValues(iterator)
  File "C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack (expected 2)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

18/01/25 11:13:18 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 8, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 346, in func
    return f(iterator)
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 1842, in combineLocally
    merger.mergeValues(iterator)
  File "C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack (expected 2)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

18/01/25 11:13:18 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users/nihal/PycharmProjects/p1/p1.py", line 368, in <module>
    cp_rdd = cond_prob_rdd(cp_rdd, rdd)
  File "C:/Users/nihal/PycharmProjects/p1/p1.py", line 233, in cond_prob_rdd
    list_same_label = rdd_same_label.flatMap(lambda x: tuple(x[1])).reduceByKey(add).collect()
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 809, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\py4j\java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\py4j\protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 8, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 346, in func
    return f(iterator)
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 1842, in combineLocally
    merger.mergeValues(iterator)
  File "C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack (expected 2)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:467)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 346, in func
    return f(iterator)
  File "C:\Users\nihal\AppData\Local\conda\conda\envs\datasc\lib\site-packages\pyspark\rdd.py", line 1842, in combineLocally
    merger.mergeValues(iterator)
  File "C:\spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: too many values to unpack (expected 2)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:404)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

Process finished with exit code 1
```

WeiwenXu21 commented 6 years ago

It seems to be complaining about line 233, in `cond_prob_rdd`: `list_same_label = rdd_same_label.flatMap(lambda x: tuple(x[1])).reduceByKey(add).collect()`

This might be happening because I changed the structure of the data passed into the function `cond_prob_rdd`. Mel and I will look into it this afternoon.
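
For anyone hitting the same `ValueError`: `reduceByKey` only works on an RDD of `(key, value)` pairs, so whatever the `flatMap` on line 233 emits has to be 2-tuples; the `for k, v in iterator` inside Spark's shuffle merger is what raises the error otherwise. The snippet below is a minimal sketch with made-up data (not the real `rdd_same_label` structure) that reproduces the failure mode and shows a quick `take(1)` check:

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext("local[2]", "unpack-repro")

# Hypothetical stand-in for rdd_same_label: (label, [(word, count), ...]).
good = sc.parallelize([("label1", [("apple", 1), ("pear", 2)]),
                       ("label1", [("apple", 3)])])
# Flattened elements are (word, count) pairs, so reduceByKey works.
print(good.flatMap(lambda x: tuple(x[1])).reduceByKey(add).collect())
# -> [('apple', 4), ('pear', 2)]  (order may vary)

# If the inner structure changes to 3-tuples, e.g. (word, count, extra),
# the flattened elements are no longer (key, value) pairs and reduceByKey
# fails with "ValueError: too many values to unpack (expected 2)".
bad = sc.parallelize([("label1", [("apple", 1, 0.5), ("pear", 2, 0.1)])])
# bad.flatMap(lambda x: tuple(x[1])).reduceByKey(add).collect()  # raises the error

# Quick sanity check of what the lambda actually yields before shuffling:
print(bad.flatMap(lambda x: tuple(x[1])).take(1))
```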

WeiwenXu21 commented 6 years ago

This bug should now be fixed!