I altered step 3 to the following command:
spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt
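For reference, here is the same step-3 invocation with each option annotated. The option meanings are the standard spark-submit ones; the roles of the three trailing positional arguments are my assumption, inferred from the paths and from the textFile/saveAsTextFile lines in the log below.

# --master local[*]        run in-process, one worker thread per CPU core
# --driver-memory 2g       heap size for the driver JVM
# --jars ...               extra jars shipped to the driver/executor classpath
# --conf spark.storage.memoryFraction=1
#                          reserve the whole Spark-managed heap for cached
#                          blocks (the default is 0.6 on Spark 1.5)
# --class newman.Driver    main class inside the application jar
# positional arguments     the application jar, then (assumed) the input JSON
#                          directory, the output directory, and the
#                          extension-whitelist file
spark-submit --master local[*] --driver-memory 2g \
  --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar \
  --conf spark.storage.memoryFraction=1 \
  --class newman.Driver lib/tika-extract_2.10-1.0.1.jar \
  pst-json/ spark-attach/ etc/exts.txt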
It fails with the following output:
/pst-extract$ spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt
INFO Running Spark version 1.5.0
WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN
SPARK_WORKER_INSTANCES was detected (set to '4').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
WARN Your hostname, precise32 resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
WARN Set SPARK_LOCAL_IP if you need to bind to another address
INFO Changing view acls to: vagrant
INFO Changing modify acls to: vagrant
INFO SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vagrant); users with modify permissions: Set(vagrant)
INFO Slf4jLogger started
INFO Starting remoting
INFO Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:54231]
INFO Successfully started service 'sparkDriver' on port 54231.
INFO Registering MapOutputTracker
INFO Registering BlockManagerMaster
INFO Created local directory at /tmp/blockmgr-06245dd6-1764-4ac2-a818-f83c04546e51
INFO MemoryStore started with capacity 1781.8 MB
INFO HTTP File server directory is /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/httpd-11910909-50ad-4114-9d40-6a1688a10d72
INFO Starting HTTP Server
INFO jetty-8.y.z-SNAPSHOT
INFO Started SocketConnector@0.0.0.0:46457
INFO Successfully started service 'HTTP file server' on port 46457.
INFO Registering OutputCommitCoordinator
INFO jetty-8.y.z-SNAPSHOT
INFO Started SelectChannelConnector@0.0.0.0:4040
INFO Successfully started service 'SparkUI' on port 4040.
INFO Started SparkUI at http://10.0.2.15:4040
INFO Added JAR file:/pst-extract/lib/tika-app-1.10.jar at http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Added JAR file:/pst-extract/lib/commons-codec-1.10.jar at http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Added JAR file:/pst-extract/lib/tika-extract_2.10-1.0.1.jar at http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
WARN Using default name DAGScheduler for source because spark.app.id is not set.
INFO Starting executor ID driver on host localhost
INFO Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57497.
INFO Server created on 57497
INFO Trying to register BlockManager
INFO Registering block manager localhost:57497 with 1781.8 MB RAM, BlockManagerId(driver, localhost, 57497)
INFO Registered BlockManager
Extension filter: List(doc, docx, txt, pdf, xls, xlsx, rtf, xml, html, htm, ppt, pptx)
WARN Failed to check whether UseCompressedOops is set; assuming yes
INFO ensureFreeSpace(123856) called with curMem=0, maxMem=1868326502
INFO Block broadcast_0 stored as values in memory (estimated size 121.0 KB, free 1781.7 MB)
INFO ensureFreeSpace(11436) called with curMem=123856, maxMem=1868326502
INFO Block broadcast_0_piece0 stored as bytes in memory (estimated size 11.2 KB, free 1781.6 MB)
INFO Added broadcast_0_piece0 in memory on localhost:57497 (size: 11.2 KB, free: 1781.8 MB)
INFO Created broadcast 0 from textFile at Driver.scala:133
INFO mapred.tip.id is deprecated. Instead, use mapreduce.task.id
INFO mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
INFO mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
INFO mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
INFO mapred.job.id is deprecated. Instead, use mapreduce.job.id
INFO Total input paths to process : 10
INFO Starting job: saveAsTextFile at Driver.scala:134
INFO Got job 0 (saveAsTextFile at Driver.scala:134) with 43 output partitions
INFO Final stage: ResultStage 0(saveAsTextFile at Driver.scala:134)
INFO Parents of final stage: List()
INFO Missing parents: List()
INFO Submitting ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134), which has no missing parents
INFO ensureFreeSpace(104024) called with curMem=135292, maxMem=1868326502
INFO Block broadcast_1 stored as values in memory (estimated size 101.6 KB, free 1781.5 MB)
INFO ensureFreeSpace(34556) called with curMem=239316, maxMem=1868326502
INFO Block broadcast_1_piece0 stored as bytes in memory (estimated size 33.7 KB, free 1781.5 MB)
INFO Added broadcast_1_piece0 in memory on localhost:57497 (size: 33.7 KB, free: 1781.7 MB)
INFO Created broadcast 1 from broadcast at DAGScheduler.scala:861
INFO Submitting 43 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134)
INFO Adding task set 0.0 with 43 tasks
INFO Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 0.0 in stage 0.0 (TID 0)
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp6999556033463677797.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-extract_2.10-1.0.1.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp5562645691215034148.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/commons-codec-1.10.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp4410224147291656134.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-app-1.10.jar to class loader
INFO Input split: file:/pst-extract/pst-json/output_part_000003:0+146275050
INFO Saved output of task 'attempt_201511280050_0000_m_000000_0' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000000
INFO attempt_201511280050_0000_m_000000_0: Committed
INFO Finished task 0.0 in stage 0.0 (TID 0). 2044 bytes result sent to driver
INFO Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 1.0 in stage 0.0 (TID 1)
INFO Finished task 0.0 in stage 0.0 (TID 0) in 1955996 ms on localhost (1/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:0+134217728
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000001_1' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000001
INFO attempt_201511280050_0000_m_000001_1: Committed
INFO Finished task 1.0 in stage 0.0 (TID 1). 2044 bytes result sent to driver
INFO Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 2.0 in stage 0.0 (TID 2)
INFO Finished task 1.0 in stage 0.0 (TID 1) in 2028530 ms on localhost (2/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:134217728+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000002_2' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000002
INFO attempt_201511280050_0000_m_000002_2: Committed
INFO Finished task 2.0 in stage 0.0 (TID 2). 2044 bytes result sent to driver
INFO Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 3.0 in stage 0.0 (TID 3)
INFO Finished task 2.0 in stage 0.0 (TID 2) in 2334175 ms on localhost (3/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:268435456+134217728
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000003_3' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000003
INFO attempt_201511280050_0000_m_000003_3: Committed
INFO Finished task 3.0 in stage 0.0 (TID 3). 2044 bytes result sent to driver
INFO Starting task 4.0 in stage 0.0 (TID 4, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 4.0 in stage 0.0 (TID 4)
INFO Finished task 3.0 in stage 0.0 (TID 3) in 2291524 ms on localhost (4/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:402653184+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000004_4' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000004
INFO attempt_201511280050_0000_m_000004_4: Committed
INFO Finished task 4.0 in stage 0.0 (TID 4). 2044 bytes result sent to driver
INFO Starting task 5.0 in stage 0.0 (TID 5, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Finished task 4.0 in stage 0.0 (TID 4) in 2228523 ms on localhost (5/43)
INFO Running task 5.0 in stage 0.0 (TID 5)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:536870912+106907138
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException..
So what is correct then, instead of the original recipe for step 3?