Sotera / pst-extraction

PST extraction and analytic pipeline
Apache License 2.0

step 3 complaining | ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException #1

Closed (jorge80 closed this issue 8 years ago)

jorge80 commented 8 years ago

I altered step 3 to the following command:

spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt

It is failing with the following output:

/pst-extract$ spark-submit --master local[*] --driver-memory 2g --jars lib/tika-app-1.10.jar,lib/commons-codec-1.10.jar --conf spark.storage.memoryFraction=1 --class newman.Driver lib/tika-extract_2.10-1.0.1.jar pst-json/ spark-attach/ etc/exts.txt
INFO Running Spark version 1.5.0
WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN SPARK_WORKER_INSTANCES was detected (set to '4'). This is deprecated in Spark 1.0+.
Please instead use:
WARN Your hostname, precise32 resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
WARN Set SPARK_LOCAL_IP if you need to bind to another address
INFO Changing view acls to: vagrant
INFO Changing modify acls to: vagrant
INFO SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vagrant); users with modify permissions: Set(vagrant)
INFO Slf4jLogger started
INFO Starting remoting
INFO Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:54231]
INFO Successfully started service 'sparkDriver' on port 54231.
INFO Registering MapOutputTracker
INFO Registering BlockManagerMaster
INFO Created local directory at /tmp/blockmgr-06245dd6-1764-4ac2-a818-f83c04546e51
INFO MemoryStore started with capacity 1781.8 MB
INFO HTTP File server directory is /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/httpd-11910909-50ad-4114-9d40-6a1688a10d72
INFO Starting HTTP Server
INFO jetty-8.y.z-SNAPSHOT
INFO Started SocketConnector@0.0.0.0:46457
INFO Successfully started service 'HTTP file server' on port 46457.
INFO Registering OutputCommitCoordinator
INFO jetty-8.y.z-SNAPSHOT
INFO Started SelectChannelConnector@0.0.0.0:4040
INFO Successfully started service 'SparkUI' on port 4040.
INFO Started SparkUI at http://10.0.2.15:4040
INFO Added JAR file:/pst-extract/lib/tika-app-1.10.jar at http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Added JAR file:/pst-extract/lib/commons-codec-1.10.jar at http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Added JAR file:/pst-extract/lib/tika-extract_2.10-1.0.1.jar at http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
WARN Using default name DAGScheduler for source because spark.app.id is not set.
INFO Starting executor ID driver on host localhost
INFO Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57497.
INFO Server created on 57497
INFO Trying to register BlockManager
INFO Registering block manager localhost:57497 with 1781.8 MB RAM, BlockManagerId(driver, localhost, 57497)
INFO Registered BlockManager
Extension filter: List(doc, docx, txt, pdf, xls, xlsx, rtf, xml, html, htm, ppt, pptx)
WARN Failed to check whether UseCompressedOops is set; assuming yes
INFO ensureFreeSpace(123856) called with curMem=0, maxMem=1868326502
INFO Block broadcast_0 stored as values in memory (estimated size 121.0 KB, free 1781.7 MB)
INFO ensureFreeSpace(11436) called with curMem=123856, maxMem=1868326502
INFO Block broadcast_0_piece0 stored as bytes in memory (estimated size 11.2 KB, free 1781.6 MB)
INFO Added broadcast_0_piece0 in memory on localhost:57497 (size: 11.2 KB, free: 1781.8 MB)
INFO Created broadcast 0 from textFile at Driver.scala:133
INFO mapred.tip.id is deprecated. Instead, use mapreduce.task.id
INFO mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
INFO mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
INFO mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
INFO mapred.job.id is deprecated. Instead, use mapreduce.job.id
INFO Total input paths to process : 10
INFO Starting job: saveAsTextFile at Driver.scala:134
INFO Got job 0 (saveAsTextFile at Driver.scala:134) with 43 output partitions
INFO Final stage: ResultStage 0(saveAsTextFile at Driver.scala:134)
INFO Parents of final stage: List()
INFO Missing parents: List()
INFO Submitting ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134), which has no missing parents
INFO ensureFreeSpace(104024) called with curMem=135292, maxMem=1868326502
INFO Block broadcast_1 stored as values in memory (estimated size 101.6 KB, free 1781.5 MB)
INFO ensureFreeSpace(34556) called with curMem=239316, maxMem=1868326502
INFO Block broadcast_1_piece0 stored as bytes in memory (estimated size 33.7 KB, free 1781.5 MB)
INFO Added broadcast_1_piece0 in memory on localhost:57497 (size: 33.7 KB, free: 1781.7 MB)
INFO Created broadcast 1 from broadcast at DAGScheduler.scala:861
INFO Submitting 43 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at saveAsTextFile at Driver.scala:134)
INFO Adding task set 0.0 with 43 tasks
INFO Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 0.0 in stage 0.0 (TID 0)
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar with timestamp 1448671847656
INFO Fetching http://10.0.2.15:46457/jars/tika-extract_2.10-1.0.1.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp6999556033463677797.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-extract_2.10-1.0.1.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar with timestamp 1448671847650
INFO Fetching http://10.0.2.15:46457/jars/commons-codec-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp5562645691215034148.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/commons-codec-1.10.jar to class loader
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar with timestamp 1448671847626
INFO Fetching http://10.0.2.15:46457/jars/tika-app-1.10.jar to /tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/fetchFileTemp4410224147291656134.tmp
INFO Adding file:/tmp/spark-152ba82d-5038-4152-b8bf-1db117cbe5cf/userFiles-2b6f5374-cc8d-4464-9916-e8aaccf906dd/tika-app-1.10.jar to class loader
INFO Input split: file:/pst-extract/pst-json/output_part_000003:0+146275050
INFO Saved output of task 'attempt_201511280050_0000_m_000000_0' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000000
INFO attempt_201511280050_0000_m_000000_0: Committed
INFO Finished task 0.0 in stage 0.0 (TID 0). 2044 bytes result sent to driver
INFO Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 1.0 in stage 0.0 (TID 1)
INFO Finished task 0.0 in stage 0.0 (TID 0) in 1955996 ms on localhost (1/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:0+134217728
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000001_1' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000001
INFO attempt_201511280050_0000_m_000001_1: Committed
INFO Finished task 1.0 in stage 0.0 (TID 1). 2044 bytes result sent to driver
INFO Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 2.0 in stage 0.0 (TID 2)
INFO Finished task 1.0 in stage 0.0 (TID 1) in 2028530 ms on localhost (2/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:134217728+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000002_2' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000002
INFO attempt_201511280050_0000_m_000002_2: Committed
INFO Finished task 2.0 in stage 0.0 (TID 2). 2044 bytes result sent to driver
INFO Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 3.0 in stage 0.0 (TID 3)
INFO Finished task 2.0 in stage 0.0 (TID 2) in 2334175 ms on localhost (3/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:268435456+134217728
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException
INFO Document is encrypted
INFO Saved output of task 'attempt_201511280050_0000_m_000003_3' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000003
INFO attempt_201511280050_0000_m_000003_3: Committed
INFO Finished task 3.0 in stage 0.0 (TID 3). 2044 bytes result sent to driver
INFO Starting task 4.0 in stage 0.0 (TID 4, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Running task 4.0 in stage 0.0 (TID 4)
INFO Finished task 3.0 in stage 0.0 (TID 3) in 2291524 ms on localhost (4/43)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:402653184+134217728
INFO Saved output of task 'attempt_201511280050_0000_m_000004_4' to file:/pst-extract/spark-attach/_temporary/0/task_201511280050_0000_m_000004
INFO attempt_201511280050_0000_m_000004_4: Committed
INFO Finished task 4.0 in stage 0.0 (TID 4). 2044 bytes result sent to driver
INFO Starting task 5.0 in stage 0.0 (TID 5, localhost, PROCESS_LOCAL, 2334 bytes)
INFO Finished task 4.0 in stage 0.0 (TID 4) in 2228523 ms on localhost (5/43)
INFO Running task 5.0 in stage 0.0 (TID 5)
INFO Input split: file:/pst-extract/pst-json/output_part_000004:536870912+106907138
ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException
...

So what is the correct command to use, then, instead of the original recipe for step 3?
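To see which input splits trigger the FlateFilter error, one option is to pair each ERROR entry with the most recent "Input split" entry in the log. Below is a minimal Python sketch of that idea. It assumes the log has been saved with one entry per line and that tasks run one at a time, as they appear to in the output above. The function name `errors_by_split` and the `sample` list are illustrative only and are not part of pst-extraction.

```python
import re

# Matches the split identifier that follows "Input split: " in a Spark log entry.
SPLIT_RE = re.compile(r"Input split: (\S+)")


def errors_by_split(log_lines):
    """Map each input split to the ERROR entries logged while it was being read.

    Assumption: tasks do not interleave, so an ERROR entry belongs to the most
    recently announced input split.
    """
    errors = {}
    current = None
    for line in log_lines:
        line = line.strip()
        match = SPLIT_RE.search(line)
        if match:
            current = match.group(1)
            errors.setdefault(current, [])
        elif line.startswith("ERROR") and current is not None:
            errors[current].append(line)
    return errors


# A few entries copied from the log in this issue.
sample = [
    "INFO Input split: file:/pst-extract/pst-json/output_part_000004:134217728+134217728",
    "INFO attempt_201511280050_0000_m_000002_2: Committed",
    "INFO Input split: file:/pst-extract/pst-json/output_part_000004:268435456+134217728",
    "ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException",
    "INFO Document is encrypted",
    "INFO attempt_201511280050_0000_m_000003_3: Committed",
]

result = errors_by_split(sample)
for split, errs in result.items():
    print(split, len(errs))
```

In the log above, the task that logged the error for split 268435456+134217728 still saved its output and was committed, so a per-split summary like this mainly helps locate the offending attachments rather than failed tasks.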

jorge80 commented 8 years ago

My bad, it works now. The data needs to be really clean.