Constannnnnt / Distributed-CoreNLP

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark with Java, aims to process document annotations at large scale.
https://github.com/Constannnnnt/Distributed-CoreNLP
MIT License

Efficiency #21

Closed: ji-xin closed this issue 5 years ago

ji-xin commented 5 years ago

We are not sure which of the following is actually done by Spark:

  1. construct a StanfordCoreNLP object when a mapper is initialized, then use it to map each input
  2. construct a new StanfordCoreNLP object for each input

If 2 is the case, the efficiency will be unbearable.

Experiments are running at the moment to check which is true.
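
For reference, a minimal sketch of what case 2 would look like with Spark's Java API (hypothetical names such as lines and props; not the repo's actual code). Building the pipeline inside map reloads every annotator model for each record:

import java.util.Properties;
import org.apache.spark.api.java.JavaRDD;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Case 2: a new StanfordCoreNLP per input record.
// lines is assumed to be an existing JavaRDD<String>.
JavaRDD<String> annotated = lines.map(line -> {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // expensive: reloads models on every call
    Annotation doc = new Annotation(line);
    pipeline.annotate(doc);
    return doc.toString();
});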

Constannnnnt commented 5 years ago

From the current setup, I guess it is 2.

I tested broadcasting the StanfordCoreNLP object, but that failed.

ji-xin commented 5 years ago
➜  cat run.sh 
spark-submit --class ca.uwaterloo.cs651.project.CoreNLP --conf spark.executor.heartbeatInterval=20s --conf spark.network.timeout=600s --driver-memory 6G --executor-memory 20G target/project-1.0.jar -input sampledata -output output -functionality tokenize,cleanxml,ssplit,pos,lemma,ner,parse,depparse,sentiment,natlog,coref,relation

Experiment 1

➜  wc -l sampledata 
3 sampledata
time ./run.sh
./run.sh  343.36s user 31.80s system 421% cpu 1:29.07 total

Experiment 2

➜  for i in `seq 1 100`; do
for> cat sampledata >> bigdata
for> done

➜  wc -l bigdata 
303 bigdata
time ./run.sh  # in run.sh, the input sampledata is replaced with bigdata
CPU time limit exceeded
./run.sh  3527.43s user 75.91s system 120% cpu 49:46.18 total

Conclusion

It sucks: the 100x-larger input exceeded the CPU time limit and took nearly 50 minutes.

ji-xin commented 5 years ago

Possible solutions:

  1. use mapPartitions instead of map
  2. class SerializableStanfordCoreNLP extends StanfordCoreNLP implements java.io.Serializable
  3. rewrite in Hadoop :(

The first one seems to be the easiest. Let me try it.
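
A minimal sketch of solution 1 (again with hypothetical names, not the repo's code): mapPartitions lets the pipeline be constructed once per partition and reused for every record in it:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.spark.api.java.JavaRDD;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Solution 1: one StanfordCoreNLP per partition, shared by all records in it.
// lines is assumed to be an existing JavaRDD<String>.
JavaRDD<String> annotated = lines.mapPartitions(iter -> {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,ner,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // loaded once per partition
    List<String> out = new ArrayList<>();
    while (iter.hasNext()) {
        Annotation doc = new Annotation(iter.next());
        pipeline.annotate(doc);
        out.add(doc.toString());
    }
    return out.iterator(); // Spark 2.x expects an Iterator from the FlatMapFunction
});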

Constannnnnt commented 5 years ago

After some research, I am leaning toward the third solution. :(

ji-xin commented 5 years ago

Using solution 1

spark-submit --class ca.uwaterloo.cs651.project.CoreNLP --conf spark.executor.heartbeatInterval=1s --conf spark.network.timeout=2s --driver-memory 6G --executor-memory 20G target/project-1.0.jar -input sampledata -output output -functionality tokenize,ssplit,pos,ner,parse
  1. sampledata ./run.sh 1:05.67

  2. bigdata ./run.sh 46:41.54

Not bad.

One thing is for sure: messages such as 2018-11-21 14:08:38 INFO StanfordCoreNLP:88 - Adding annotator tokenize now appear only once per functionality.

ji-xin commented 5 years ago

When I try to run this on the datasci cluster, I encounter the following problem:

2018-11-21 19:43:22 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-11-21 19:43:23 INFO  CoreNLP:137 - Tool: CoreNLP
2018-11-21 19:43:23 INFO  CoreNLP:138 - input path: sampledata
2018-11-21 19:43:23 INFO  CoreNLP:139 - output path: output
2018-11-21 19:43:23 INFO  CoreNLP:140 -  - functionalities: tokenize,ssplit,pos,ner,parse
2018-11-21 19:43:23 INFO  SparkContext:54 - Running Spark version 2.3.1
2018-11-21 19:43:23 INFO  SparkContext:54 - Submitted application: CoreNLP
2018-11-21 19:43:23 INFO  SecurityManager:54 - Changing view acls to: j9xin
2018-11-21 19:43:23 INFO  SecurityManager:54 - Changing modify acls to: j9xin
2018-11-21 19:43:23 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-11-21 19:43:23 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-11-21 19:43:23 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(j9xin); groups with view permissions: Set(); users  with modify permissions: Set(j9xin); groups with modify permissions: Set()
2018-11-21 19:43:23 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 43661.
2018-11-21 19:43:23 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-11-21 19:43:23 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-11-21 19:43:23 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-11-21 19:43:23 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-11-21 19:43:23 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-8a36b07b-fc5c-4cc0-9451-61f4660635b4
2018-11-21 19:43:23 INFO  MemoryStore:54 - MemoryStore started with capacity 3.0 GB
2018-11-21 19:43:23 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-11-21 19:43:23 INFO  log:192 - Logging initialized @1969ms
2018-11-21 19:43:23 INFO  Server:346 - jetty-9.3.z-SNAPSHOT
2018-11-21 19:43:23 INFO  Server:414 - Started @2036ms
2018-11-21 19:43:23 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-11-21 19:43:23 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2018-11-21 19:43:23 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
2018-11-21 19:43:23 INFO  AbstractConnector:278 - Started ServerConnector@26f143ed{HTTP/1.1,[http/1.1]}{0.0.0.0:4043}
2018-11-21 19:43:23 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4043.
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1bb9aa43{/jobs,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@e27ba81{/jobs/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54336c81{/jobs/job,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35e52059{/jobs/job/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62577d6{/stages,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@49bd54f7{/stages/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b5f8707{/stages/stage,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@79ab3a71{/stages/stage/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e5bfdfc{/stages/pool,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3d829787{/stages/pool/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@71652c98{/storage,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51bde877{/storage/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@60b85ba1{/storage/rdd,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@492fc69e{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@117632cf{/environment,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2fb68ec6{/environment/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@d71adc2{/executors,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3add81c4{/executors/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a1d3c1a{/executors/threadDump,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c65121{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@159e366{/static,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@49bf29c6{/,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7ee55e70{/api,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@126be319{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6c44052e{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-11-21 19:43:23 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://datasci.datasci-domain.cs.uwaterloo.ca:4043
2018-11-21 19:43:23 INFO  SparkContext:54 - Added JAR file:/u5/j9xin/Spark-CoreNLP/target/project-1.0.jar at spark://datasci.datasci-domain.cs.uwaterloo.ca:43661/jars/project-1.0.jar with timestamp 1542847403868
2018-11-21 19:43:24 INFO  RMProxy:98 - Connecting to ResourceManager at datasci.datasci-domain.cs.uwaterloo.ca/192.168.167.1:8032
2018-11-21 19:43:24 INFO  Client:54 - Requesting a new application from cluster with 5 NodeManagers
2018-11-21 19:43:24 INFO  Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (32768 MB per container)
2018-11-21 19:43:24 INFO  Client:54 - Will allocate AM container, with 896 MB memory including 384 MB overhead
2018-11-21 19:43:24 INFO  Client:54 - Setting up container launch context for our AM
2018-11-21 19:43:24 INFO  Client:54 - Setting up the launch environment for our AM container
2018-11-21 19:43:24 INFO  Client:54 - Preparing resources for our AM container
2018-11-21 19:43:25 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-11-21 19:43:27 INFO  Client:54 - Uploading resource file:/tmp/spark-43c47642-9007-4468-a402-8305e01479d3/__spark_libs__826626117090988318.zip -> hdfs://name1.datasci-domain.cs.uwaterloo.ca:8020/user/j9xin/.sparkStaging/application_1542745215044_1861/__spark_libs__826626117090988318.zip
2018-11-21 19:43:35 INFO  Client:54 - Uploading resource file:/tmp/spark-43c47642-9007-4468-a402-8305e01479d3/__spark_conf__4769921634370175314.zip -> hdfs://name1.datasci-domain.cs.uwaterloo.ca:8020/user/j9xin/.sparkStaging/application_1542745215044_1861/__spark_conf__.zip
2018-11-21 19:43:35 INFO  SecurityManager:54 - Changing view acls to: j9xin
2018-11-21 19:43:35 INFO  SecurityManager:54 - Changing modify acls to: j9xin
2018-11-21 19:43:35 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-11-21 19:43:35 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-11-21 19:43:35 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(j9xin); groups with view permissions: Set(); users  with modify permissions: Set(j9xin); groups with modify permissions: Set()
2018-11-21 19:43:35 INFO  Client:54 - Submitting application application_1542745215044_1861 to ResourceManager
2018-11-21 19:43:35 INFO  YarnClientImpl:273 - Submitted application application_1542745215044_1861
2018-11-21 19:43:35 INFO  SchedulerExtensionServices:54 - Starting Yarn extension services with app application_1542745215044_1861 and attemptId None
2018-11-21 19:43:36 INFO  Client:54 - Application report for application_1542745215044_1861 (state: ACCEPTED)
2018-11-21 19:43:36 INFO  Client:54 - 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1542847415598
     final status: UNDEFINED
     tracking URL: http://192.168.167.1:8089/proxy/application_1542745215044_1861/
     user: j9xin
2018-11-21 19:43:37 INFO  Client:54 - Application report for application_1542745215044_1861 (state: ACCEPTED)
2018-11-21 19:43:38 INFO  Client:54 - Application report for application_1542745215044_1861 (state: ACCEPTED)
2018-11-21 19:43:39 INFO  Client:54 - Application report for application_1542745215044_1861 (state: ACCEPTED)
2018-11-21 19:43:39 INFO  YarnClientSchedulerBackend:54 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> datasci.datasci-domain.cs.uwaterloo.ca, PROXY_URI_BASES -> http://datasci.datasci-domain.cs.uwaterloo.ca:8088/proxy/application_1542745215044_1861), /proxy/application_1542745215044_1861
2018-11-21 19:43:39 INFO  JettyUtils:54 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2018-11-21 19:43:40 INFO  YarnSchedulerBackend$YarnSchedulerEndpoint:54 - ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
2018-11-21 19:43:40 INFO  Client:54 - Application report for application_1542745215044_1861 (state: RUNNING)
2018-11-21 19:43:40 INFO  Client:54 - 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.167.102
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1542847415598
     final status: UNDEFINED
     tracking URL: http://192.168.167.1:8089/proxy/application_1542745215044_1861/
     user: j9xin
2018-11-21 19:43:40 INFO  YarnClientSchedulerBackend:54 - Application application_1542745215044_1861 has started running.
2018-11-21 19:43:40 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35221.
2018-11-21 19:43:40 INFO  NettyBlockTransferService:54 - Server created on datasci.datasci-domain.cs.uwaterloo.ca:35221
2018-11-21 19:43:40 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-11-21 19:43:40 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, datasci.datasci-domain.cs.uwaterloo.ca, 35221, None)
2018-11-21 19:43:40 INFO  BlockManagerMasterEndpoint:54 - Registering block manager datasci.datasci-domain.cs.uwaterloo.ca:35221 with 3.0 GB RAM, BlockManagerId(driver, datasci.datasci-domain.cs.uwaterloo.ca, 35221, None)
2018-11-21 19:43:40 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, datasci.datasci-domain.cs.uwaterloo.ca, 35221, None)
2018-11-21 19:43:40 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, datasci.datasci-domain.cs.uwaterloo.ca, 35221, None)
2018-11-21 19:43:40 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2dfa02c1{/metrics/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:44 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.167.101:56738) with ID 2
2018-11-21 19:43:44 INFO  BlockManagerMasterEndpoint:54 - Registering block manager hadoop1.datasci-domain.cs.uwaterloo.ca:43567 with 10.5 GB RAM, BlockManagerId(2, hadoop1.datasci-domain.cs.uwaterloo.ca, 43567, None)
2018-11-21 19:43:45 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.167.103:39686) with ID 1
2018-11-21 19:43:45 INFO  YarnClientSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
2018-11-21 19:43:45 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 648.0 B, free 3.0 GB)
2018-11-21 19:43:45 INFO  BlockManagerMasterEndpoint:54 - Registering block manager hadoop3.datasci-domain.cs.uwaterloo.ca:37459 with 10.5 GB RAM, BlockManagerId(1, hadoop3.datasci-domain.cs.uwaterloo.ca, 37459, None)
2018-11-21 19:43:45 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 284.0 B, free 3.0 GB)
2018-11-21 19:43:45 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on datasci.datasci-domain.cs.uwaterloo.ca:35221 (size: 284.0 B, free: 3.0 GB)
2018-11-21 19:43:45 INFO  SparkContext:54 - Created broadcast 0 from broadcast at CoreNLP.java:158
2018-11-21 19:43:45 INFO  SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/u5/j9xin/Spark-CoreNLP/spark-warehouse/').
2018-11-21 19:43:45 INFO  SharedState:54 - Warehouse path is 'file:/u5/j9xin/Spark-CoreNLP/spark-warehouse/'.
2018-11-21 19:43:45 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@71789580{/SQL,null,AVAILABLE,@Spark}
2018-11-21 19:43:45 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@23ee2ccf{/SQL/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:45 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@9d3d54e{/SQL/execution,null,AVAILABLE,@Spark}
2018-11-21 19:43:45 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2f04993d{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-11-21 19:43:45 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16f15ae9{/static/sql,null,AVAILABLE,@Spark}
2018-11-21 19:43:45 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-11-21 19:43:46 INFO  FileSourceStrategy:54 - Pruning directories with: 
2018-11-21 19:43:46 INFO  FileSourceStrategy:54 - Post-Scan Filters: 
2018-11-21 19:43:46 INFO  FileSourceStrategy:54 - Output Data Schema: struct<value: string>
2018-11-21 19:43:46 INFO  FileSourceScanExec:54 - Pushed Filters: 
2018-11-21 19:43:47 INFO  CodeGenerator:54 - Code generated in 241.781555 ms
2018-11-21 19:43:47 INFO  MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 292.9 KB, free 3.0 GB)
2018-11-21 19:43:47 INFO  MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 25.0 KB, free 3.0 GB)
2018-11-21 19:43:47 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on datasci.datasci-domain.cs.uwaterloo.ca:35221 (size: 25.0 KB, free: 3.0 GB)
2018-11-21 19:43:47 INFO  SparkContext:54 - Created broadcast 1 from javaRDD at CoreNLP.java:160
2018-11-21 19:43:47 INFO  FileSourceScanExec:54 - Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
2018-11-21 19:43:47 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-11-21 19:43:47 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-11-21 19:43:47 INFO  SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Registering RDD 5 (mapPartitionsToPair at CoreNLP.java:164)
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Got job 0 (runJob at SparkHadoopWriter.scala:78) with 5 output partitions
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Final stage: ResultStage 1 (runJob at SparkHadoopWriter.scala:78)
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at mapPartitionsToPair at CoreNLP.java:164), which has no missing parents
2018-11-21 19:43:47 INFO  MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 12.1 KB, free 3.0 GB)
2018-11-21 19:43:47 INFO  MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.5 KB, free 3.0 GB)
2018-11-21 19:43:47 INFO  BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on datasci.datasci-domain.cs.uwaterloo.ca:35221 (size: 6.5 KB, free: 3.0 GB)
2018-11-21 19:43:47 INFO  SparkContext:54 - Created broadcast 2 from broadcast at DAGScheduler.scala:1039
2018-11-21 19:43:47 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at mapPartitionsToPair at CoreNLP.java:164) (first 15 tasks are for partitions Vector(0))
2018-11-21 19:43:47 INFO  YarnScheduler:54 - Adding task set 0.0 with 1 tasks
2018-11-21 19:43:47 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2, partition 0, NODE_LOCAL, 8445 bytes)
2018-11-21 19:43:49 ERROR TransportRequestHandler:226 - Error sending result StreamResponse{streamId=/jars/project-1.0.jar, byteCount=553001604, body=FileSegmentManagedBuffer{file=/u5/j9xin/Spark-CoreNLP/target/project-1.0.jar, offset=0, length=553001604}} to /192.168.167.101:56740; closing connection
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:145)
    at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:368)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
2018-11-21 19:43:49 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2): java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

2018-11-21 19:43:49 INFO  TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 1, hadoop3.datasci-domain.cs.uwaterloo.ca, executor 1, partition 0, NODE_LOCAL, 8445 bytes)
2018-11-21 19:43:51 ERROR TransportRequestHandler:226 - Error sending result StreamResponse{streamId=/jars/project-1.0.jar, byteCount=553001604, body=FileSegmentManagedBuffer{file=/u5/j9xin/Spark-CoreNLP/target/project-1.0.jar, offset=0, length=553001604}} to /192.168.167.103:39688; closing connection
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:145)
    at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:368)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
2018-11-21 19:43:52 INFO  TaskSetManager:54 - Lost task 0.1 in stage 0.0 (TID 1) on hadoop3.datasci-domain.cs.uwaterloo.ca, executor 1: java.nio.channels.ClosedChannelException (null) [duplicate 1]
2018-11-21 19:43:52 INFO  TaskSetManager:54 - Starting task 0.2 in stage 0.0 (TID 2, hadoop3.datasci-domain.cs.uwaterloo.ca, executor 1, partition 0, NODE_LOCAL, 8445 bytes)
2018-11-21 19:43:54 ERROR TransportRequestHandler:226 - Error sending result StreamResponse{streamId=/jars/project-1.0.jar, byteCount=553001604, body=FileSegmentManagedBuffer{file=/u5/j9xin/Spark-CoreNLP/target/project-1.0.jar, offset=0, length=553001604}} to /192.168.167.103:39690; closing connection
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:145)
    at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:368)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
2018-11-21 19:43:54 INFO  TaskSetManager:54 - Lost task 0.2 in stage 0.0 (TID 2) on hadoop3.datasci-domain.cs.uwaterloo.ca, executor 1: java.nio.channels.ClosedChannelException (null) [duplicate 2]
2018-11-21 19:43:54 INFO  TaskSetManager:54 - Starting task 0.3 in stage 0.0 (TID 3, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2, partition 0, NODE_LOCAL, 8445 bytes)
2018-11-21 19:43:56 ERROR TransportRequestHandler:226 - Error sending result StreamResponse{streamId=/jars/project-1.0.jar, byteCount=553001604, body=FileSegmentManagedBuffer{file=/u5/j9xin/Spark-CoreNLP/target/project-1.0.jar, offset=0, length=553001604}} to /192.168.167.101:56744; closing connection
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:145)
    at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:368)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
2018-11-21 19:43:56 INFO  TaskSetManager:54 - Lost task 0.3 in stage 0.0 (TID 3) on hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2: java.nio.channels.ClosedChannelException (null) [duplicate 3]
2018-11-21 19:43:56 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 4 times; aborting job
2018-11-21 19:43:56 INFO  YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2018-11-21 19:43:56 INFO  YarnScheduler:54 - Cancelling stage 0
2018-11-21 19:43:56 INFO  DAGScheduler:54 - ShuffleMapStage 0 (mapPartitionsToPair at CoreNLP.java:164) failed in 8.747 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2): java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-11-21 19:43:56 INFO  DAGScheduler:54 - Job 0 failed: runJob at SparkHadoopWriter.scala:78, took 8.796281 s
2018-11-21 19:43:56 ERROR SparkHadoopWriter:91 - Aborting job job_20181121194347_0008.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2): java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
    at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:330)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:96)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
    at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:330)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop1.datasci-domain.cs.uwaterloo.ca, executor 2): java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
    ... 41 more
Caused by: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
2018-11-21 19:43:56 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-11-21 19:43:56 INFO  AbstractConnector:318 - Stopped Spark@26f143ed{HTTP/1.1,[http/1.1]}{0.0.0.0:4043}
2018-11-21 19:43:56 INFO  SparkUI:54 - Stopped Spark web UI at http://datasci.datasci-domain.cs.uwaterloo.ca:4043
2018-11-21 19:43:56 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
2018-11-21 19:43:56 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
2018-11-21 19:43:56 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor to shut down
2018-11-21 19:43:56 INFO  SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2018-11-21 19:43:56 INFO  YarnClientSchedulerBackend:54 - Stopped
2018-11-21 19:43:56 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-11-21 19:43:56 INFO  MemoryStore:54 - MemoryStore cleared
2018-11-21 19:43:56 INFO  BlockManager:54 - BlockManager stopped
2018-11-21 19:43:56 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-11-21 19:43:56 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-11-21 19:43:56 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-11-21 19:43:56 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-11-21 19:43:56 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-69182963-585e-4aea-82f9-67f5fb78cae5
2018-11-21 19:43:56 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-43c47642-9007-4468-a402-8305e01479d3
ji-xin commented 5 years ago

On the datasci cluster:

  1. sample data 1:45.46
  2. big data ~49 min
  3. big data with --num-executors 4 --executor-cores 4 ~49min

The ratio is acceptable (~30x, while the dataset is 100 times larger). However, I don't know how to enable multiple mappers.

KaisongHuang commented 5 years ago

A possible solution: manually split the big data file into several smaller files and submit multiple Spark jobs in parallel, one per file, managed by a thread pool.

What do you think?
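
For what it's worth, a rough sketch of that idea using Spark's SparkLauncher API and a fixed-size thread pool (shard names, output paths, and the argument list here are hypothetical; the master/memory settings from run.sh are omitted):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.spark.launcher.SparkLauncher;

public class ParallelCoreNLPJobs {
    public static void main(String[] args) throws Exception {
        // Hypothetical shards produced by splitting bigdata by hand.
        String[] shards = {"bigdata-part0", "bigdata-part1", "bigdata-part2", "bigdata-part3"};
        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        List<Future<Integer>> exitCodes = new ArrayList<>();
        for (String shard : shards) {
            exitCodes.add(pool.submit(() -> new SparkLauncher()
                .setAppResource("target/project-1.0.jar")
                .setMainClass("ca.uwaterloo.cs651.project.CoreNLP")
                .addAppArgs("-input", shard,
                            "-output", "output-" + shard,
                            "-functionality", "tokenize,ssplit,pos,ner,parse")
                .launch()        // starts one spark-submit process for this shard
                .waitFor()));    // block this pool thread until that job exits
        }
        for (Future<Integer> code : exitCodes) {
            code.get(); // wait for all jobs and surface any failure
        }
        pool.shutdown();
    }
}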

ji-xin commented 5 years ago

Let me try to manually designate numOfPartitions in the source code tomorrow.

Result: run with --num-executors 4 --executor-cores 4 -mappers 4: 13min57sec
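
For the record, a hedged sketch of what designating the number of partitions could look like (hypothetical names; the repo may wire -mappers differently): repartition the input RDD before the per-partition annotation so that several executor cores get work at once.

// numMappers would come from a -mappers command-line option (hypothetical).
int numMappers = 4;
JavaRDD<String> repartitioned = lines.repartition(numMappers);
// ...then apply the mapPartitions-based annotation shown earlier to repartitioned.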