Constannnnnt / Distributed-CoreNLP

This infrastructure, built in Java on Stanford CoreNLP, MapReduce, and Spark, aims to annotate documents at large scale.

https://github.com/Constannnnnt/Distributed-CoreNLP

MIT License


apache-spark big-data java mapreduce mapreduce-java natural-language-processing nlp nlp-parsing spark stanford-corenlp


Distributed-CoreNLP

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark, aims to annotate documents at large scale.

Build with Maven

Make sure you have Maven installed; details are available at https://maven.apache.org/
Then run mvn clean package in the Spark-CoreNLP directory. It should build the jar file target/project-1.0.jar

Run with MapReduce

Run a job using the following command:

hadoop jar target/project-1.0.jar ca.uwaterloo.cs651.MapReduce.CoreNLPMapReduce -input ${input path} -output ${output path} -functionality ${func1,func2,func3,...}
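
The flags above can be read as "-name value" pairs. As a minimal sketch of how such a command line might be turned into a CoreNLP configuration, the Java class below parses the pairs and copies the comma-separated -functionality list into the annotators property that Stanford CoreNLP pipelines read. The class JobArgs and its methods are hypothetical and are not taken from this project; it is also an assumption that the -functionality values correspond directly to CoreNLP annotator names such as tokenize, ssplit, and pos.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of Distributed-CoreNLP. */
final class JobArgs {

    /** Parses "-name value" pairs into a map keyed by the name without the dash. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Unexpected argument: " + args[i]);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + args[i]);
            }
            options.put(args[i].substring(1), args[i + 1]);
        }
        return options;
    }

    /**
     * Builds CoreNLP-style properties from a comma-separated list.
     * Assumption: each entry is a valid CoreNLP annotator name.
     */
    static Properties toCoreNlpProperties(String functionality) {
        // Normalize whitespace around the commas, e.g. "tokenize, ssplit" -> "tokenize,ssplit".
        String annotators = String.join(",", functionality.trim().split("\\s*,\\s*"));
        Properties props = new Properties();
        props.setProperty("annotators", annotators);
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(args);
        // The fallback list is an illustrative default, not one defined by the project.
        String functionality = options.getOrDefault("functionality", "tokenize,ssplit");
        Properties props = toCoreNlpProperties(functionality);
        System.out.println("input=" + options.get("input") + " output=" + options.get("output"));
        System.out.println("annotators=" + props.getProperty("annotators"));
    }
}
```

With the CoreNLP library on the classpath, properties built this way could be passed to new StanfordCoreNLP(props) inside a mapper; that step is left out here so the sketch stays free of external dependencies.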

Run with Spark

Run a job using the following command:

spark-submit --class ca.uwaterloo.cs651.project.CoreNLP --num-executors ${num of mappers} --executor-cores ${num of mappers} --conf spark.executor.heartbeatInterval=10s --conf spark.network.timeout=20s --driver-memory 6G --executor-memory 20G target/project-1.0.jar -input ${input path} -output ${output path} -mappers ${num of mappers} -functionality ${func1,func2,func3,...}
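
Because the spark-submit line is long, it can help to see its parts assembled one at a time. The sketch below builds the same argument list in Java, reusing the class name, configuration values, memory settings, and jar path from the command above. The class SparkSubmitCommand and its build method are hypothetical names introduced only for illustration, and the sketch follows the command in using the number of mappers for both --num-executors and --executor-cores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of Distributed-CoreNLP. */
final class SparkSubmitCommand {

    /** Assembles the spark-submit arguments in the order used by the README command. */
    static List<String> build(int mappers, String input, String output, String functionality) {
        String n = Integer.toString(mappers);
        List<String> cmd = new ArrayList<>();
        // Spark-side options: main class, parallelism, timeouts, memory.
        Collections.addAll(cmd,
                "spark-submit",
                "--class", "ca.uwaterloo.cs651.project.CoreNLP",
                "--num-executors", n,
                "--executor-cores", n,
                "--conf", "spark.executor.heartbeatInterval=10s",
                "--conf", "spark.network.timeout=20s",
                "--driver-memory", "6G",
                "--executor-memory", "20G",
                "target/project-1.0.jar");
        // Application-side options, passed through to the job itself.
        Collections.addAll(cmd,
                "-input", input,
                "-output", output,
                "-mappers", n,
                "-functionality", functionality);
        return cmd;
    }

    public static void main(String[] args) {
        // Sample values are placeholders; substitute real paths and annotator names.
        List<String> cmd = build(4, "input-dir", "output-dir", "tokenize,ssplit");
        System.out.println(String.join(" ", cmd));
    }
}
```

A list built this way could be handed to java.lang.ProcessBuilder to launch the job from another program, provided spark-submit is on the PATH.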