Implementation takes really long to give output

databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark

GNU General Public License v3.0

422 stars, 120 forks

Implementation takes really long to give output #22

Open raviranjan-innoplexus opened 7 years ago

raviranjan-innoplexus commented 7 years ago

Hi, I am using this library but am getting extremely slow results. For 10k records of text, it has taken longer than 16 hours to process 160 of 1920 tasks after repartitioning. I am wondering whether the name extraction runs in parallel, or whether the executors queue up one after another for named entity recognition. Non-parallel Python scripts seem to work faster than this. Any suggestions or workarounds would be highly appreciated.
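For scale, the figures quoted above can be extrapolated with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the remaining tasks run at the same rate as the first 160 and that the 10k records are spread evenly across tasks, neither of which is confirmed in the report. The variable names are illustrative.

```python
# Figures quoted in the report above.
tasks_done = 160
tasks_total = 1920
hours_elapsed = 16      # reported as "longer than 16 hours", so this is a lower bound
records_total = 10_000

# Assumption: every task takes about as long as the completed ones did.
projected_total_hours = hours_elapsed * tasks_total / tasks_done
projected_days = projected_total_hours / 24

# Assumption: records are spread evenly across tasks.
records_per_hour = records_total / projected_total_hours

print(f"Projected total run time: at least {projected_total_hours:.0f} h ({projected_days:.0f} days)")
print(f"Effective throughput: at most {records_per_hour:.1f} records/hour")
```

Under those assumptions the job would need at least 192 hours (8 days) for 10k records, or roughly 52 records per hour across the whole cluster.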

semantiDan commented 6 years ago

I'm experiencing the same extreme slowness when benchmarking it against NLTK (VADER) and Spark NLP (John Snow Labs).

For 1 million rows of sentiment analysis:

Spark NLP (John Snow Labs 1.6.3) finishes the job in 4 min 30 secs.
NLTK (VADER) finishes the job in 6 min 30 secs.
Stanford CoreNLP never finishes the job (still running after more than 1 hour).
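To put these timings on a common footing, they can be converted to rows per second. This is a rough sketch based only on the numbers above. Because the Stanford CoreNLP run did not finish, the 1-hour mark gives only an upper bound on its throughput and a lower bound on the slowdown. The dictionary keys are illustrative labels.

```python
rows = 1_000_000

# Wall-clock durations reported above, in seconds.
durations_s = {
    "john_snow": 4 * 60 + 30,   # 4 min 30 secs
    "nltk_vader": 6 * 60 + 30,  # 6 min 30 secs
}

throughput = {name: rows / secs for name, secs in durations_s.items()}

# The Stanford CoreNLP job was still unfinished after 1 hour,
# so its throughput is below this bound.
stanford_upper_bound = rows / 3600

# Lower bound on how much slower it is than the fastest run.
min_slowdown = 3600 / min(durations_s.values())

for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} rows/s")
print(f"stanford_corenlp: fewer than {stanford_upper_bound:.0f} rows/s")
print(f"slowdown vs fastest: at least {min_slowdown:.1f}x")
```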