Hi,
Here is what I did in Python/Jupyter (same specs as yours):
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.ml import Pipeline
import sparknlp
#make sure there is no other SparkSession with fewer resources already running
spark = sparknlp.start()
documentAssembler = DocumentAssembler() \
    .setInputCol('reviewText') \
    .setOutputCol('document')

tokenizer = RegexTokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token') \
    .setPattern("\\W") \
    .setToLowercase(True)

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(['token']) \
    .setOutputCol('clean') \
    .setCaseSensitive(False)

finisher = Finisher() \
    .setInputCols(['clean'])

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        stopwords_cleaner,
        finisher
    ])
Since Spark NLP has a RegexTokenizer with the same pattern and lowercase parameters, I used it in place of the Tokenizer and Normalizer annotators (for comparison, the two-annotator version is sketched below).
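For reference, a minimal sketch of the two-annotator version that the single RegexTokenizer replaces; the 'normalized' column name and the cleanup pattern are my assumptions, chosen to mirror the settings above:
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')
normalizer = Normalizer() \
    .setInputCols(['token']) \
    .setOutputCol('normalized') \
    .setLowercase(True) \
    .setCleanupPatterns(["\\W"])  # assumption: mirrors the "\\W" pattern above
# downstream stages (StopWordsCleaner, Finisher) would then read 'normalized'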
Also, csv, json, gz, etc. are not splittable formats in Apache Spark: they won't be distributed over all the cores and machines, and they end up in a single partition. So I read the csv and converted it to parquet, which distributes well even on a single machine:
Toys = spark.read \
    .options(header=True, delimiter=',', inferSchema=True, mode='DROPMALFORMED') \
    .csv(data_path)
Toys.write.parquet("./toys-cleaned")
When working with csv in Apache Spark / PySpark, setting mode='DROPMALFORMED' is very important: corrupted rows would otherwise cause problems downstream.
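If you want to know how many rows that mode actually drops, rather than discarding them silently, one quick sketch (reusing data_path from above) is to compare row counts between the two read modes:
# PERMISSIVE keeps malformed rows (with null fields), DROPMALFORMED removes them
permissive = spark.read \
    .options(header=True, delimiter=',', inferSchema=True, mode='PERMISSIVE') \
    .csv(data_path)
dropped = spark.read \
    .options(header=True, delimiter=',', inferSchema=True, mode='DROPMALFORMED') \
    .csv(data_path)
print(permissive.count() - dropped.count(), "malformed rows")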
The following code took 5 minutes on a 6-core laptop:
from pyspark.sql.functions import rank, col, explode, count
Toys = spark.read \
    .parquet('./toys-cleaned') \
    .repartition(12)
out = pipeline.fit(Toys).transform(Toys)
all_words = out.select(explode("finished_clean").alias("word"))
# group by, sort, and limit to 50k
top50k = all_words \
    .groupBy("word") \
    .agg(count("*").alias("total")) \
    .sort(col("total").desc()) \
    .limit(50000)
top50k.show()
It is possible to optimize this further by using minLength and maxLength in RegexTokenizer if you are not interested in tokens shorter or longer than a certain length (a sketch follows below). Also, keep in mind that the transformers in Spark ML's feature package are very basic/simple transformations, without the annotations and metadata that can be reused for other NLP tasks.
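For example, a minimal sketch that drops one-character tokens and anything longer than 20 characters; the specific bounds here are my assumptions:
tokenizer = RegexTokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token') \
    .setPattern("\\W") \
    .setToLowercase(True) \
    .setMinLength(2) \   # assumption: keep only tokens of 2+ characters
    .setMaxLength(20)    # assumption: discard very long tokens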
I also noticed the dataset is somewhat skewed, meaning some partitions are much, much heavier than the others: while most of them finish quickly, the job still has to wait for the heavy ones to complete.
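One way to see that skew (a diagnostic sketch, not part of the pipeline) is to count rows per partition:
from pyspark.sql.functions import spark_partition_id
Toys.groupBy(spark_partition_id().alias("partition")) \
    .count() \
    .sort("partition") \
    .show(12)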
Hi @jczestochowska
I have found some new information. I will investigate more as to why, but for some unknown reason, if you run the very same pipeline I showed previously on pyspark==3.0.2, it goes from 5 minutes to 55 seconds!
There must be some Spark config set by default in 3.1.x compared to 3.0.x (or vice versa) that causes this, but I will investigate more to see why Apache Spark 3.1.x performs poorly compared to Apache Spark 3.0.x.
In Spark 3.1.x the executors look like this: (screenshot)
In Spark 3.0.x the executors look like this: (screenshot)
Clearly, there is something in Spark 3.1.x that causes these unbalanced partitions, perhaps some sort of automatic partitioning. A sketch for starting that comparison follows.
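As a starting point for that investigation, a sketch that prints a few likely-relevant SQL configs; the specific keys are my guesses at possible culprits, not a confirmed diagnosis. Run it on both PySpark versions and diff the output:
print(spark.version)
for key in ("spark.sql.adaptive.enabled",
            "spark.sql.adaptive.coalescePartitions.enabled",
            "spark.sql.shuffle.partitions"):
    print(key, "=", spark.conf.get(key, "<unset>"))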
Hi @maziyarpanahi,
thanks so much for the tips and investigation!
Very good, useful post.
Description
I have a dataset of around 2 million Amazon reviews, and I want to count the most frequent words. For that I am tokenizing and removing stop words. I wanted to use spark-nlp to build a more sophisticated pipeline for later stages, but even this simple one is not working for me. On the other hand, an equivalent (?) pipeline in plain Spark works correctly. Note that when I call
out.show()
on the spark-nlp pipeline output, it shows me a correctly tokenized list of words.
Expected Behavior
Pipeline should clean the dataset and count most frequent words
Current Behavior
Pipeline freezes
Possible Solution
No idea
Steps to Reproduce
Plain Spark pipeline - working (a sketch follows below)
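The original snippet is not captured here; below is a minimal sketch of what such a plain-Spark pipeline typically looks like. The reviewText column name is taken from the pipeline above; df stands for the reviews DataFrame, and everything else is assumed:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

tokenizer = RegexTokenizer(inputCol='reviewText', outputCol='token',
                           pattern='\\W', toLowercase=True)
remover = StopWordsRemover(inputCol='token', outputCol='clean')
# df: your reviews DataFrame (assumed)
out = Pipeline(stages=[tokenizer, remover]).fit(df).transform(df)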
Spark NLP pipeline - not working
Context
Trying to clean data with spark-nlp and perform some analysis, on a later stage I would like to use spark-nlp to process data for some classification task.
Your Environment
sparknlp.version(): spark-nlp==3.0.1
spark.version: pyspark==3.1.1
java -version: openjdk version "1.8.0_282"; OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08); OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)