Closed: SankethBK closed this issue 3 years ago.
Thanks for a complete and detailed issue. The DocumentAssembler
is the first stage of the pipeline, so its only input is the raw text coming from the DataFrame. Anything that happens after it, at the sentence or token level, has no effect on DocumentAssembler.
That being said, we had a similar issue and shipped a fix in the latest release. I can see why this might also look related to None or null values. However, there is something in the text that doesn't fit well with this line inside DocumentAssembler:
case "shrink" => text.trim.replaceAll("\\s+", " ")
You used this strategy as a cleanup. I am going to run your notebook to see what exactly in that dataset interferes with this regex and causes that error. In the meantime, you can try the other strategies (or even turn the cleanup off by using disabled) to see how it goes.
PS: In any case, those strategies shouldn't crash with an exception; they should just silently skip the row if they cannot do what they were asked to do. Thanks for reporting this.
@maziyarpanahi Thank you for the early reply. I removed the line .setCleanupMode('shrink') and it worked; the code now executes for any number of rows. But in turn it causes a small problem.
The .setContextChars in the Tokenizer is not able to remove parentheses and question marks, possibly because the space between them is no longer shrunk. I initially thought they might be getting tokenized as ' (' and ' )', but later I found cases where they are tokenized as '(' and ')' and still not removed.
Anyway, my main problem is resolved, and thank you very much for that.
Hello @SankethBK,
thank you for your work reporting this issue. We have been working on a PR to avoid NPE propagation when null text is processed by DocumentAssembler in the assemble method. In fact, the dataset you are using from Kaggle contains some corrupted rows. In the assemble method, the DocumentAssembler expects at least an empty string. When a mode other than "disabled" is applied to a null text value, the processing results in an NPE.
The Spark SQL DataFrameReader API provides three modes to avoid processing corrupted records. In this scenario,
.option("mode", "DROPMALFORMED")
should do the work.
More in details:
mode (default PERMISSIVE): controls how corrupt records are dealt with during parsing.
PERMISSIVE: sets other fields to null when it meets a corrupted record and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
DROPMALFORMED: ignores whole corrupted records.
FAILFAST: throws an exception when it meets corrupted records.
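The three modes can be simulated with a short stdlib-only sketch. This is a toy model, not Spark's actual parser: here a row counts as "malformed" simply when it doesn't have the expected number of columns.

```python
import csv
import io

def read_csv(text, mode="PERMISSIVE", n_cols=3):
    """Toy simulation of the DataFrameReader parse modes on CSV text."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == n_cols:
            rows.append(row)
        elif mode == "PERMISSIVE":
            rows.append(row + [None] * (n_cols - len(row)))  # pad with nulls
        elif mode == "DROPMALFORMED":
            continue                                         # skip the record
        elif mode == "FAILFAST":
            raise ValueError(f"malformed record: {row}")
    return rows

# The middle record is missing its text column.
data = "1,Title A,some text\n2,Title B\n3,Title C,more text\n"
read_csv(data, "DROPMALFORMED")  # keeps only the two complete rows
```

Note how PERMISSIVE is exactly the case that bites here: the short record is kept with a null in the text column, and that null later reaches DocumentAssembler. DROPMALFORMED discards the record before it can do any harm.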
Please also note that in Spark 2.3 and earlier, a CSV row is considered malformed if at least one column in the row is malformed. For example, with 3 columns, if any column lacks a proper value (say a null caused by a missing delimiter), the whole record is treated as malformed/bad. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed values in the columns actually requested from the CSV datasource; other values can be ignored. This is due to CSV parser column pruning, which is enabled by default. It might be necessary to set column pruning to false to get the earlier behavior.
The relevant changes were introduced in Spark 2.4.0 in the UnivocityParser.scala class and are linked to https://issues.apache.org/jira/browse/SPARK-25387 .
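If you do want the pre-2.4 all-columns malformedness check back, column pruning can be switched off with the configuration key below (set it before reading the CSV; this is a config fragment for a running SparkSession named spark):

```python
# Disable CSV parser column pruning (Spark >= 2.4) so a row is treated as
# malformed when any of its columns is bad, matching the pre-2.4 behavior.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
```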
I am using a news dataset from Kaggle and a Spark NLP pipeline to preprocess the data.
Link to the notebook
I got the expected output in the above notebook because I am passing only the first 1000 rows.
Here, finished_lemma is a column containing a list of tokens in each row.
+--------------------+
|      finished_lemma|
+--------------------+
|[House, Republica...|
|[Rift, Officers, ...|
|[Tyrus, Wong,, ‘B...|
|[Among, Deaths, 2...|
|[Kim, Jong-un, Sa...|
|[Sick, Cold,, Que...|
|[Taiwan’s, Presid...|
|[‘The, Biggest, L...|
|[First,, Mixtape....|
|[Calling, Angels,...|
|[Weak, Federal, P...|
|[Carbon, Capture,...|
|[Mar-a-Lago,, Fut...|
|[form, healthy, h...|
|[Turning, Vacatio...|
|[Second, Avenue, ...|
|[Dylann, Roof, Re...|
|[Modi’s, Cash, Ba...|
|[Suicide, Bombing...|
|[Fecal, Pollution...|
+--------------------+
only showing top 20 rows
At this point I am getting the error (full error message below the description).
Description
First, the above code is executed on articles1.csv (link provided above).
I figured out that if I pass only the first 1000 rows the code works fine, but if I pass the entire dataframe or more than 2500 rows I get the error.
Afterwards I executed the same code on articles2.csv and got the error even when passing only 1 row.
So I can confirm that it is not related to memory.
full error message
Expected Behavior
I am expecting a CountVectorizer to be created; it is created successfully if I pass only the first 1000 rows of articles1.csv.
Possible Solutions
I have looked at some issues in this repository with the same error messages while using different functions; most of them had null values in their dataframe. As I said, if I run the same code on articles2.csv it gives the error even if I pass a single row. I manually checked whether the list was empty and verified that the list of tokens was not empty.
I can confirm that there is no "None" value in the list of tokens.
I suspect that something in my list of tokens is getting converted to a null value, but I am not sure what it is.
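One way to check for the kind of hidden null described above is to scan the raw CSV for rows whose text field is missing or blank before it ever reaches the pipeline. The sketch below is stdlib-only and the column name "content" is an assumption about the dataset's schema, not verified against the actual Kaggle files.

```python
import csv
import io

def find_empty_rows(csv_text, text_col):
    """Return the 0-based indices of rows whose text column is missing or blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader)
            if not (row.get(text_col) or "").strip()]

# Tiny inline sample standing in for the real CSV file.
sample = "id,content\n1,hello world\n2,\n3,  \n4,ok\n"
find_empty_rows(sample, "content")  # -> [1, 2]
```

A row like index 2 here (whitespace only) would look non-empty to a naive check but still collapses to nothing, which is consistent with a "no None values, yet something becomes null" symptom.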
Steps to Reproduce
I know this error can be reproduced by intentionally using "None" as one of the values in the list of tokens, but that is not the case here, as I confirmed that I don't have any "None" values in my list of tokens.
Context
I was trying to analyze the topics present in a large collection of documents.
I tried multiple versions of sparknlp, including 2.6.4.
Your Environment
Spark NLP version: 2.6.4, 2.6.2, 2.2.2, ... ,
Apache SPARK version: 2.4.4
Java version (java -version): 1.8.0
Setup and installation (Pypi, Conda, Maven, etc.):
Startup for sparknlp used: I am running sparknlp in Jupyter via pyspark --packages JohnSnowLabs:spark-nlp:2.6.4
Operating System and version: Ubuntu 18.04
I can provide any further details.