JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Wrong or missing inputCols annotators in NORMALIZER #653

Closed: ticapix closed this issue 5 years ago

ticapix commented 5 years ago

Hi,

I'm using com.johnsnowlabs.nlp-2.2.2 with spark-2.4.4 to process some articles. In those articles, there are some very long words I'm not interested in and which slow down the POS tagging a lot. I would like to exclude them after the tokenization and before the POS tagging.

I tried to write a minimal piece of code to reproduce my issue:

// Spark NLP and Spark SQL imports assumed for this snippet
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.sql.functions._
import sc.implicits._

val documenter = new DocumentAssembler().setInputCol("text").setOutputCol("document").setIdCol("id")
val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")
val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized").setLowercase(true)

val df = Seq("This is a very [useless|www.example.com] sentence").toDF("text")

val document = documenter.transform(df.withColumn("id", monotonically_increasing_id()))
val token = tokenizer.fit(document).transform(document)

// Rebuild the token column, keeping only tokens shorter than 9 characters
val token_filtered = token
  .drop("token")
  .join(token
    .select(col("id"), col("token"))
    .withColumn("tmp", explode(col("token")))
    .filter(length(col("tmp")("result")) < 9)
    .groupBy("id")
    .agg(collect_list(col("tmp")).as("token")),
    Seq("id"))
val normal = normalizer.fit(token_filtered).transform(token_filtered)

I get this error when I transform token_filtered:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in NORMALIZER_4bde2f08742a. Received inputCols: token. Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: token

It works fine if I fit and transform token directly with the normalizer.

It seems that some information is lost during the explode/groupBy/collect_list, but the schema and data look the same.
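A sketch of one way to compare the two beyond the printed schema (assuming the relevant information lives in the column-level StructField metadata rather than in the rows):

// Compare the column-level metadata Spark attaches to the "token" field;
// the data and nested struct schema can look identical while this differs.
println(token.schema("token").metadata)
println(token_filtered.schema("token").metadata)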

Any idea what is happening?

Thank you

maziyarpanahi commented 5 years ago

Yes, you are right. The token column is not just the result; it carries a lot of other metadata, and that metadata was destroyed in your groupBy. You need to apply the Normalizer before your join/groupBy. That way you work on the normalized result, which is the last stage you need here, and you won't need the metadata afterwards. PS: This also makes more sense conceptually: you should normalize first, then work on the results and do further cleaning depending on length, etc.
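Concretely, the reordering could look something like this (a sketch based on the snippet above, simply moving the Normalizer before the join/groupBy and filtering on normalized instead of token):

// Normalize first, then filter by length. The join/groupBy now only touches
// "normalized", which is the last stage needed here, so losing its column
// metadata no longer breaks a downstream annotator.
val normal = normalizer.fit(token).transform(token)

val normal_filtered = normal
  .drop("normalized")
  .join(normal
    .select(col("id"), col("normalized"))
    .withColumn("tmp", explode(col("normalized")))
    .filter(length(col("tmp")("result")) < 9)
    .groupBy("id")
    .agg(collect_list(col("tmp")).as("normalized")),
    Seq("id"))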

PS2: The next release will have a StopWordsCleaner annotator, where you can define an array of strings to remove stop words or anything else you think should be removed.
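Once it ships, usage should look roughly like this (a sketch; the exact setter names are assumptions based on later releases):

// Assumed shape of the upcoming StopWordsCleaner annotator
val stopWordsCleaner = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setStopWords(Array("this", "is", "a"))
  .setCaseSensitive(false)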

ticapix commented 5 years ago

It seems that this just postpones the error to the next stage. Initially, I had this pipeline:

    // (Same Spark NLP imports as in the earlier snippet, plus org.apache.spark.ml.Pipeline)
    val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentence = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence")
    val token = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token")
    val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized")
    val pos_tagger = PerceptronModel.load(pos_model_path.toString).setInputCols("document", "normalized").setOutputCol("pos")
    val lemmatizer = LemmatizerModel.load(lemma_model_path.toString).setInputCols("normalized").setOutputCol("lemma")
    val finisher = new Finisher().setInputCols(Array( "pos", "lemma")).setOutputCols(Array("pos", "lemma"))
    val pipeline = new Pipeline().setStages(Array(
      document,
      sentence,
      token,
      normalizer,
      pos_tagger,
      lemmatizer,
      finisher
    ))

But because I'm parsing wiki markup, I have a lot of garbage tokens which are not words (typically URLs). I would like to remove those tokens somewhere after the Tokenizer and before the POS tagger (PerceptronModel).

For the sake of the test, if I do the explode/groupBy before the POS tagger, I get

requirement failed: Wrong or missing inputCols annotators in POS_29fd848601e6. Received inputCols: document,normalized. Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: token, document

for the very same reason as before, I guess.

Is there a way to groupBy(_:*) or similar that keeps the annotators?

(The other solution is to pre-process the text outside of Spark, which is something I would like to avoid.)

maziyarpanahi commented 5 years ago

I don't think so. We need that metadata, and we provide all kinds of annotators to manipulate the output of Tokenizer before it goes to POS or NER. You have Normalizer, Lemmatizer, Stemmer, and StopWordsCleaner (soon); if there is anything else, then it's really custom and based on your use case. Your pipeline should work perfectly. If you intend to do something custom, you need to construct the metadata manually by looking into the code and following the same path we did for the specific annotator.
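For the custom route, a rough sketch of what constructing the metadata manually could look like is to copy the column metadata from the DataFrame you had before the groupBy and re-attach it to the rebuilt column (the names below are only illustrative):

// Hypothetical: copy the annotator metadata carried by the original "token"
// column and re-attach it to the column rebuilt by the explode/groupBy.
val originalMeta = token.schema("token").metadata
val tokenRepaired = token_filtered
  .withColumn("token", col("token").as("token", originalMeta))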

You can also use a UDF to clean your markup easily and then use the result as the input for your pipeline. That heavy cleaning doesn't necessarily have to be part of the pipeline (that's how I do it myself). But I'll go ahead and add cleaning HTML and XML markup as a feature request for Normalizer.
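A rough sketch of that approach (the regex and names here are only an illustration, not the actual rules needed for full wiki markup):

import org.apache.spark.sql.functions.{col, udf}

// Illustrative UDF that strips [label|link] wiki-style markup and bare URLs
// from the raw text before it reaches the DocumentAssembler.
val cleanMarkup = udf { text: String =>
  text
    .replaceAll("""\[([^|\]]*)\|[^\]]*\]""", "$1") // keep the label, drop the link
    .replaceAll("""https?://\S+|www\.\S+""", " ")  // drop bare URLs
}

val cleanedDf = df.withColumn("text", cleanMarkup(col("text")))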

ticapix commented 5 years ago

Thanks for the explanations :)

maziyarpanahi commented 4 years ago

UPDATE:

In the new release, two features were added to SentenceDetector and Tokenizer: minLength and maxLength.

You can simply filter out the tokens you don't wish to pass through the pipeline:

val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setMinLength(4)
      .setMaxLength(10)
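
Presumably the same pattern applies to the SentenceDetector, since both annotators got these parameters (a sketch; the setter names are assumed):

// Sketch: length filtering at the sentence level (assumed setters)
val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
      .setMinLength(10)
      .setMaxLength(1000)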