Yes, you are right. The column token is not just the result; it carries a lot of other metadata, which was destroyed in your groupBy. You need to use the Normalizer before your join/groupBy. That way you use the normalized result there, which is the last stage, and you won't be needing the metadata afterwards.
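To make that concrete, the token column is an array of annotation structs (the result string plus begin/end offsets, the annotator type, and a metadata map), not just an array of strings. A quick way to see this is printSchema; a sketch, assuming annotated is the DataFrame produced by the fitted pipeline:
annotated.select("token").printSchema()
// prints something like:
// root
//  |-- token: array (of structs with annotatorType, begin, end, result, metadata, ...)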
PS: This makes more sense anyway: you should normalize first, then work on the results and do more cleaning depending on the length, etc.
PS2: We will be adding a StopWordsCleaner annotator in the next release; there you can also define an array of strings to remove stop words or anything else you think should be removed.
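For reference, a rough sketch of how such an annotator would plug in after the Tokenizer; the import path and setter names are assumptions, since it isn't released yet:
// Hypothetical sketch of the upcoming StopWordsCleaner; names are assumed.
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner  // assumed package, mirroring the other annotators

val stopWords = new StopWordsCleaner()
  .setInputCols("normalized")                   // or "token", whichever stage you want to filter
  .setOutputCol("cleanTokens")
  .setStopWords(Array("http", "https", "www"))  // strings you want removed
  .setCaseSensitive(false)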
It seems that this is just postponing the error to the next stage. Initially, I have this pipeline:
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence")
val token = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token")
val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized")
val pos_tagger = PerceptronModel.load(pos_model_path.toString).setInputCols("document", "normalized").setOutputCol("pos")
val lemmatizer = LemmatizerModel.load(lemma_model_path.toString).setInputCols("normalized").setOutputCol("lemma")
val finisher = new Finisher().setInputCols(Array( "pos", "lemma")).setOutputCols(Array("pos", "lemma"))
val pipeline = new Pipeline().setStages(Array(
document,
sentence,
token,
normalizer,
pos_tagger,
lemmatizer,
finisher
))
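The pipeline is then fit and applied in the usual Spark ML way; a sketch, with articles standing in for the input DataFrame that has a text column:
// Sketch: fitting and applying the pipeline; "articles" is an assumed DataFrame with a "text" column.
val model = pipeline.fit(articles)
val annotated = model.transform(articles)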
But, because I'm parsing wiki markup, I have a lot of garbage tokens which are not words (typically URLs). I would like to remove those tokens somewhere after the Tokenizer and before the POS tagger (PerceptronModel).
For the sake of the test, if I do the explode/groupBy before the POS tagger, I get

requirement failed: Wrong or missing inputCols annotators in POS_29fd848601e6. Received inputCols: document,normalized. Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: token, document

for the very same reason as before, I guess.
Is there a way to groupBy(_:*) or something similar that keeps the annotator metadata?
(The other solution is to pre-process the text outside of Spark, which is something I would like to avoid.)
I don’t think so; we need that metadata, and we provide all kinds of annotators to manipulate the output of the Tokenizer before it goes to POS or NER. You have Normalizer, Lemmatizer, Stemmer, and StopWordsCleaner (soon); if there is anything else, then it’s really custom and based on your use case. Your pipeline should work perfectly. If you intend to do something custom, you need to construct the metadata manually by looking into the code and following the same path as we did for any specific annotator.
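If you do go the custom route, one possible direction is to copy the original column metadata back onto the rebuilt column. This is an untested sketch; it assumes the failure is about the Spark column metadata that the annotators validate, and that regrouped is the DataFrame coming out of your explode/groupBy:
import org.apache.spark.sql.functions.col

// Untested sketch: re-attach the annotator column metadata that groupBy/collect_list drops.
// "annotated" is the DataFrame before regrouping, "regrouped" the one after.
val normalizedMeta = annotated.schema("normalized").metadata
val restored = regrouped.withColumn(
  "normalized",
  col("normalized").as("normalized", normalizedMeta)
)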
You can also use a UDF, clean your markup easily, and then use the result as the input for your pipeline. The heavy cleaning doesn’t necessarily have to be part of the pipeline (that’s how I do it myself). But I’ll go ahead and add cleaning HTML and XML markup as a feature request for the Normalizer.
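For example, a minimal sketch of that kind of pre-cleaning outside the pipeline; the regexes are only illustrative, and articles stands in for the raw input DataFrame:
import org.apache.spark.sql.functions.{col, udf}

// Illustrative UDF that strips URLs and simple wiki/HTML markup before the DocumentAssembler.
val cleanMarkup = udf { raw: String =>
  Option(raw).map { text =>
    text
      .replaceAll("https?://\\S+", " ")                       // drop URLs
      .replaceAll("<[^>]+>|\\[\\[|\\]\\]|\\{\\{|\\}\\}", " ") // drop basic HTML/wiki markup
  }.orNull
}

val cleaned = articles.withColumn("text", cleanMarkup(col("text")))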
Thanks for the explanations :)
UPDATE:
In the new release there are two parameters added to SentenceDetector and Tokenizer: minLength and maxLength. You can simply filter out the tokens you don't wish to pass through the pipeline:
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setMinLength(4)
  .setMaxLength(10)
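The same pair of parameters applies to the SentenceDetector, per the release note above; a sketch, assuming identical setter names:
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setMinLength(5)
  .setMaxLength(500)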
Hi,
I'm using com.johnsnowlabs.nlp-2.2.2 with spark-2.4.4 to process some articles. In those articles, there are some very long words I'm not interested in, and they slow down the POS tagging a lot. I would like to exclude them after the tokenization and before the POS tagging. I tried to write the smallest code that reproduces my issue.
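Roughly, the filtering step looks like this sketch (illustrative column names, not the exact reproduction code); tokenized is assumed to be the tokenizer output with an id column:
import org.apache.spark.sql.functions.{col, collect_list, explode, length}

// Sketch: explode the token annotations, drop the long ones, and rebuild the array.
val token_filtered = tokenized
  .withColumn("tok", explode(col("token")))
  .filter(length(col("tok.result")) <= 10)
  .groupBy("id")
  .agg(collect_list(col("tok")).as("token"))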
I have this error when I transform token_filtered. It works fine if I directly fit and transform token in normalizer.
It seems that during the explode/groupBy/collect_list, some information is lost, but the schema and data look the same. Any idea what is happening?
Thank you