JohnSnowLabs / spark-nlp


how to feed word and sentence embeddings to an MLlib classifier? #541

Closed: CyborgDroid closed this issue 4 years ago

CyborgDroid commented 5 years ago

The output of the BERT sentence embeddings is a list of floats, which is not accepted by any MLlib classification model.

The output of a VectorAssembler is a UDT (user-defined type):

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

A simple logistic regression will not take a list of floats but will accept the UDT above:

from pyspark.ml.classification import LogisticRegression
LR = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=15)
LR_model = LR.fit(train)

Besides the formatting issue with the list of floats per sentence, the output of all word embeddings (BERT and GloVe) is a list of floats per word. How can this be fed to a LogisticRegression classifier that predicts by sentence? The only word embedding model that has a sentence_embeddings output is BERT (and it's currently broken anyway; see the other ticket).

CyborgDroid commented 5 years ago

Here is the error when trying to feed sentence embeddings to LogisticRegression:

Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually array<float>.

maziyarpanahi commented 5 years ago

Our WordEmbeddings and BertEmbeddings are meant to be used by a TensorFlow graph and our BiLSTM+CNN algorithm, which is not compatible with Spark ML. For instance, Word2Vec in Spark is really naive when it comes to sentences: it simply averages the vectors of all the words in a document and outputs a single vector. It would be very helpful to be able to do document classification with GloVe, BERT, or FastText embeddings in Spark NLP in the future.
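
For reference, a minimal PySpark sketch of that averaging behavior (docs_df and the column names here are placeholders for illustration, not from this issue):

from pyspark.ml.feature import Word2Vec

# Spark ML's Word2Vec averages the word vectors of a document into one vector
word2vec = Word2Vec(vectorSize=100, minCount=1, inputCol="tokens", outputCol="doc_vector")
model = word2vec.fit(docs_df)          # docs_df needs an array<string> "tokens" column
averaged = model.transform(docs_df)    # adds one dense vector per row/document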

I'll label this as a feature request pending further research into whether we support Spark ML functions as-is (they are mostly classic machine learning algorithms at the moment) or extend the use of embeddings with our own TensorFlow graphs that already support this format (deep learning).

I've actually been working on this myself, and I have been waiting for TF 2.0 to reach at least general release before adding anything more on that side.

srowen commented 5 years ago

BTW this should be easy to fix; you just need to make a dense vector out of the float array with Vectors.dense in Spark.
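
In PySpark that could look something like the following sketch (the DataFrame and column names are illustrative, not from this issue):

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Wrap the array<float> embeddings column in a Spark ML dense vector
to_vector = udf(lambda arr: Vectors.dense(arr), VectorUDT())
train = embeddings_df.withColumn("features", to_vector("sentence_embedding"))

The resulting "features" column is the UDT that LogisticRegression and the other MLlib classifiers expect.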

maziyarpanahi commented 5 years ago

Hi @srowen Exactly! In the upcoming release from this pull request (https://github.com/JohnSnowLabs/spark-nlp/pull/638), the documentation has been updated to achieve this:

https://github.com/JohnSnowLabs/spark-nlp/blob/b95ac300d4fe9b0c6ebdb29ef774d55e672f3067/docs/en/annotators.md#sentenceembeddings

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._ // assumes a SparkSession named spark; enables the $"column" syntax

// Let's create a UDF to take an array of embeddings and output a Vector
val convertToVectorUDF = udf((matrix: Seq[Float]) => {
    Vectors.dense(matrix.toArray.map(_.toDouble))
})

// Now let's explode the sentence_embeddings column and create a new features column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
  .withColumn("features", convertToVectorUDF($"sentence_embedding"))

PS: We have a new annotator, SentenceEmbeddings, to get the sentence/document embeddings from word embeddings to feed into Spark ML/MLlib (https://github.com/JohnSnowLabs/spark-nlp/pull/638).
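
A rough sketch of how that annotator slots into a PySpark pipeline (the pooling strategy and column names here are assumptions for illustration, not quoted from the PR):

from sparknlp.annotator import SentenceEmbeddings

# Pool the per-word embeddings into one vector per document (AVERAGE or SUM)
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

Its output can then be converted to a Spark ML vector column with the UDF shown above.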

maziyarpanahi commented 4 years ago

I am closing this issue. We have now: