JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

functional API in SparkNLP? #6880

Closed chengyineng38 closed 2 years ago

chengyineng38 commented 2 years ago

Is it possible to have 2 input layers and concatenate them in SparkNLP, as in the tf.keras functional API? There doesn't seem to be any documentation on this. The use case is to pass 2 different text columns into a classifier.

maziyarpanahi commented 2 years ago

Hi,

The inputCols of each annotator must match the exact number and the exact annotatorType that the annotator expects. You have to either explode your text columns into rows (with unique ids so you can easily identify them later) or go column by column. It is not possible to have multiple text columns in Spark NLP: there has to be 1 text column, and it all starts with DocumentAssembler. The scaling/distribution happens over the number of rows, not the number of columns.

PS: you can slice your DataFrame into multiple DataFrames, each with 1 text column, call .transform on each, and then merge/union them all together. That is just one possible option.
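To illustrate the "explode your text columns into rows" idea, here is a minimal sketch in plain Python, independent of Spark. The function name, the record keys (`row_id`, `source_col`, `text`) and the sample data are all made up for illustration; they are not part of Spark NLP.

```python
from typing import Any, Dict, List


def explode_text_columns(
    rows: List[Dict[str, Any]], id_col: str, text_cols: List[str]
) -> List[Dict[str, Any]]:
    """Turn every (row, text column) pair into its own record.

    Each output record has a single "text" field, plus the original row id
    and the name of the column the text came from, so results can be traced
    back after processing. All key names here are hypothetical.
    """
    exploded = []
    for row in rows:
        for col in text_cols:
            exploded.append(
                {"row_id": row[id_col], "source_col": col, "text": row[col]}
            )
    return exploded


# Hypothetical sample data with two text columns
rows = [
    {"id": 1, "title": "Oil prices rise", "body": "Crude futures climbed on Monday."},
    {"id": 2, "title": "New phone launch", "body": "The handset ships next month."},
]
long_rows = explode_text_columns(rows, "id", ["title", "body"])
```

In PySpark, the same shape could presumably be obtained by selecting each text column under a common name and combining the pieces with `DataFrame.unionByName`, which is the slice-and-union option mentioned in the PS.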

chengyineng38 commented 2 years ago

Thank you!

chengyineng38 commented 2 years ago

@maziyarpanahi Follow-up on your PS note above: Which .transform method are you referring to? I assume the merge/union is to then combine the embeddings from different columns into one column -- in that case, how would concatenating different columns into one be different? In my case, one of the columns is actually an integer column, containing sector_id.

For context, ClassifierDL (https://nlp.johnsnowlabs.com/docs/en/training#classifierdlapproach) is the API I am hoping to use here.

maziyarpanahi commented 2 years ago

It depends on what exactly you are trying to do with multiple columns. Are these isolated/separate text columns, each of which can be transformed through the pipeline stages so that at the end you have an outputCol for each column of the DataFrame? In that case, yes, slicing the DataFrame by selecting only that column and then taking the union of the results makes sense.

But you are talking about embeddings that will be computed for each column and used in downstream tasks, so there is no need to concatenate anything. Each column will be processed, turned into embeddings, and passed through the classification model, and you get a result for each column. If the columns are related, you can just concatenate the text into 1 column before all this and use that.
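As a rough sketch of "concatenate the text before all this into 1 col", the plain-Python helper below joins several text fields into a single string. It also shows one possible way to carry an integer column such as sector_id along as a `name=value` token; that encoding is an assumption for illustration only, and whether a sentence encoder makes good use of such a token would need to be tested. The function and field names are hypothetical.

```python
from typing import Any, Dict, Sequence


def build_text(
    row: Dict[str, Any],
    text_cols: Sequence[str],
    extra_cols: Sequence[str] = (),
    sep: str = " ",
) -> str:
    """Join the non-empty text columns, then append extra columns as tokens."""
    # Skip missing or empty text values so no stray separators appear
    parts = [str(row[c]).strip() for c in text_cols if row.get(c)]
    # Keep non-text values (including 0) as "name=value" tokens
    parts += [f"{c}={row[c]}" for c in extra_cols if row.get(c) is not None]
    return sep.join(parts)


# Hypothetical row with two text columns and an integer sector_id
row = {
    "headline": "Oil prices rise",
    "summary": "Crude futures climbed.",
    "sector_id": 12,
}
text = build_text(row, ["headline", "summary"], ["sector_id"])
```

On a Spark DataFrame the equivalent step would normally use `pyspark.sql.functions.concat_ws` to write the combined string into the single column that DocumentAssembler reads.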

chengyineng38 commented 2 years ago

My task is more related to your second paragraph -- embeddings will be generated based on each column (3 columns in total) and these 3 columns (2 text and 1 integer) should all be inputs to the ClassifierDL. I am confused by what you said about "Each column will be processed, embeddings, go through the classification model" -- since the DocumentAssembler starts with one column, why don't we need to concat all the columns to allow the downstream processing? It sounds like there is no choice but to concat. I greatly appreciate your time in helping to address my gap in understanding here. Thanks!

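from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# Turn the raw "text" column into document annotations
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Compute one sentence embedding per document
useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# Train a classifier on the sentence embeddings
docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

maziyarpanahi commented 2 years ago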

I think what I suggested at the end is what you are looking for. The best option is to concatenate your two text columns (we also call this enriching the training data) and train on the result, so your model sees more data during training.

I would make a test dataset, train a model on the two merged columns, and evaluate it on the test dataset. Then train separately on each column, which gives you 2 more models, and evaluate those 2 as well. That way you can be 100% sure whether merging your two text columns was the right decision.

chengyineng38 commented 2 years ago

Thanks, @maziyarpanahi, for taking the time to respond! We will try concatenating those 2 text columns, but this does also confirm that the SparkNLP API does not accommodate a pipeline with different feature types. E.g. if I have a movie rating classification project with reviews (text), movie duration (time in numbers), and production budget ($$), it would be hard to pass all these features into a SparkNLP pipeline.

maziyarpanahi commented 2 years ago

@chengyineng38 that's true. I think there is a feature assembler in the licensed library to do this, but not in the open-source one. Maybe that can be a feature request, so we can put it on the roadmap or someone can contribute it. (This is different from accepting more than 2 text columns. For feature engineering, we can add all the complementary columns as an array and modify the latest embeddings based on those features.)
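One simple reading of "add the complementary columns as an array and modify the embeddings" is to append scaled numeric features to the end of the embedding vector. The plain-Python sketch below shows only that idea. It is not the licensed feature assembler, and the function names, feature names, and value ranges are all assumptions made up for the movie example.

```python
from typing import Dict, List, Sequence, Tuple


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1] given an assumed (lo, hi) range."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def append_features(
    embedding: Sequence[float],
    features: Dict[str, float],
    ranges: Dict[str, Tuple[float, float]],
) -> List[float]:
    """Return the embedding followed by the scaled numeric features.

    Features are appended in the order of `ranges`, so every row ends up
    with the same vector layout.
    """
    scaled = [min_max_scale(features[name], *ranges[name]) for name in ranges]
    return list(embedding) + scaled


# Hypothetical values: a tiny 3-dimensional "embedding" plus two numeric columns
embedding = [0.1, -0.2, 0.3]
features = {"duration_min": 120, "budget_usd": 50_000_000}
ranges = {"duration_min": (60, 180), "budget_usd": (0, 200_000_000)}
combined = append_features(embedding, features, ranges)
```

Scaling matters here because raw values such as a budget in dollars would otherwise dwarf the embedding components, which are typically small.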

chengyineng38 commented 2 years ago

Okay, I will submit a feature request to allow variable feature types in the open-source implementation! Thanks again!