Closed: chengyineng38 closed this issue 2 years ago
Hi,
The `inputCols` of each annotator must match the exact number and annotatorType it expects. You have to find a way to either explode your text columns into rows (with unique ids so you can easily identify them later) or go column by column. It is not possible to have multiple text columns in Spark NLP: there has to be one text column, and everything starts with DocumentAssembler. (The scaling/distribution actually happens across the number of rows, not the number of columns.)
PS: you can slice your DataFrame into multiple DataFrames, each with one text column, call .transform on each, and then merge/union them all together. That's just one possible option.
Thank you!
@maziyarpanahi Follow-up on your PS note above: Which .transform method are you referring to? I assume the merge/union is to then combine the embeddings from different columns into one column -- in that case, how would concatenating different columns into one be different? In my case, one of the columns is actually an integer column, containing sector_id.
For context, ClassifierDL ( https://nlp.johnsnowlabs.com/docs/en/training#classifierdlapproach) is the API I am hoping to use here.
It depends on what exactly you are trying to do with multiple columns. If these are isolated/separate text columns, each of which can be transformed through the pipeline stages so that at the end you get an outputCol for each column of the DataFrame, then yes, slicing the DataFrame by selecting only that column and then unioning the results makes sense.
But you are talking about embeddings that will be computed for each col and used in their downstream tasks, so there is no need to concat anything. Each column will be processed: embeddings are generated, they go through the classification model, and you get the result for each col. If the columns are related, you can just concat the text into 1 col before all of this and use that.
My task is more related to your second paragraph -- embeddings will be generated from each column (3 columns in total), and these 3 columns (2 text and 1 integer) should all feed into ClassifierDL. I am confused by what you said about "Each column will be processed, embeddings, go through the classification model" -- since DocumentAssembler starts with one column, why don't we need to concat all the columns for the downstream processing? It sounds like there's no choice but to concat? Greatly appreciate your time helping to address my gap in understanding here. Thanks!
```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)
```
I think what I suggested at the end is what you are looking for. The best option is to concatenate your two text columns (we also call this enriching the training data) and use that for training; this way your model has seen more data during training.
I would make a test dataset, train a model on the two merged columns, and evaluate it on the test dataset. Then train separately on each column, which gives you 2 models, and evaluate those 2 as well. Then you can be 100% sure whether merging your two textual columns was the right decision.
Thanks, @maziyarpanahi, for taking the time to respond! We will try concatenating those 2 text columns, but this also confirms that the Spark NLP API does not accommodate a pipeline with different feature types. E.g. if I have a movie-rating classification project with reviews (text), movie duration (a number), and production budget ($$), it would be hard to pass all of these features into a Spark NLP pipeline.
@chengyineng38 that's true. I think there is a feature assembler in the licensed library that does this, but not in the open-source one. Maybe that can be a feature request so we can put it on the roadmap, or someone can contribute it. (This is different from accepting more than 2 text columns; for feature engineering, we could add all the complementary columns as an array and modify the final embeddings based on those features.)
Okay, I will submit a feature request to allow variable feature types in the open-source implementation! Thanks again!
Is it possible to have 2 input layers and concatenate them both in Spark NLP, just as with the tf.keras functional API? There doesn't seem to be documentation on this. The use case is to pass 2 different text columns into a classifier.