JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

The columns of A don't match the number of elements of x. A: 768, x: 1536 #14368

Closed maziyarpanahi closed 2 months ago

maziyarpanahi commented 3 months ago

Discussed in https://github.com/JohnSnowLabs/spark-nlp/discussions/14362

Originally posted by **SidWeng** August 8, 2024

I use the following pipeline with [BioBERT Sentence Embeddings](https://sparknlp.org/2020/09/19/sent-biobert_clinical_base_cased.html). However, it throws `The columns of A don't match the number of elements of x. A: 768, x: 1536` when executing `pipeline.fit()`. I traced the code and found that the dimension of `randMatrix` used by `BucketedRandomProjectionLSHModel` is determined by `DatasetUtils.getNumFeatures()`. Does this imply something is wrong with the data I feed into `fit()`? The data is a DataFrame with a String column `code` and a String column `text`. The longest `text` is 229 characters.

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val document_similarity_ranker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val document_similarity_ranker_finisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
  ))
```

```
24/08/08 03:19:13.581 [task-result-getter-3] WARN o.a.spark.scheduler.TaskSetManager - Lost task 7.2 in stage 10.0 (TID 370) (10.0.0.12 executor 4): org.apache.spark.SparkException: Failed to execute user defined function (LSHModel$$Lambda$5263/1056329262: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:177)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
	at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:670)
	at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
	at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 768, x: 1536
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:579)
	at org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction(BucketedRandomProjectionLSH.scala:87)
	at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
	... 22 more
```
maziyarpanahi commented 3 months ago

thanks @SidWeng, we will look into this

SidWeng commented 3 months ago

@maziyarpanahi I found the root cause, but I'm guessing it is not a bug. Please take a look: https://github.com/JohnSnowLabs/spark-nlp/discussions/14362#discussioncomment-10344195
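
A brief illustration of the dimension mismatch (the exact mechanism is an assumption based on the linked discussion; the numbers come from the error message): each BioBERT sentence embedding has 768 dimensions, so a document that the SentenceDetector splits into two sentences surfaces 2 × 768 = 1536 values to the BucketedRandomProjectionLSH stage, whose projection matrix was sized for 768 columns.

```scala
// Assumed mechanism (see the linked discussion), using the numbers from the error:
val embeddingDim  = 768                       // one BioBERT sentence embedding
val sentences     = 2                         // a document split into two sentences
val featureLength = embeddingDim * sentences  // 1536 values reach the LSH hash function
// BLAS.gemv then fails: "The columns of A don't match ... A: 768, x: 1536"
```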

danilojsl commented 3 months ago

Hi @SidWeng

Yes, that's exactly the root cause. We are working on adding a parameter to DocumentSimilarityRankerApproach to choose the aggregation method when a document has multiple sentences. I hope we can include it in the next release.
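
This is not the actual implementation, only a minimal sketch of what mean aggregation over a document's per-sentence vectors could look like; the helper `meanPool` is hypothetical and for illustration only:

```scala
// Hypothetical helper, for illustration only: mean-pools a document's per-sentence
// embeddings into a single fixed-size vector, so every document exposes the same
// feature dimension (e.g. 768) to the LSH stage regardless of its sentence count.
def meanPool(sentenceEmbeddings: Seq[Array[Float]]): Array[Float] = {
  require(sentenceEmbeddings.nonEmpty, "document has no sentence embeddings")
  val dim  = sentenceEmbeddings.head.length
  val sums = new Array[Float](dim)
  for (vec <- sentenceEmbeddings; i <- 0 until dim) sums(i) += vec(i)
  sums.map(_ / sentenceEmbeddings.length)
}
```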

maziyarpanahi commented 3 months ago

Hi @SidWeng @danilojsl

I totally missed that you are using SentenceDetector. The DocumentSimilarityRankerApproach annotator is designed to work only with document-level embeddings (one embedding per document).

Until we implement a simple averaging to put everything together, here are a few options:
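
For example, one possible workaround (sketched under the assumption that per-sentence splits are not needed downstream) is to drop the SentenceDetector and feed the document annotation directly to BertSentenceEmbeddings, so each row yields exactly one 768-dimensional embedding for the ranker. With texts of at most 229 characters this should stay well within BERT's sequence limit.

```scala
// Sketch of one possible workaround: embed the whole document (no SentenceDetector),
// so each document produces a single 768-dim vector for DocumentSimilarityRankerApproach.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("document")            // document-level input instead of "sentence"
  .setOutputCol("sentence_embeddings")

val ranker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, embeddings, ranker, document_similarity_ranker_finisher))
```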