JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

spark-nlp in databricks writing to root s3 in cluster #14139

Closed: kavyapraveen closed this issue 1 month ago

kavyapraveen commented 7 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

We are trying to check sentence similarity between two files. Here is the code we are using:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

bertEmbeddings = BertEmbeddings \
    .pretrained("bert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([
    documentAssembler, sentence, tokenizer, bertEmbeddings, embeddingsSentence,
    embeddingsFinisher, explodeVectors, vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

# stage 8 is the fitted BucketedRandomProjectionLSH model
pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```

Using Databricks Runtime 13.2; Spark NLP was imported from the Maven repository.

Current Behavior

The package currently throws an error because it attempts a PUT call against the root S3 bucket, which is not supported:

```
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.ExceptionInInitializerError

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://***-prod-databricks-root.s3.us-east-1.amazonaws.com nvirginia-prod/2820278049549475/root/cache_pretrained/
```

Expected Behavior

The package should not throw Access Denied, or there should be a way to specify where the files can be written.

Steps To Reproduce

```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, md5, monotonically_increasing_id

# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *

# pyspark.ml imports come after the sparknlp wildcard imports so that
# pyspark's Normalizer (setInputCol/setP) shadows the Spark NLP annotator
# of the same name
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer, Normalizer, BucketedRandomProjectionLSH

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

bertEmbeddings = BertEmbeddings \
    .pretrained("bert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([
    documentAssembler, sentence, tokenizer, bertEmbeddings, embeddingsSentence,
    embeddingsFinisher, explodeVectors, vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

# stage 8 is the fitted BucketedRandomProjectionLSH model
pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```

Spark NLP version and Apache Spark

Spark 3.4.0, com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 7 months ago

Hi,

You can set any location you (your Spark app) have permission to read from and write to via the Spark NLP configuration: https://github.com/JohnSnowLabs/spark-nlp#spark-nlp-configuration

The config you need to set is `cache_folder`. By default it points to the user's home directory, and if that doesn't exist it falls back to /root. You can set it to any path with full permissions, and models will be downloaded to and loaded from there.
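
For example, on Databricks you could point the cache at a DBFS location when building the session. A minimal sketch, assuming `dbfs:/tmp/cache_pretrained` is writable in your workspace (the path is illustrative; any location your cluster can write to works):

```python
from pyspark.sql import SparkSession

# Redirect Spark NLP's pretrained-model cache away from the default
# (user home, falling back to /root) to a path the cluster can write to.
# The DBFS path below is an assumption; substitute your own location.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jsl.settings.pretrained.cache_folder", "dbfs:/tmp/cache_pretrained") \
    .getOrCreate()
```

On a Databricks cluster, where the session is created for you, the same `spark.jsl.settings.pretrained.cache_folder` key can instead be set in the cluster's Spark config under Advanced Options.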

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.