JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

spark-nlp in databricks writing to root s3 in cluster #14139

Closed: kavyapraveen closed this issue 1 month ago

kavyapraveen commented 7 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

We are trying to check sentence similarity between two files. Here is the code we are using:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

bertEmbeddings = BertEmbeddings \
    .pretrained("bert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([
    documentAssembler, sentence, tokenizer, bertEmbeddings, embeddingsSentence,
    embeddingsFinisher, explodeVectors, vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

# stage 8 is the fitted BucketedRandomProjectionLSH model
pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```

Using Databricks Runtime 13.2; Spark NLP was imported from the Maven repository.

Current Behavior

The package currently throws an error because it attempts a PUT call against the root S3 bucket, which is not supported:

```
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.ExceptionInInitializerError

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://***-prod-databricks-root.s3.us-east-1.amazonaws.com nvirginia-prod/2820278049549475/root/cache_pretrained/
```

Expected Behavior

The package should not throw Access Denied, or there should be a way to specify where the files can be written.

Steps To Reproduce

```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, md5, monotonically_increasing_id

# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *

# pyspark.ml imports come after the sparknlp wildcard imports so that
# pyspark's Normalizer (setInputCol/setP) shadows the Spark NLP annotator
# of the same name
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer, Normalizer, BucketedRandomProjectionLSH

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

bertEmbeddings = BertEmbeddings \
    .pretrained("bert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([
    documentAssembler, sentence, tokenizer, bertEmbeddings, embeddingsSentence,
    embeddingsFinisher, explodeVectors, vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

# stage 8 is the fitted BucketedRandomProjectionLSH model
pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```

Spark NLP version and Apache Spark

Spark 3.4.0, com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 7 months ago

Hi,

You can set any location you (your Spark app) have permission to read from and write to via the Spark NLP configuration: https://github.com/JohnSnowLabs/spark-nlp#spark-nlp-configuration

The config you need to set is `cache_folder`. By default it points to the user's home directory, and if that doesn't exist it falls back to /root. You can set it to any path with full permissions, and models will be downloaded to and loaded from there.
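
For example, on Databricks you could point the cache at a DBFS location when building the session. A minimal sketch, assuming `dbfs:/tmp/cache_pretrained` is writable in your workspace (the path is illustrative; any location your cluster can write to works):

```python
from pyspark.sql import SparkSession

# Redirect Spark NLP's pretrained-model cache away from the default
# (user home, falling back to /root) to a path the cluster can write to.
# The DBFS path below is an assumption; substitute your own location.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jsl.settings.pretrained.cache_folder", "dbfs:/tmp/cache_pretrained") \
    .getOrCreate()
```

On a Databricks cluster, where the session is created for you, the same `spark.jsl.settings.pretrained.cache_folder` key can instead be set in the cluster's Spark config under Advanced Options.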

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.