Closed kavyapraveen closed 1 month ago
Hi,

You can point Spark NLP at any location your Spark app has permission to read and write via the Spark NLP configuration: https://github.com/JohnSnowLabs/spark-nlp#spark-nlp-configuration

The config you need to set is `cache_folder`. By default it points to the user's home directory, falling back to `/root` if that doesn't exist. Set it to a path with full read/write permission and models will be downloaded to and loaded from there.
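For example, the cache folder can be set when building the Spark session. A minimal sketch; the config key is the one documented in the Spark NLP README, while `dbfs:/tmp/cache_pretrained` is just a hypothetical placeholder path, so substitute any location your cluster can write to:

```python
# Sketch: redirect Spark NLP's pretrained cache to a writable location.
# "dbfs:/tmp/cache_pretrained" is a hypothetical example path -- use any
# path your Spark app has full read/write permission for.
conf_key = "spark.jsl.settings.pretrained.cache_folder"
cache_folder = "dbfs:/tmp/cache_pretrained"

# Applied when building the SparkSession, e.g.:
#   spark = (SparkSession.builder
#            .appName("Spark NLP")
#            .config(conf_key, cache_folder)
#            .getOrCreate())
```

On Databricks the same key/value pair can instead be set in the cluster's Spark config before starting the cluster.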
This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Is there an existing issue for this?
Who can help?
No response
What are you working on?
We are trying to check sentence similarity between two files. Here is the code we are using:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

bertEmbeddings = BertEmbeddings \
    .pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", 'token']) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([documentAssembler, sentence, tokenizer,
                                 bertEmbeddings, embeddingsSentence,
                                 embeddingsFinisher, explodeVectors,
                                 vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```
Using Databricks Runtime 13.2, with spark-nlp imported from the Maven repository.
Current Behavior
Currently the package throws an error because it tries to PUT into the root S3 bucket, which is not supported:

```
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.ExceptionInInitializerError
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://***-prod-databricks-root.s3.us-east-1.amazonaws.com nvirginia-prod/2820278049549475/root/cache_pretrained/
```
Expected Behavior
The package should not throw Access Denied; alternatively, there should be a way to specify where the files are written.
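For reference, the download location can be overridden per session. A sketch using `sparknlp.start(params=...)`; the config key comes from the Spark NLP documentation, and `dbfs:/tmp/cache_pretrained` is a hypothetical writable path, not one taken from this issue:

```python
# Sketch: pass the cache folder as a session param so pretrained models are
# downloaded to a writable location instead of the workspace root bucket.
# "dbfs:/tmp/cache_pretrained" is a placeholder -- use any writable path.
params = {"spark.jsl.settings.pretrained.cache_folder": "dbfs:/tmp/cache_pretrained"}

# On a cluster with the sparknlp package installed:
#   import sparknlp
#   spark = sparknlp.start(params=params)
```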
Steps To Reproduce
```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, md5, monotonically_increasing_id
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer, BucketedRandomProjectionLSH

# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *

# Imported after the sparknlp wildcard so the Spark ML Normalizer
# (setInputCol/setOutputCol/setP) is used, not the Spark NLP annotator
# of the same name.
from pyspark.ml.feature import Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence") \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

bertEmbeddings = BertEmbeddings \
    .pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", 'token']) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["sentence", "bert"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings", "bert"]) \
    .setOutputCols("sentence_embeddings_vectors", "bert_vectors") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

explodeVectors = SQLTransformer() \
    .setStatement("SELECT EXPLODE(sentence_embeddings_vectors) AS features, * FROM __THIS__")

vectorNormalizer = Normalizer() \
    .setInputCol("features") \
    .setOutputCol("normFeatures") \
    .setP(1.0)

similarityChecker = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=6.0, numHashTables=10)

pipeline = Pipeline().setStages([documentAssembler, sentence, tokenizer,
                                 bertEmbeddings, embeddingsSentence,
                                 embeddingsFinisher, explodeVectors,
                                 vectorNormalizer, similarityChecker])

pipelineModel = pipeline.fit(primaryCorpus)
primaryDF = pipelineModel.transform(primaryCorpus)
secondaryDF = pipelineModel.transform(secondaryCorpus)

dfA = primaryDF.select("text", "features", "normFeatures") \
    .withColumn("lookupKey", md5("text")) \
    .withColumn("id", monotonically_increasing_id())
dfB = secondaryDF.select("text", "features", "normFeatures") \
    .withColumn("id", monotonically_increasing_id())

pipelineModel.stages[8].approxSimilarityJoin(dfA, dfB, 100, distCol="distance") \
    .where(col("datasetA.id") == col("datasetB.id")) \
    .select(col("datasetA.text").alias("idA"),
            col("datasetB.text").alias("idB"),
            col("distance")).show()
```
Spark NLP version and Apache Spark
Spark 3.4.0, com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
Type of Spark Application
No response
Java Version
No response
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
No response