JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Py4JError: An error occurred while calling o9368.fit #14375

Open · NSManogna opened this issue 3 weeks ago

NSManogna commented 3 weeks ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

I am training NerDLApproach for custom entities. When I increase the size of the training data, I get the error Py4JError: An error occurred while calling o9368.fit, and the connection is refused.

Current Behavior

I get the error message Py4JError: An error occurred while calling o9368.fit, and the connection is refused.

Expected Behavior

Model training should complete, and the trained model should then be usable for NER on new text.

Steps To Reproduce

CoNll.zip

Spark NLP version and Apache Spark

I have launched John Snow Labs on an EC2 instance of type m5.2xlarge.

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

Spark NLP in John Snow Labs

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Please let me know if any further information is needed.

maziyarpanahi commented 3 weeks ago

Could you please provide the actual code you used to start the SparkSession and build the pipeline, so we can reproduce it?

NSManogna commented 3 weeks ago

The zip file I attached contains a .ipynb file with the code.

maziyarpanahi commented 3 weeks ago

Please include the code here or on Google Colab. We are not allowed to download and open zip files for security reasons.

You just need to follow the template, nothing more and nothing less. The issue template is designed based on years of experience.

NSManogna commented 3 weeks ago

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()
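
Since the crash only shows up once the training data grows, and a refused Py4J connection usually means the driver JVM has died, one hedged guess is that the default driver memory is too small. A minimal sketch, assuming memory is the bottleneck (the 16G value is an illustrative assumption for an m5.2xlarge with 32 GiB of RAM, not a tested setting):

# Sketch: start Spark NLP with more driver memory than the default.
# The "16G" value below is an assumption chosen for illustration.
spark = sparknlp.start(memory="16G")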

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setDropout(0.5) \
    .setValidationSplit(0.2)
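
If memory is indeed the issue, NerDLApproach also exposes setEnableMemoryOptimizer, which trades training speed for lower memory use on large datasets. A hedged sketch of the same annotator with that flag enabled (whether it resolves this particular crash is an assumption):

# Sketch: the same NerDLApproach, trading speed for lower memory use.
ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setEnableMemoryOptimizer(True)  # optimize for large training sets; can slow training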

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

c_pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, POSTag, chunker])  # chunker included so the "chunk" column used below exists

import pandas as pd
import ast
from pyspark.sql.functions import explode, col

# Load the CSV with pandas, keep the first 1000 rows, and convert to a Spark DataFrame
df = pd.read_csv("pii_dataset.csv")
df = df.head(1000)
df1 = spark.createDataFrame(df)

f_model = c_pipeline.fit(df1)
result = f_model.transform(df1)

result.select( explode(col("chunk.result")).alias("chunk_tag")).show(truncate=False)

df_new = df1.join(result.select("text", "pos.result"), on="text", how="left")
df_new = df_new.withColumnRenamed("result", "pos_tags")

df_new1 = df_new.join(result.select("text", "chunk.result"), on="text", how="left")

df_new1 = df_new1.withColumnRenamed("result", "chunks")

df_new2 = df_new.toPandas()
df_new2['tokens'] = df_new2['tokens'].apply(ast.literal_eval)
df_new2['labels'] = df_new2['labels'].apply(ast.literal_eval)

selected_df = spark.createDataFrame(df_new2)
rows_as_dicts = selected_df.rdd.map(lambda row: row.asDict()).collect()

def convert_to_conll(sentences):
    conll_lines = []
    for sentence in sentences:
        tokens, labels, pos_tags = sentence['tokens'], sentence['labels'], sentence['pos_tags']
        for token, label, pos_tag in zip(tokens, labels, pos_tags):
            # Four space-separated columns (word, POS, chunk, NER), as the
            # CoNLL 2003 format expects; the POS tag doubles as the chunk column here.
            conll_lines.append(f"{token} {pos_tag} {pos_tag} {label}")
        conll_lines.append("")  # blank line separates sentences
    return "\n".join(conll_lines)

conll_data = convert_to_conll(rows_as_dicts)

with open('annotations.conll', 'w') as file:
    file.write(conll_data)

print("Dataset converted to CoNLL format and saved as 'annotations.conll'.")

nerpipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter])

from sparknlp.training import CoNLL

conll_instance = CoNLL()

training_data = conll_instance.readDataset(spark=spark, path='annotations.conll')

model = nerpipeline.fit(training_data)
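
Once fit completes, the fitted pipeline can be applied to unseen text; a short usage sketch (the sample sentence is made up):

# Sketch: run the trained pipeline on new text and inspect the predicted entities.
test_df = spark.createDataFrame([["John Smith lives in Berlin."]]).toDF("text")
predictions = model.transform(test_df)
predictions.select("entities.result").show(truncate=False)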