NSManogna opened 3 months ago
Could you please provide the actual code you used to start the SparkSession and build the pipeline, so we can reproduce it?
The zip file I attached has a .ipynb file which contains the code.
Please include the code here or on Google Colab. We are not allowed to download and open zip files for security reasons.
You just need to follow the template, nothing more and nothing less. The issue template is designed based on years of experience.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
spark = sparknlp.start()
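(A hedged aside, possibly relevant to the error reported below: sparknlp.start() accepts a memory argument for the driver JVM, and raising it is a common first step when larger training sets make Py4J lose its connection. The 16G value is an assumption, not a tested setting; an m5.2xlarge has 32 GiB to work with.)

spark = sparknlp.start(memory="16G")  # assumed value, sized to the instance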
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
POSTag = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setDropout(0.5) \
    .setValidationSplit(0.2)
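(A hedged debugging aid: NerDLApproach can emit per-epoch training logs via setEnableOutputLogs/setOutputLogsPath, which may show how far training gets before the connection drops. The log path here is hypothetical.)

ner_model = ner_model \
    .setEnableOutputLogs(True) \
    .setOutputLogsPath("./ner_logs")  # hypothetical path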
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

c_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    POSTag
])
import pandas as pd
import ast
from pyspark.sql.functions import explode, col

df = pd.read_csv("pii_dataset.csv")
df1 = spark.createDataFrame(df)
f_model = c_pipeline.fit(df1)
result = f_model.transform(df1)

df_new = df1.join(result.select("text", "pos.result"), on="text", how="left")
df_new = df_new.withColumnRenamed("result", "pos_tags")
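(A hedged aside: joining back on the raw text column assumes every text in the CSV is unique; duplicated texts would multiply rows. If duplicates can occur, a synthetic id is safer. The row_id column name is made up for this sketch.)

from pyspark.sql.functions import monotonically_increasing_id
df1 = df1.withColumn("row_id", monotonically_increasing_id())
result = c_pipeline.fit(df1).transform(df1)
df_new = df1.join(result.select("row_id", "pos.result"), on="row_id", how="left") \
    .withColumnRenamed("result", "pos_tags")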
df_new2 = df_new.toPandas()
# the tokens and labels columns in the CSV hold stringified Python lists
df_new2['tokens'] = df_new2['tokens'].apply(ast.literal_eval)
df_new2['labels'] = df_new2['labels'].apply(ast.literal_eval)

selected_df = spark.createDataFrame(df_new2)
rows_as_dicts = selected_df.rdd.map(lambda row: row.asDict()).collect()
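(Since df_new2 is already a pandas DataFrame, the round-trip back through Spark just to collect plain dicts can likely be skipped; a sketch, assuming the columns are ordinary Python lists after the literal_eval step above:)

rows_as_dicts = df_new2.to_dict("records")  # one dict per row, same shape as the RDD version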
def convert_to_conll(sentences):
    conll_lines = []
    for sentence in sentences:
        tokens, labels, pos_tags = sentence['tokens'], sentence['labels'], sentence['pos_tags']
        for token, label, pos_tag in zip(tokens, labels, pos_tags):
            # four space-separated columns (word, POS, chunk, label), as the
            # CoNLL 2003 reader expects; the POS tag doubles as the chunk column here
            conll_lines.append(f"{token} {pos_tag} {pos_tag} {label}")
        conll_lines.append("")  # blank line separates sentences
    return "\n".join(conll_lines)
conll_data = convert_to_conll(rows_as_dicts)
with open('annotations.conll', 'w') as file:
    file.write(conll_data)
print("Dataset converted to CoNLL format and saved as 'annotations.conll'.")
nerpipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])
from sparknlp.training import CoNLL
conll_instance = CoNLL()
training_data = conll_instance.readDataset(spark=spark, path='annotations.conll')
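(A hedged sanity check before fitting: peeking at the parsed labels can confirm the generated file was read the way it was intended.)

training_data.selectExpr("text", "label.result").show(3, truncate=False)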
model = nerpipeline.fit(training_data)
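(Once fit succeeds, a minimal sketch of using the trained pipeline on new text with LightPipeline; the sample sentence is invented.)

from sparknlp.base import LightPipeline
light = LightPipeline(model)
light.annotate("John Smith lives in Paris")  # returns a dict keyed by output columns (token, ner, entities, ...)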
Is there an existing issue for this?
Who can help?
No response
What are you working on?
I am training NerDLApproach for custom entities. When I increase the size of the training data, I get this error: Py4JError: An error occurred while calling o9368.fit, and the connection is refused.
Current Behavior
I get this error: Py4JError: An error occurred while calling o9368.fit, and the connection is refused.
Expected Behavior
Training should complete, and the trained model should then be usable for NER on new text.
Steps To Reproduce
CoNll.zip
Spark NLP version and Apache Spark
I have launched johnsnowlab on an EC2 instance of type m5.2xlarge.
Type of Spark Application
Python Application
Java Version
No response
Java Home Directory
No response
Setup and installation
Spark NLP in johnsnowlab
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
Please let me know if any further information is needed.