JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

The sparkNLP model cannot be trained #13592

Closed yhp519 closed 1 year ago

yhp519 commented 1 year ago

When I use Spark NLP to train a Chinese text-classification model, both the loss and the accuracy stay unchanged across epochs, and it takes a long time to load the model before training starts.

The training log is as follows:

2023-03-01 23:03:07.163680: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:148] Reading SavedModel debug info (if present) from: /tmp/8f2cf44b7040_classifier_dl8801024446678227611
2023-03-01 23:03:07.185818: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2023-03-01 23:03:07.252743: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /tmp/8f2cf44b7040_classifier_dl8801024446678227611
2023-03-01 23:03:07.263910: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 113989 microseconds.
Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 64 - training_examples: 97943
Epoch 1/30 - 12.69s - loss: 620.8557 - acc: 0.90963256 - batches: 1531
Epoch 2/30 - 12.49s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 3/30 - 12.53s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 4/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 5/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 6/30 - 12.47s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 7/30 - 12.43s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 8/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 9/30 - 12.47s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 10/30 - 12.51s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 11/30 - 12.46s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 12/30 - 12.41s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 13/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 14/30 - 12.52s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 15/30 - 12.43s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 16/30 - 12.39s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 17/30 - 12.49s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 18/30 - 12.47s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 19/30 - 12.45s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 20/30 - 12.41s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 21/30 - 12.43s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 22/30 - 12.45s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 23/30 - 12.43s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 24/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 25/30 - 12.45s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 26/30 - 12.44s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 27/30 - 12.46s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 28/30 - 12.51s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 29/30 - 12.67s - loss: 620.5972 - acc: 0.90963256 - batches: 1531
Epoch 30/30 - 12.57s - loss: 620.5972 - acc: 0.90963256 - batches: 1531

My code:

import sparknlp
from sparknlp.annotator import *
from sparknlp.annotator.embeddings import BertSentenceEmbeddings
from sparknlp.base import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Tab-separated file with a header row: columns "label" and "content".
trainDataset = spark.read.option('header', True).option('sep', '\t').csv('./data2.csv')

document = DocumentAssembler() \
    .setInputCol("content") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Load a locally saved Chinese sentence-embeddings model.
use = BertSentenceEmbeddings.load('./sbert_chinese_qmc_finance_v1') \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")

# Train a classifier on the sentence embeddings, using the "label" column.
classification = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setLr(1e-3) \
    .setEnableOutputLogs(True)

pipeline = Pipeline(stages=[document, sentence, use, classification])
model = pipeline.fit(trainDataset)
model.write().overwrite().save('./model2')

data2.csv (tab-separated; the two sample rows mean "my first sentence" and "my second sentence"):

label	content
0	我的第一个句子
1	我的第二个句子

My env:

spark-nlp==4.3.1
pyspark==3.2.3

maziyarpanahi commented 1 year ago

Hi,

The batch size seems to be a bit high. I suggest playing around with the parameters, especially:

  • .setBatchSize(8)
  • .setLr(0.0005)

Then you can have a baseline for how changing these helps your model converge; a sketch follows this list.
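
A minimal sketch of applying those setters to the approach from the original post (the values here are just starting points to experiment with, not recommendations):

# Retrain with a smaller batch size and learning rate; setMaxEpochs
# controls how long the experiment runs.
classification = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setBatchSize(8) \
    .setLr(0.0005) \
    .setMaxEpochs(30) \
    .setEnableOutputLogs(True)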

yhp519 commented 1 year ago

> Hi,
>
> The batch size seems to be a bit high. I suggest playing around with the parameters, especially:
>
>   • .setBatchSize(8)
>   • .setLr(0.0005)
>
> Then you can have a baseline for how changing these helps your model converge.

I tried setting the batch size and learning rate to smaller values as you suggested, but the problem still exists.

2023-03-03 11:33:11.494679: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 113473 microseconds.
Training started - epochs: 50 - learning_rate: 0.005 - batch_size: 4 - training_examples: 94390
Epoch 1/50 - 146.36s - loss: 9490.271 - acc: 0.9105289 - batches: 23598
Epoch 2/50 - 146.36s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 3/50 - 146.52s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 4/50 - 146.08s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 5/50 - 146.07s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 6/50 - 145.88s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 7/50 - 145.91s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 8/50 - 146.08s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 9/50 - 146.72s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 10/50 - 147.31s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 11/50 - 146.32s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 12/50 - 146.42s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 13/50 - 146.48s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 14/50 - 146.84s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 15/50 - 146.25s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 16/50 - 147.23s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
Epoch 17/50 - 146.57s - loss: 9489.971 - acc: 0.9105289 - batches: 23598
maziyarpanahi commented 1 year ago

Have you tried using the model for prediction? Given the quality of the dataset and the word embeddings, this might be the highest accuracy it can reach. You can switch embeddings just to see how your model converges. (I suggest trying these models, just for your own testing: https://nlp.johnsnowlabs.com/models?type=model&task=Embeddings&annotator=BertSentenceEmbeddings&edition=Spark+NLP&language=xx)
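
Two quick checks, sketched assuming the variables from the original post are in scope ("sent_bert_multi_cased" is one example from the linked list; any BertSentenceEmbeddings model covering Chinese would do):

# 1) Inspect the fitted pipeline's actual predictions:
preds = model.transform(trainDataset)
preds.select("content", "label", "class.result").show(10, truncate=False)

# 2) Swap in a pretrained multilingual sentence-embeddings model
#    in place of the local one, then refit the pipeline:
use = BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")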

yhp519 commented 1 year ago

First of all, thank you for your patient answer.

I found the problem: the distribution of the 0 and 1 labels in my data is skewed. Under normal circumstances, the 0-labelled examples account for only a very small proportion of the data, and I simulated a positive ratio of about 1:9, which may be the cause of the problem. Thanks!
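
A quick way to confirm the skew before training is to look at the label counts:

# Inspect the label distribution of the training data:
trainDataset.groupBy("label").count().show()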

maziyarpanahi commented 1 year ago

This is a great finding! I was actually going to suggest looking at data imbalance, augmentation, etc. in your training examples. (When the data is skewed toward some labels compared to others, the model usually stops converging. We use a mechanism that doesn't allow overfitting, or at least not that fast, so it seems to stop learning once there is nothing left to learn/challenge it.)
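
As one possible mitigation, a sketch of downsampling the majority label with PySpark's sampleBy before fitting (the 0.12 fraction is illustrative for a roughly 1:9 split; the label column is read from CSV as strings, hence the string keys):

# Downsample the majority label ("1") so the classes are roughly balanced,
# then verify the new distribution and refit.
balanced = trainDataset.sampleBy("label", fractions={"0": 1.0, "1": 0.12}, seed=42)
balanced.groupBy("label").count().show()
model = pipeline.fit(balanced)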

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days