JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

NerDLApproach needs excessive Driver memory to be trained #1060

Closed: sdabbour-stratio closed this issue 3 years ago

sdabbour-stratio commented 3 years ago

NerDLApproach needs excessive driver memory to be trained: to train on a decent dataset (a considerable number of samples), a huge amount of RAM is needed on the driver, and there is no simple way to analyze or calculate how much memory is needed.

Description

I am trying to train a NerDL model using Spark NLP. I have a dataset of 100,000 records, about 170 MB, and in order to train the model correctly without suffering from OutOfMemory issues, I had to provide the driver machine with at least 600 GB of RAM.

It is an issue because it is not logical to have a Spark cluster, try to take advantage of its distributed storage and processing, and still need excessive memory on a single machine; in this case, traditional (single-machine) training of the model may be more efficient.

Expected Behavior

I expect the model training to be distributed and the cluster's memory usage to be optimized, not to need a driver with an enormous amount of memory.

Current Behavior

Training the pipeline (Stage 1: pretrained BertEmbeddings, Stage 2: NerDLApproach) with a dataset of 100,000 lines (approx. 175 MB) requires at least 600 GB of RAM for the Spark driver!
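For reference, a minimal sketch of the session setup such a run implies; the `600g` value simply mirrors the figure above (not a recommendation), and `spark.driver.memory` only takes effect if it is supplied before the driver JVM starts (e.g. via `spark-submit --driver-memory`):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: spark.driver.memory must be supplied at launch time
// (spark-submit --driver-memory 600g); setting it on an already-running
// driver JVM has no effect.
val spark = SparkSession.builder()
  .appName("NerDL training")
  .config("spark.driver.memory", "600g") // mirrors the 600 GB reported above
  .config("spark.kryoserializer.buffer.max", "2000M") // commonly raised for Spark NLP
  .getOrCreate()
```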

Possible Solution

Distribute the NerDL training.

Steps to Reproduce

  1. Set up Spark NLP
  2. Generate a dataset (at least 30,000 records)
  3. Load the dataset
  4. Run the following pipeline:

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

// Note: the original snippet called this glove_embeddings, but it loads BERT.
val bertEmbeddings = BertEmbeddings.pretrained(name = "bert_multi_cased", lang = "xx")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// numEpochs and graphFolder are assumed to be defined elsewhere.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(numEpochs)
  .setLr(0.001f)
  .setPo(0.005f)
  .setBatchSize(10)
  .setRandomSeed(0)
  .setVerbose(1)
  .setValidationSplit(0.2f)
  .setEvaluationLogExtended(true)
  .setEnableOutputLogs(true)
  .setIncludeConfidence(true)
  .setGraphFolder(graphFolder)

val ner_pipeline = new Pipeline().setStages(
  Array(
    bertEmbeddings,
    nerTagger
  )
)

val ner_model = ner_pipeline.fit(training_data)
```

Context

I am trying to train a model to be used in an NLP project, but it is not feasible to provision such a driver machine for this dataset.

Your Environment

maziyarpanahi commented 3 years ago

Hi,

lucianoalzugaraydc commented 3 years ago

Hi everyone, I have the same issue. I realized that this only happens when I use a custom TensorFlow graph. I had two models trained in the same way with the same data: one had to find only two entities and the other had to find ~5000 entities. Running the first example, the execution succeeds with 16 GB of memory. On the other hand, running the dataset with more than 5000 entities to recognize takes more than 198 GB and is still insufficient. @sdabbour-stratio what is the size of your tf graph?

maziyarpanahi commented 3 years ago

Hi, please have a look at the release notes of 2.6.3: we introduced a new param in NerDLApproach that fits each epoch into the driver memory, so you don't have to increase the memory when the full dataset doesn't fit. (Obviously each epoch takes a bit longer if the param is enabled.)

PS: the number of unique chars, the embeddings, and the total number of tags (in your case 5000) directly impact memory, as the network has to learn 5000 unique classes instead of the 10 or 30 that are typical for NER tags. (Totally doable, though.)
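To see how many classes the graph will need, a rough sketch is to count the distinct tags in the label column; this assumes the training data came from the `CoNLL()` reader, so `label` is an annotation column whose `result` field holds the tag strings:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Count the distinct NER tags the graph must learn; the tag strings live
// in the "result" field of the label annotations produced by CoNLL().
val numClasses = training_data
  .select(explode(col("label.result")).as("tag"))
  .distinct()
  .count()

println(s"The NerDL graph needs $numClasses output classes")
```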

maziyarpanahi commented 3 years ago

We do have a new param, `.setEnableMemoryOptimizer(True)`, in NerDLApproach to optimize training based on the memory available on the driver.

pip install spark-nlp==2.7.0 --upgrade
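A minimal sketch of the trainer from the issue with the new param enabled (Scala, hence `true` rather than the Python `True`; per the comments above, each epoch gets a bit slower, but the whole dataset no longer has to fit in driver memory at once):

```scala
// Same trainer as in the issue, with the 2.6.3+ memory optimizer enabled.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(numEpochs)
  .setBatchSize(10)
  .setEnableMemoryOptimizer(true) // fit each epoch to driver memory instead of holding it all at once
```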