JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

NerDLApproach needs excessive Driver memory to be trained #1060

Closed: sdabbour-stratio closed this issue 3 years ago

sdabbour-stratio commented 3 years ago

NerDLApproach needs excessive driver memory to be trained: to train on a decent dataset (a considerable number of samples), a huge amount of RAM is needed on the driver, and there is no simple way to analyze or calculate how much memory is needed.

Description

I am trying to train a NerDL model using Spark NLP. I have a dataset of 100,000 records, about 170 MB, and in order to train the model correctly without suffering from OutOfMemory issues, I had to provide the driver machine with at least 600 GB of RAM.

It is an issue because it is not logical to have a Spark cluster, try to take advantage of its distributed storage and processing, and still need excessive memory on a single machine; in this case, traditional (single-machine) training of the model may be more efficient.

Expected Behavior

I expect the model training to be distributed and the cluster's memory usage to be optimized, not to need a driver with an enormous amount of memory.

Current Behavior

Training the pipeline (Stage 1: pretrained BertEmbeddings, Stage 2: NerDLApproach) with a dataset of 100,000 lines (approx. 175 MB) requires at least 600 GB of RAM for the Spark driver!
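For reference, a minimal sketch of the session setup such a run implies; the `600g` value simply mirrors the figure above (not a recommendation), and `spark.driver.memory` only takes effect if it is supplied before the driver JVM starts (e.g. via `spark-submit --driver-memory`):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: spark.driver.memory must be supplied at launch time
// (spark-submit --driver-memory 600g); setting it on an already-running
// driver JVM has no effect.
val spark = SparkSession.builder()
  .appName("NerDL training")
  .config("spark.driver.memory", "600g") // mirrors the 600 GB reported above
  .config("spark.kryoserializer.buffer.max", "2000M") // commonly raised for Spark NLP
  .getOrCreate()
```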

Possible Solution

Distribute the NerDL training.

Steps to Reproduce

  1. Set up Spark NLP
  2. Generate a dataset (at least 30,000 records)
  3. Load the dataset
  4. Run the following pipeline:

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

// Note: the original snippet called this glove_embeddings, but it loads BERT.
val bertEmbeddings = BertEmbeddings.pretrained(name = "bert_multi_cased", lang = "xx")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// numEpochs and graphFolder are assumed to be defined elsewhere.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(numEpochs)
  .setLr(0.001f)
  .setPo(0.005f)
  .setBatchSize(10)
  .setRandomSeed(0)
  .setVerbose(1)
  .setValidationSplit(0.2f)
  .setEvaluationLogExtended(true)
  .setEnableOutputLogs(true)
  .setIncludeConfidence(true)
  .setGraphFolder(graphFolder)

val ner_pipeline = new Pipeline().setStages(
  Array(
    bertEmbeddings,
    nerTagger
  )
)

val ner_model = ner_pipeline.fit(training_data)
```

Context

I am trying to train a model to be used in an NLP project, but it is not feasible to provision such a driver machine for this dataset.

Your Environment

maziyarpanahi commented 3 years ago

Hi,

lucianoalzugaraydc commented 3 years ago

Hi everyone, I have the same issue. I realized that this only happens when I use a custom TensorFlow graph. I had two models trained in the same way with the same data: one had to find only two entities and the other had to find ~5000 entities. Running the first example, the execution succeeds with 16 GB of memory. On the other hand, running the dataset with more than 5000 entities to recognize takes more than 198 GB and is still insufficient. @sdabbour-stratio what is the size of your tf graph?

maziyarpanahi commented 3 years ago

Hi, please have a look at the release notes of 2.6.3: we introduced a new param in NerDLApproach that fits each epoch into the driver memory, so you don't have to increase the memory when the full dataset doesn't fit. (Obviously each epoch takes a bit longer if the param is enabled.)

PS: the number of unique chars, the embeddings, and the total number of tags (in your case 5000) directly impact memory, as the network has to learn 5000 unique classes instead of the 10 or 30 that are typical for NER tags. (Totally doable, though.)
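To see how many classes the graph will need, a rough sketch is to count the distinct tags in the label column; this assumes the training data came from the `CoNLL()` reader, so `label` is an annotation column whose `result` field holds the tag strings:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Count the distinct NER tags the graph must learn; the tag strings live
// in the "result" field of the label annotations produced by CoNLL().
val numClasses = training_data
  .select(explode(col("label.result")).as("tag"))
  .distinct()
  .count()

println(s"The NerDL graph needs $numClasses output classes")
```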

maziyarpanahi commented 3 years ago

We do have a new param, `.setEnableMemoryOptimizer(True)`, in NerDLApproach to optimize training based on the memory available on the driver.

pip install spark-nlp==2.7.0 --upgrade
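A minimal sketch of the trainer from the issue with the new param enabled (Scala, hence `true` rather than the Python `True`; per the comments above, each epoch gets a bit slower, but the whole dataset no longer has to fit in driver memory at once):

```scala
// Same trainer as in the issue, with the 2.6.3+ memory optimizer enabled.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(numEpochs)
  .setBatchSize(10)
  .setEnableMemoryOptimizer(true) // fit each epoch to driver memory instead of holding it all at once
```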