explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Spacy evaluate() Out Of GPU Memory on transformer models #9791

Closed mbrunecky closed 2 years ago

mbrunecky commented 2 years ago

The Spacy language.evaluate() method is used both during Spacy training (to evaluate training progress against the 'dev' corpus) and separately, as a command-line operation (to assess a trained model against some evaluation corpus).

When used with transformer models, evaluate() often results in a GPU Out Of Memory (OOM) error. The likelihood of OOM is directly related to the evaluated corpus size. In other words, the 'dev' (or eval) corpus size is limited by the available GPU memory. In my case, a 12GB RTX 3060 limits the 'dev' corpus size to below 500 * 900 word samples. But there is no simple formula to pre-compute this 'limit', and no parameters one could use to control the GPU memory usage. Determining the maximum 'dev' corpus size is a tedious trial-and-error process.

One can use a somewhat bigger corpus (~2 times) for the command-line evaluate(), because no GPU memory is being used by training.

The problem is caused by the document 'trf_data' (tensors) attached to each 'predicted' document in the list of Examples used for scoring. The trf_data (tensors) cannot be released before the last pipeline component using it has finished - usually the last pipeline component (there is more detail under discussion #9602).

The current language.evaluate() method processes the entire 'dev' (or eval) corpus as ONE batch, regardless of the batch_size setting (batch_size is passed to the individual pipeline components, which may or may not honor it). Only after the entire 'dev' (or eval) set has been processed does the code invoke the scorer, and only after the scoring does it release all the data.

This approach avoids any 'incremental' scoring. On the other hand, keeping the entire 'dev' (or eval) corpus cached means a lot of cached data (each Example carries both the reference ('dev') and the 'predicted' document). When not using transformers, this data is kept in CPU memory, which may be considered 'cheap': just buy more. But in the case of transformer models, GPU memory is 'precious'.

The only 'workaround' is limiting (often severely) the 'dev' corpus. In my specific case, instead of the recommended 10-20% of the 'train' corpus, I am limited to barely 5%. Depending on the dataset variability, a limited 'dev' corpus may or may not be enough to represent the training accuracy.

Using Spacy 3.1, my 'solution' has been to rewrite the language.evaluate() method to honor batch_size and to perform incremental scoring for each batch_size set of documents. This required changing the scorer to support 'incremental' scoring - contrary to the (now documented) return values of the scoring methods. Instead of returning a Dict of keyed float values (or a Dict of per-type scores), I needed something that can be aggregated - I used the PRFScore class. An added Aggregator class combines the incremental PRFScores and at the end produces a result identical to the original score() implementation: Dict[str, Union[float, Dict[str, float]]].

The PRFScore object provides an add() method, but some of the other currently reported float scores lack a 'scale' (acc=1.00 on WHAT?). Perhaps in some cases one could use the actual batch size as the score 'weight', but that may not be accurate.
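
To illustrate what I mean by aggregation, here is a minimal sketch of my own: only PRFScore and its tp/fp/fn counts come from spacy.scorer, while the Aggregator class and the key naming are mine and simplified.

    from typing import Dict
    from spacy.scorer import PRFScore

    class Aggregator:
        """Accumulates per-batch PRFScore counts so that the precision/recall/F
        computed at the end matches one score() call over all examples."""

        def __init__(self) -> None:
            self.totals: Dict[str, PRFScore] = {}

        def add(self, key: str, batch_score: PRFScore) -> None:
            # Sum the raw tp/fp/fn counts; P/R/F are derived only at the end.
            total = self.totals.setdefault(key, PRFScore())
            total.tp += batch_score.tp
            total.fp += batch_score.fp
            total.fn += batch_score.fn

        def results(self) -> Dict[str, float]:
            # Produce the familiar flat layout, e.g. ents_p / ents_r / ents_f.
            out: Dict[str, float] = {}
            for key, score in self.totals.items():
                out[f"{key}_p"] = score.precision
                out[f"{key}_r"] = score.recall
                out[f"{key}_f"] = score.fscore
            return out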

Given that Spacy 3.2 officially documents the current, non-incremental scoring, I am not sure what 'other' solutions one could use. Perhaps supporting a 'custom' evaluate() method (along with custom scorers) would allow 'solutions' such as mine. In addition, one could add a call to torch.cuda.empty_cache() at the start of evaluate() and report torch memory usage stats.
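
For the memory reporting part, something along these lines at the start of evaluate() would already help (a sketch using standard torch.cuda calls; the function name and MiB formatting are mine):

    import torch

    def log_gpu_memory(prefix: str = "evaluate") -> None:
        """Release cached allocator blocks and report current GPU memory usage."""
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            allocated = torch.cuda.memory_allocated() / 1024 ** 2
            reserved = torch.cuda.memory_reserved() / 1024 ** 2
            print(f"{prefix}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")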

Your Environment

spaCy version: 3.2.0
Location: C:\Work\ML\Spacy3\lib\site-packages\spacy
Platform: Windows-10-10.0.19042-SP0
Python version: 3.9.7
Pipelines: en_core_web_lg (3.2.0), en_core_web_md (3.2.0), en_core_web_sm (3.2.0), en_core_web_trf (3.2.0)
GPU: NVIDIA GeForce RTX 3060 12GB
CPU: Intel Xeon E5-2867W, 20 cores, 64 GB RAM

adrianeboyd commented 2 years ago

We did trade incremental scoring for more modular scoring in v3.

I'll take another look at nlp.evaluate to see if it's possible to make it more like nlp.pipe without losing any of the current configuration options. Actually, now that you can pass in Doc objects to nlp.pipe as of v3.2, I think we can just have it call nlp.pipe so the behavior is the same, and I think this should improve the memory usage if you have a custom component that deletes trf_data at the end of the pipeline.
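
Such a component can be quite small, for example (a rough sketch; the component name "remove_trf_data" is just a placeholder):

    from spacy.language import Language
    from spacy.tokens import Doc

    @Language.component("remove_trf_data")  # placeholder name
    def remove_trf_data(doc: Doc) -> Doc:
        # Drop the cached transformer output once downstream components are done with it.
        if doc.has_extension("trf_data"):
            doc._.trf_data = None
        return doc

    # then: nlp.add_pipe("remove_trf_data", last=True)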

In general, if you have memory usage issues during training related to the eval step, I think our best suggestion is to use a smaller, but still representative, dev set. The dev set is primarily used for early stopping and to select model-best, so you could also use other parameters to determine how long the model should train, such as max_steps or max_epochs, and use model-last instead if you'd like to choose a different stopping point than the one picked using the dev set.

If you're calling nlp.evaluate or spacy evaluate directly, you do have similar constraints as during training, but if you want to score a larger dataset and the main issue is memory usage during processing, you can also create the examples separately, deleting trf_data or managing the docs however you like with custom components or post-processing, and then score using scorer.score. You wouldn't need to process all the docs at once as long as you can load the final annotation that you want to evaluate into the examples that you pass to the scorer.
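
As a rough sketch of that workflow (not the exact implementation; the helper name and batch size are placeholders, and examples is assumed to come from a corpus reader):

    from spacy.scorer import Scorer
    from spacy.training import Example
    from spacy.util import minibatch

    def score_in_batches(nlp, examples, batch_size=32):
        """Run the pipeline batch by batch, drop trf_data as soon as each batch
        is processed, and score everything with a single Scorer.score() call."""
        scored = []
        for batch in minibatch(examples, size=batch_size):
            texts = (eg.reference.text for eg in batch)
            for doc, eg in zip(nlp.pipe(texts, batch_size=batch_size), batch):
                if doc.has_extension("trf_data"):
                    doc._.trf_data = None  # keep the annotations, free the tensors
                scored.append(Example(doc, eg.reference))
        return Scorer(nlp).score(scored)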

mbrunecky commented 2 years ago

Great suggestion @adrianeboyd! Instead of invoking the pipeline for ALL Examples, invoking it in batch_size increments will allow adding a component that releases the trf_data at the end of each batch, instead of after the last example. It may still keep all Examples in memory, but releasing trf_data will reduce the GPU footprint to that of one (dev) batch_size. I believe the 'scorer' does not use the trf_data (it is not 'transformer' sensitive); all the current scoring methods go by the Example.x,y docs.

On the smaller but representative dev set: it depends. I have one dataset that is fairly consistent, and a small sample is fairly representative. I have another set where it is not. In such cases, I use the spacy command-line evaluate after the training, but (currently) it has the same limitation - it uses the same code, though it can use a little more GPU memory. Your fix (above) would be a great help there.

I also learned that computing scores every 200 iterations on a 10,000-document epoch just adds overhead. Perhaps the frequency should be expressed as a fraction of the 'epoch' (or the documentation should recommend considering the 'epoch' size when setting it up).

On a minor side note, the command-line evaluate() uses the Corpus constructor without passing in max_length or any other config parameters, making it 'different' from what is used in training. Specifically, using max_length=0 versus a non-zero size may lead to different results (I think I see the culprit in the scorer code for NER).
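
For what I mean by 'different', a sketch ("dev.spacy" is a placeholder path and the training value is just an example):

    from spacy.training import Corpus

    # What the CLI evaluate effectively does: default settings, so max_length=0 (no splitting).
    cli_corpus = Corpus("dev.spacy")

    # What training uses if [corpora.dev] sets max_length: documents longer than max_length
    # are split up (or skipped), which changes the context the model sees when predicting.
    train_corpus = Corpus("dev.spacy", max_length=200)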

adrianeboyd commented 2 years ago

O is a valid NER annotation so that shouldn't be the culprit. The sentences-as-docs introduced by max_length would make a difference in the predictions since the context isn't identical for the model when predicting. Otherwise the culprit might be sentence boundaries vs. no sentence boundaries, since the NER component won't predict entities across sentence boundaries.

mbrunecky commented 2 years ago

Thank you again, @adrianeboyd. I hope you can do something similar in the official code. It works like a champ... I can now run any evaluation corpus through (I just have to be careful about the batch_size). In hindsight, I should have realized that releasing trf_data after each 'minibatch' (BUT keeping all examples) is enough to solve the GPU problem - without going to incremental scoring. Below is my modification of the language.py evaluate() method between the start/stop timers. The code I added to release the trf_data (if present) could probably be conditionalized even further, if one can detect the transformer's presence in the pipeline in some simple way. I did not try to deal with the Corpus (not using the model configuration), because I do not use one.

CODE (note there are only a couple of changed lines):

        # reset annotation in predicted docs and time tokenization
        start_time = timer()
        # this is purely for timing
        for eg in examples:
            self.make_doc(eg.reference.text)
        # split all examples into batches
        for example_batch in util.minibatch(examples, batch_size):
            # apply all pipeline components to this batch
            for name, pipe in self.pipeline:
                kwargs = component_cfg.get(name, {})
                kwargs.setdefault("batch_size", batch_size)
                for doc, eg in zip(
                    _pipe(
                        (eg.predicted for eg in example_batch),
                        proc=pipe,
                        name=name,
                        default_error_handler=self.default_error_handler,
                        kwargs=kwargs,
                    ),
                    example_batch,
                ):
                    eg.predicted = doc
            # release the (no longer needed) transformer data to free GPU memory
            for eg in example_batch:
                if eg.predicted.has_extension("trf_data"):
                    eg.predicted._.trf_data = None

        end_time = timer()
        results = scorer.score(examples)

mbrunecky commented 2 years ago

I am glad you pointed out "the culprit might be sentence boundaries vs. no sentence boundaries, since the NER component won't predict entities across sentence boundaries". I am not sure if you mean "NER entities may not cross sentence boundaries", or if you mean "NER entity scope is limited to a sentence".

Is there any documentation describing the 'assumptions and tradeoffs' that the NER component is making?

Some of them may explain why I am still unable to achieve Spacy 2 accuracy with Spacy 3 - even on identical data sets.

mbrunecky commented 2 years ago

Well, if I read the enhancement code in #9800 correctly, it restores batching, but it does not automatically release the trf_data. That means users will still suffer GPU Out Of Memory errors unless they add some "free_trf_data" component to their pipeline. Since the evaluate() code is completely self-contained, it could automatically release trf_data without requiring the user to suffer through the OOM analysis and the search for a solution. It is hard enough to tune the 'training' portion to fit into the GPU; OOM during evaluate() just adds a misleading distraction.

adrianeboyd commented 2 years ago

v3.2.1 adds the built-in doc_cleaner component and we will consider whether we want to add it to the provided trf pipelines by default. Many users do want access to trf_data in their docs at the end of the pipeline, so it's hard to know what the best default is.
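
Adding it looks roughly like this (a sketch; check the doc_cleaner docs for the exact config keys):

    import spacy

    nlp = spacy.load("en_core_web_trf")
    # doc_cleaner nulls out the listed attributes at the end of the pipeline,
    # here the cached transformer output, so the tensors can be freed.
    nlp.add_pipe("doc_cleaner", config={"attrs": {"_.trf_data": None}})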

If you're training your own model, it'll be up to you to decide how to handle trf_data in your pipeline. The core spacy library never touches any custom extensions because you can really and truly never know what users might want to do with their extensions, so we don't want to try to have Language.evaluate do anything with trf_data by default.

mbrunecky commented 2 years ago

My point is that within the scope of evaluate(), the user can NOT do anything with the document – it is discarded by the evaluate() code. Most spaCy users probably do not start with custom pipeline components. And forcing an end user to learn how to add custom components only to avoid running out of GPU memory during 'standard training' does not feel very user friendly.

Even worse, IF I design my pipeline to keep trf_data (or any other data) attached to the document after the pipeline completes (for whatever need), then for the sake of the training evaluate() step I would have to use a different pipeline. And changing config.cfg inside the model after a completed training does not feel like the 'right' approach.

adrianeboyd commented 2 years ago

I understand that this is frustrating, but there's no good general-purpose way for spacy to handle this for data saved in custom extensions in Language.evaluate. (It would be weird, but you could imagine a scoring method that referred to trf_data for some reason.) My impression is that users more typically run out of memory in the training step than in the eval step, and they often turn to other options to manage memory usage: lowering batch sizes, using a smaller transformer model, using a CNN model instead, etc., so this will mainly add another option to this list if it turns out that the eval step is the issue.

It's not intended to be difficult to remove or disable a component before packaging a pipeline for distribution. You can use spacy assemble with a new config to define the pipeline and source the components you want and also disable the ones that you don't want to be run by default, but want to be available for particular purposes, like we do for senter in the trained pipelines.

mbrunecky commented 2 years ago

Thank you, Adriane, for your explanation.

I understand the difficulty of the task. You are more aware of all the possible use cases. I guess all you can do is document the possible GPU OOM scenarios. My impression from the spaCy discussions is that most people running into GPU OOM have no idea what to look for (I was definitely one of them) and end up beating the wrong horse – such as the nlp batch size (which in the trf_data case did not help).

It seems that the majority of spaCy users work with human-annotated training data, which limits the sample sizes, and hence GPU memory is not an issue. People like me using machine-generated data markup seem to be a minority. But we tend to use larger training samples ("more is (usually) better").

Anyway, I appreciate your work on this issue (and on spaCy overall). Despite harping on it, I like the Spacy 3 features – it was a monumental step up. Of course, I focus on the features I need to build a cloud production system (submit – train – deploy – utilize).

Thanks again and keep up the good work!


github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.