fcggamou opened this issue 4 years ago
I noticed from the stack trace that the error is raised during evaluation on the dev set, so I reduced the dev set to half its size and now it works. It's not an ideal workaround, though; any other suggestion is appreciated.
I'm a bit confused, when you originally trained the model, didn't you evaluate it on the dev set?
I did evaluate it on the dev set, and it worked, hence my confusion as well: why does it work when training from scratch but fail when attempting to resume training?
And the pipeline was otherwise entirely the same? So there are no differences between the first training run and the resumed run (other than the "source" bit in the config, of course)?
Yes, exactly the same. Also the same train and dev data.
Looking at the output:
Aborting and saving the final best model. Encountered exception: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 15.75 GiB total capacity; 13.81 GiB already allocated; 78.88 MiB free; 14.34 GiB reserved in total by PyTorch)
It appears that your memory is already occupied somewhere else. I'm not sure how this fits together with the shell command (this should actually not apply here), but PyTorch can sometimes be a bit problematic when it comes to releasing GPU memory.
Edit: It could of course be that the allocation happens gradually and only the last part is shown. But it may be worth checking the memory allocation before resuming the training.
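For what it's worth, here's a minimal way to inspect what is already allocated before resuming (assuming the PyTorch backend, as in the traceback above; nothing here is spaCy-specific):

```python
import torch

# Hedged sketch: report current GPU memory usage before launching the resumed run.
if torch.cuda.is_available():
    gib = 1024 ** 3
    props = torch.cuda.get_device_properties(0)
    print(f"total:     {props.total_memory / gib:.2f} GiB")
    print(f"allocated: {torch.cuda.memory_allocated(0) / gib:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(0) / gib:.2f} GiB")
```

If the allocated/reserved numbers are already substantial before training starts, the problem is upstream of spaCy.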
Either way, this part in language.evaluate() is probably the culprit:
if len(self.pipeline):
    docs = list(docs)
This was added just for timing purposes. I think the code should simply still run without these two lines though - any chance you can check whether removing them improves things memory-wise? (your timing results will be temporarily wrong but let's worry about that later)
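To illustrate why that line can hurt (a toy sketch, not spaCy's actual code): `list()` materializes every processed doc at once, while lazy iteration only keeps one alive at a time.

```python
import sys

# Toy stand-in for a pipeline stage that yields one large result at a time.
def annotate(texts):
    for text in texts:
        yield [text] * 100_000  # pretend this is a Doc with big tensors attached

texts = [f"doc {i}" for i in range(50)]

# Eager: list() keeps all 50 "docs" in memory simultaneously, so peak memory
# scales with the size of the whole dev set.
eager = list(annotate(texts))
print(f"held at once: {sum(sys.getsizeof(d) for d in eager) / 1e6:.1f} MB")

# Lazy: each "doc" can be released before the next one is produced, so peak
# memory scales with a single doc (plus whatever the scorer retains).
for doc in annotate(texts):
    pass
```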
Oops, yeah that should be done differently. (But I don't understand why this ends up different in the second round than in the first?)
Great! Thanks a lot for the workaround, I will test this and post an update.
Just FYI, the workaround did not work; I still get the same error on this line:
for i, (doc, eg) in enumerate(zip(docs, examples)):
I pulled your fix @adrianeboyd and I still get the same OOM exception at language.py line 1319:
# iterate over final generator
if docs is not None:
    for doc in docs:
        pass
Is this just for timing purposes? Can I safely remove those lines? Thanks!
It isn't just for timing purposes because you're not actually running the final component (which is the NER model you're trying to train) unless you iterate over that generator. (Earlier versions had the scorer iterate over this generator, and the overall goal here was to separate the pipeline timing from the scorer timing.) I think the previous version was still a bit clunky so I've reworked it a bit more. Can you try the updated version here? https://github.com/explosion/spaCy/pull/6386
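For context, here is a rough sketch (not spaCy's actual implementation) of why that final loop matters: pipeline components are chained as lazy generators, so the last component does no work until something consumes its output.

```python
# Toy chain of "pipeline components" built as lazy generators; names are illustrative.
def component(name, docs):
    for doc in docs:
        print(f"{name} processed {doc!r}")
        yield doc

docs = iter(["first doc", "second doc"])
for name in ("tok2vec", "ner"):
    docs = component(name, docs)

# Nothing has been printed yet: no component has actually run.
# Only iterating over the final generator drives the whole chain, which is
# why evaluate() loops over it even though the loop body is just `pass`.
for doc in docs:
    pass
```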
Looking at this again, I think the problem might actually be that the default batch_size (256) is too high for a GPU if you have some longer dev docs. We've trained a fair number of models internally, but we don't have many docs that are over a paragraph or so long. How many dev docs were you using? Were any particularly long? Using my updated PR, is it better if you manually lower the default batch_size in the evaluate() kwargs?
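If you end up calling the API directly rather than going through `spacy train`, passing a smaller batch_size to evaluate() looks roughly like this (the blank pipeline and the value 32 are only illustrative):

```python
import spacy
from spacy.training import Example

# Stand-in pipeline; with your trained model you would use spacy.load(...) instead.
nlp = spacy.blank("en")
examples = [
    Example.from_dict(nlp.make_doc("Resuming training ran out of GPU memory."), {}),
]

# A smaller batch_size means fewer (possibly long) dev docs are pushed through
# the pipeline at once, trading some speed for a lower peak memory footprint.
scores = nlp.evaluate(examples, batch_size=32)
print(scores)
```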
We're also running into some memory issues internally on CPU (for xx models that we haven't published yet), either due to large training corpora or long dev docs, so I'll be looking into a few spots where we can improve memory usage in the near future.
Since this is something that may need to be adjusted and have different defaults for CPU vs. GPU, I think we'll most likely need a way to specify the batch size for evaluate from the config, but I'm not sure exactly how yet. We may need to add a training parameter like eval_batch_size? We'll have to discuss what makes sense...
(And I still don't know what's going on with the differences between training from scratch and resuming.)
I'm using the nightly version. I have successfully trained a transformer-based NER model and saved it; now I'm trying to resume training on it.
First, I'm not sure whether I have set up the config file correctly; the relevant part looks like this:
Now, after trying to train like this:
!python -m spacy train 'config.cfg' --output='model_t' --gpu-id=0 --paths.train train.spacy --paths.dev test.spacy
I'm getting this error message:
I understand the message is telling me I'm out of memory, but it seems weird that I'm able to train from scratch with no issues yet get this error when trying to resume training on the saved model. Any help is appreciated.
Your Environment