Julia-Penfield opened 3 years ago
The basic spacy train training loop only supports one GPU. I think in theory you would want to configure the ray workers so that each was associated with one particular GPU, but I haven't tried this in practice and I'm not sure how difficult it would be.
Taking a quick look at spacy-ray, it looks like there might be a bug in how it sets the GPU ID from CUDA_VISIBLE_DEVICES, but it looks like most of the setup is there for this to work if the workers are configured correctly. But maybe I'm completely misunderstanding how ray manages this in the first place. See this comment:
https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py#L300-L308
The main thing I don't understand is why this bit calls require_gpu(0) rather than require_gpu(gpu_id) after referencing CUDA_VISIBLE_DEVICES:
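Paraphrasing from memory (so this is a sketch of the pattern, not the exact source), it does something like:

import os
import spacy

# the worker derives a GPU ID from the environment...
gpu_id = int(os.environ.get("CUDA_VISIBLE_DEVICES", -1))
if gpu_id >= 0:
    # ...but then always requests device 0 -- I'd have expected spacy.require_gpu(gpu_id) here
    spacy.require_gpu(0)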
I think it would be difficult to use it with a script based around nlp.update vs. using spacy ray train. spacy-ray is still under development and has mostly been tested on CPU, so it's possible it would just require a few small patches/PRs for the GPU support to be improved enough to work in this scenario.
Let us know how it works for you if you try it out!
Thank you for your reply. I think it is time for me to move on to spaCy v3, so I converted my code and data to use spaCy ray train. I am providing some details below before showing you the error I get.
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank('en')

def make_v3_training_data(data):
    failed_record = []
    db = DocBin()
    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot['entities']:
            span = doc.char_span(start, end, label=label, alignment_mode='contract')
            if span is None:
                print('empty entity')  # I expect this to never happen
            else:
                ents.append(span)
        try:
            doc.ents = ents
        except ValueError:  # e.g. overlapping entity spans
            failed_record.append((text, annot))
        db.add(doc)
    return db, failed_record
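For completeness, this is roughly how I call it and write out the .spacy files (TRAIN_DATA and VAL_DATA are just placeholders for my own lists of (text, {'entities': [...]}) pairs):

train_db, train_failed = make_v3_training_data(TRAIN_DATA)
val_db, val_failed = make_v3_training_data(VAL_DATA)
train_db.to_disk('./train.spacy')  # matches paths.train in the config below
val_db.to_disk('./val.spacy')      # matches paths.dev in the config below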
I downloaded the base config file from the spaCy guidelines and edited the first two lines (the train and dev paths):
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "train.spacy"
dev = "val.spacy"
[system]
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[initialize]
vectors = null
Then I used: !python -m spacy init fill-config base_config.cfg config.cfg
It returned: ✔ Auto-filled config with all values ✔ Saved config config.cfg
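(As an aside, I believe the paths can also be overridden on the command line instead of editing the config file, e.g. !python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev val.spacy, but editing the config worked for me.)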
Before trying ray, I executed the following line, because I thought I'd first try plain training without the GPU: !python -m spacy train config.cfg --output ./output
This worked fine and training started. Afterwards, to push the envelope a little, I tried: !python -m spacy train config.cfg --output ./output --gpu-id 0
This worked fine too using the GPU - one GPU, I presume. Eventually it failed due to memory: cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 2,466,849,280 bytes (allocated so far: 10,176,587,264 bytes).
Then I tried the following to use ray: !python -m spacy ray train config.cfg --n-workers 2 --output ./output
However, this time I got: TypeError: create_train_batches() missing 1 required positional argument: 'max_epochs'
At this point, I have a few questions that I am researching to find answers for.
1- Does max_epochs belong in the config file?
2- I share your confusion about the spacy-ray code, since no matter what the GPU ID is, it uses 0. Along those lines, I wonder, once the max_epochs issue is resolved, should I use the command below for GPU: !python -m spacy ray train config.cfg --n-workers 2 --output ./output #--gpu-id 0
In the command above, the issue is that I am not sure how to tell ray which GPU IDs to use. Does it take a list of IDs? Any ideas?
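From what I understand of the ray docs (and I may well be wrong), you normally don't pass GPU IDs yourself; ray reserves one GPU per actor and exposes the assignment to the worker process. A minimal sketch of plain ray, independent of spacy-ray:

import ray

ray.init(num_gpus=4)  # or let ray autodetect the GPUs on the machine

@ray.remote(num_gpus=1)  # reserve one GPU for each actor
class Trainer:
    def which_gpu(self):
        # ray sets CUDA_VISIBLE_DEVICES for this worker process
        return ray.get_gpu_ids()

workers = [Trainer.remote() for _ in range(4)]
print(ray.get([w.which_gpu.remote() for w in workers]))  # e.g. [[0], [1], [2], [3]]

So maybe spacy-ray is supposed to handle this internally rather than taking a list of IDs?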
3- As an irrelevant note to GPUs, I already have a trained word2vec model. In v2, I used the following to load it:
import subprocess
import sys

def load_word_vectors(model_name, word_vectors):
    subprocess.run([sys.executable,
                    "-m",
                    "spacy",
                    "init-model",
                    "en",
                    model_name,
                    "--vectors-loc",
                    word_vectors
                    ])
Do you know how to include it in the config file in v3 so that training does not spend time learning the tok2vec embedding matrix from scratch?
If you have created a model with vectors using spacy init vectors (the v3 CLI command for this), you then specify it via the initialize.vectors setting and set include_static_vectors = true for the relevant components.
I created my word2vec model using gensim.
!pip install --upgrade gensim
from gensim.models.phrases import Phrases, Phraser
#Phrases() takes a list of list of words as input. "txt" is my corpus of text.
sent = [text.split() for text in txt]
#Creates the relevant phrases from the list of sentences:
phrases = Phrases(sent, min_count=30, progress_per=10000)
#The goal of Phraser() is to cut down memory consumption of Phrases(), by discarding model state not strictly needed for the bigram detection task:
bigram = Phraser(phrases)
#Transform the corpus based on the bigrams detected:
sentences = bigram[sent]
import multiprocessing
cores = multiprocessing.cpu_count()  # "cores" is needed below for workers=cores-1
from gensim.models import Word2Vec
w2v_model = Word2Vec(min_count=20,
window=2,
vector_size=300,
sample=6e-5,
alpha=0.03,
min_alpha=0.0007,
negative=20,
workers=cores-1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
w2v_model.save('w2v_model.model')
w2v_model.wv.save_word2vec_format('w2v_model.txt')
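(As a sanity check before converting, I look at the first line of the exported text file - if I understand the word2vec text format correctly, it should just be the vocabulary size and the vector dimension:)

# quick check of the exported vectors file header; prints something like "48291 300"
with open('w2v_model.txt', encoding='utf8') as f:
    print(f.readline().strip())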
In v2, I used the code at the end of my last post to load the word2vec model. I am experimenting to see if I can do something similar in v3.
Yes, just run spacy init vectors. The options are in a slightly different format than v2's spacy init-model, but it's very similar.
Since spacy doesn't actually include any code for training static word vectors, the only way to include them is to use the initialize.vectors setting. But to have them used as features in the pipeline components, you also need to set include_static_vectors = true in the relevant MultiHashEmbed config sections.
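For your case the mapping would be roughly the following (the output directory name is up to you):
v2: python -m spacy init-model en ./vectors_model --vectors-loc w2v_model.txt
v3: python -m spacy init vectors en w2v_model.txt ./vectors_model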
Thanks for that! I really appreciate your support.
I read the guidelines and successfully converted the word2vec txt model to the v3 format, which is saved in a folder called "vocab", using:
!python -m spacy init vectors en w2v_model.txt ./
I also found the include_static_vectors field in the base_config.cfg file and changed it to True:
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true
Interestingly, in the last line, "True" returned an error and "true" worked.
How could I use initialize.vectors? Should I change the base_config.cfg file as follows?
[initialize]
vectors = "./vocab"
It seems to be working, and the NER loss is lower than it used to be. Is there any way/test to ensure that my word vectors are being used as the initial values?
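One check I am thinking of trying (my own idea, not an official test) is to load the trained pipeline and inspect the static vectors table directly:

import spacy

nlp_trained = spacy.load('./output/model-last')
print(nlp_trained.vocab.vectors.shape)           # should match my w2v table, e.g. (vocab_size, 300)
print(nlp_trained.vocab['protein'].has_vector)   # 'protein' is just a word I know is in my w2v vocab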
Going forward, I will shift my focus to ray. In the first phase, I will try multi-CPU training using the following command. Once I get this working, I will try multi-GPU:
!python -m spacy ray train config.cfg --n-workers 2 --output ./output
At this time, the problem is that I am getting the following error:
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-18 18:29:29,031 INFO resource_spec.py:231 -- Starting Ray with 35.16 GiB memory available for workers and up to 17.6 GiB for objects. You can adjust these settings with ray.init(memory=
I posted it as a separate issue. Once I get it resolved, I will proceed with a test for multiple GPUs.
I'm having the same issue regarding the error TypeError: create_train_batches() missing 1 required positional argument: 'max_epochs'
I've looked at nvidia-smi after using spacy-ray with --n-workers 4, and I see the memory usage on all of my GPUs spiking, so it seems like something is happening on all of them; however, after a while the training fails with that error message.
@thejamesmarq I am happy that it was not just me. Also, glad that someone else is working towards multi-GPU training.
@adrianeboyd fixed the max_epochs issue about 10 hours ago (see https://github.com/explosion/spaCy/issues/8137) and released a new version of spacy-ray. I reinstalled and the max_epochs problem is gone. If they had a Patreon, I would not hesitate to contribute for a second!
I have not tried the multi-GPU route yet, but the multi-worker CPU setup using ray seems to be working! I cannot say whether it increased the training speed, though - I need to dig in further. @thejamesmarq Could I ask you to give it a shot and let me know if spacy-ray v0.1.2 works for you - with and without GPUs? I am a little surprised that all 4 of your GPUs were active, as the workers are for CPUs as far as I understand, not GPUs. Did you use the following command?
!python -m spacy ray train config.cfg --n-workers 2 --output ./output --gpu-id 0
Tried this out on a machine with 4 GPUs, using --n-workers 4, and while the job does run now, it looks like only one GPU is being used, although four processes are created. Memory on each GPU is occupied, but utilization was at 0% for all except one GPU.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 56W / 300W | 1757MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 45C P0 56W / 300W | 1774MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 44C P0 61W / 300W | 1782MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 48C P0 59W / 300W | 1818MiB / 16160MiB | 10% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 82467 C ray::Worker 1755MiB |
| 1 N/A N/A 82460 C ray::Worker 1771MiB |
| 2 N/A N/A 82464 C ray::Worker 1779MiB |
| 3 N/A N/A 82469 C ray::Worker 1815MiB |
+-----------------------------------------------------------------------------+
I'm wondering if that is at all related to how ray_train in https://github.com/explosion/spacy-ray/blob/master/spacy_ray/train_cli.py uses ray.remote:
RemoteWorker = ray.remote(Worker).options(num_gpus=int(use_gpu >= 0), num_cpus=2)
I believe this will use at most 1 GPU, although I could be wrong.
Also, I notice that the Worker in https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py uses gpu_id = int(os.environ.get("CUDA_VISIBLE_DEVICES", -1)). Does this mean that if we don't have that variable set, it will default to using the CPU (with the default of -1)? I'm wondering if we might instead specify the GPU ID directly, for example as a CLI option.
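For what it's worth, ray seems to expose the assigned device to the worker process itself, so instead of a new CLI option the worker could perhaps do something like this (untested, just an idea):

import ray
import spacy

# inside the ray worker/actor:
gpu_ids = ray.get_gpu_ids()  # e.g. [2] if ray scheduled this actor on GPU 2
if gpu_ids:
    # ray has already restricted CUDA_VISIBLE_DEVICES for this process, so
    # (if I understand it correctly) device 0 here maps to the assigned physical GPU
    spacy.require_gpu(0)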
Gotcha!
I think we could specify the GPU ID using the CLI option like you mentioned:
!python -m spacy ray train config.cfg --n-workers 4 --output ./output --gpu-id 0
However, there are two issues:
1) It seems that only one GPU ID can be provided, and 2) as @adrianeboyd mentioned before, regardless of which GPU ID is passed on the command line, only ID 0 is used, as per: a) https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py#L300-L308 b) https://github.com/explosion/spacy-ray/blob/75cdb637411529f7f0a41c723a8ab71cbae9cc79/spacy_ray/worker.py#L251-L259
Could you try one more time using the CLI command and see if changing the gpu-id makes a difference in terms of which GPU is used? I will do the same on my end.
(No need for patreon: this is my job!)
For local testing I only have one GPU, so I may not be much immediate help. The spacy train CLI doesn't have a way to specify multiple GPUs, so if this is going to work, I think it's most likely that you'd use -g 0 to enable GPU in the CLI in general and then check how ray sets CUDA_VISIBLE_DEVICES for the workers. If ray manages the GPU IDs that way, then you'd just need to update the function that sets the GPU ID with spacy.require_gpu to use the one provided by ray.
Thanks @adrianeboyd. I actually did not know it's your job. You're great at it!! :)
I have two questions and 4 reports for you. I really hope these reports are helpful for the further development of spaCy v3, given how amazing it is!
Q1) What am I doing wrong in using -g 0? Below is my CLI: !python -m spacy ray train -g 0 config.cfg --n-workers 8 --output ./output
The error is: ✘ Invalid config override '0': name should start with --
Also, isn't -g 0 a spaCy v2 notion? The spaCy v3 guidelines say to use --gpu-id.
Q2) Isn't the 0 in "-g 0" referring to the GPU ID? How is that different from using --gpu-id, as in: !python -m spacy ray train config.cfg --n-workers 8 --output ./output --gpu-id 0
I ran 4 main experiments (numbered below). All experiments were executed on an ml.p3.8xlarge AWS EC2 instance with 4 GPUs (16 GB each, 64 GB of GPU memory in total), 32 vCPUs, and 244 GB of RAM.
Experiment 1 (CPU only, no ray): !python -m spacy train config.cfg --output ./output
Here is the output:
Starting time: 18:41:46
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
=========================== Initializing pipeline ===========================
[2021-05-20 18:41:47,737] [INFO] Set up nlp object from config
[2021-05-20 18:41:47,751] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-05-20 18:41:47,755] [INFO] Created vocabulary
[2021-05-20 18:41:48,137] [INFO] Added vectors: ./vocab
[2021-05-20 18:41:48,137] [INFO] Finished initializing nlp object
[2021-05-20 18:43:51,149] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E  #     LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
0  0     0.00          848.15    0.01    0.01    0.01    0.00
0  200   3862.07       15785.41  18.75   34.10   12.93   0.19
0  400   517.14        1764.46   36.33   45.66   30.17   0.36
0  600   4547.08       1981.70   36.27   49.80   28.53   0.36
0  800   1288.11       1814.77   38.01   52.13   29.91   0.38
0  1000  479.66        1090.03   45.89   54.09   39.85   0.46
0  1200  645.63        870.98    46.93   55.09   40.87   0.47
0  1400  996.74        841.99    46.35   56.44   39.32   0.46
0  1600  6994.02       976.14    39.57   62.33   28.99   0.40
0  1800  2305.24       1002.07   46.91   55.71   40.51   0.47
0  2000  658.52        625.77    48.02   55.59   42.26   0.48
0  2200  1872.88       782.63    50.34   55.98   45.73   0.50
0  2400  1100.66       634.23    48.51   65.04   38.68   0.49
0  2600  3185.53       855.52    49.80   64.04   40.74   0.50
0  2800  24985.04      943.65    44.04   54.52   36.95   0.44
0  3000  7799.75       996.41    48.51   64.74   38.78   0.49
0  3200  3739.99       662.83    52.15   66.79   42.77   0.52
0  3400  1390.63       598.66    45.86   52.33   40.82   0.46
1  3600  2314.42       533.74    49.05   69.72   37.83   0.49
1  3800  1527.68       454.00    46.90   53.44   41.79   0.47
1  4000  4403.58       561.79    51.23   71.50   39.92   0.51
1  4200  2051.66       559.38    44.76   45.04   44.49   0.45
1  4400  3131.92       651.52    49.34   70.12   38.06   0.49
1  4600  3381.79       602.43    39.74   33.66   48.50   0.40
1  4800  1502.89       601.55    49.72   58.03   43.49   0.50
✔ Saved pipeline to output directory output/model-last
Ending time: 20:32:09
Total elapsed time: 1.84 hours
CPU:
top - 18:47:27 up 31 min, 0 users, load average: 1.00, 0.94, 0.90
Tasks: 359 total, 2 running, 223 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.0%us, 0.4%sy, 0.0%ni, 95.6%id, 0.5%wa, 0.0%hi, 0.0%si, 1.4%st
Mem: 251745828k total, 10681440k used, 241064388k free, 209764k buffers
Swap: 0k total, 0k used, 0k free, 3111976k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28251 ec2-user 20 0 27.4g 5.6g 198m R 100.5 2.3 5:44.80 python
25540 ec2-user 20 0 616m 50m 13m S 2.0 0.0 0:01.72 python
1 root 20 0 19780 2640 2200 S 0.0 0.0 0:10.63 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
5 root 20 0 0 0 0 I 0.0 0.0 0:00.64 kworker/u256:0
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
8 root 20 0 0 0 0 I 0.0 0.0 0:00.61 rcu_sched
9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
10 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/0
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
15 root RT 0 0 0 0 S 0.0 0.0 0:00.36 migration/1
top - 18:47:30 up 31 min, 0 users, load average: 1.00, 0.94, 0.90
Tasks: 359 total, 2 running, 223 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.2%us, 0.0%sy, 0.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 251745828k total, 10681596k used, 241064232k free, 209764k buffers
GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 35C P0 52W / 300W | 312MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 33C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 32C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 33C P0 38W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 28251 C python 309MiB |
+-----------------------------------------------------------------------------+
Sorry for the table format; I do not know how to paste it neatly like @thejamesmarq did.
Execution-wise, so far so good.
Experiment 2 (single GPU, no ray): !python -m spacy train config.cfg --output ./output --gpu-id 0
Here is the output:
Starting time: 20:18:14
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2021-05-20 20:18:16,231] [INFO] Set up nlp object from config
[2021-05-20 20:18:16,245] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-05-20 20:18:16,249] [INFO] Created vocabulary
[2021-05-20 20:18:16,719] [INFO] Added vectors: ./vocab
[2021-05-20 20:18:16,720] [INFO] Finished initializing nlp object
[2021-05-20 20:20:40,810] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E  #  LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
⚠ Aborting and saving the final best model. Encountered exception:
OutOfMemoryError('Out of memory allocating 3,327,510,528 bytes (allocated so
far: 13,736,083,968 bytes).',)
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/__main__.py", line 4, in <module>
Experiment 3 (ray, CPU only, 8 workers): !python -m spacy ray train config.cfg --n-workers 8 --output ./output
Here is the output:
Starting time: 20:56:09
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 20:56:10,996 INFO resource_spec.py:231 -- Starting Ray with 157.81 GiB memory available for workers and up to 71.63 GiB for objects. You can adjust these settings with ray.init(memory=
PID    MEM       COMMAND
50810  31.99GiB  ray::Worker
50818  29.35GiB  ray::Worker
50800  29.24GiB  ray::Worker
50788  29.23GiB  ray::Worker
50803  28.99GiB  ray::Worker
50791  28.3GiB   ray::Worker
50783  26.65GiB  ray::Worker
50802  6.46GiB   ray::Worker
50718  1.58GiB   /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/core/src/ray/thirdparty/redis/
50738  0.43GiB   /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/core/src/ray/raylet/raylet r
In addition, up to 11.08 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray.
Tip: Use the ray memory command to list active objects in the cluster.
2021-05-20 21:30:22,277 ERROR worker.py:1074 -- Possible unhandled error from worker:
ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: _cache
During handling of the above exception, another exception occurred:
ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1515, in wrapper
return fun(self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 452, in wrapper
return fun(self)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1557, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 713, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/74685/stat'
During handling of the above exception, another exception occurred:
ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/__init__.py", line 371, in _init
self.create_time()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/__init__.py", line 727, in create_time
self._create_time = self._proc.create_time()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1515, in wrapper
return fun(self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1727, in create_time
ctime = float(self._parse_stat_file()['create_time'])
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1522, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=74685)
During handling of the above exception, another exception occurred:
....
....
....
Julia: This error was too long to copy entirely
....
....
....
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/__main__.py", line 4, in <module>
CPU:
top - 20:26:13 up 24 min, 0 users, load average: 9.45, 5.35, 2.89
Tasks: 398 total, 2 running, 261 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.0%us, 0.8%sy, 0.0%ni, 90.8%id, 0.5%wa, 0.0%hi, 0.0%si, 1.8%st
Mem: 251745828k total, 43391664k used, 208354164k free, 212780k buffers
Swap: 0k total, 0k used, 0k free, 4250572k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24491 ec2-user 20 0 78.3g 4.9g 565m S 202.5 2.1 10:14.52 ray::Worker
24476 ec2-user 20 0 77.5g 4.0g 492m R 169.1 1.7 6:52.37 ray::Worker
24488 ec2-user 20 0 78.6g 5.1g 731m S 147.5 2.1 5:21.48 ray::Worker
24465 ec2-user 20 0 78.6g 5.2g 797m S 137.6 2.2 5:20.63 ray::Worker
24485 ec2-user 20 0 78.6g 5.4g 985m S 135.7 2.2 5:22.22 ray::Worker
24458 ec2-user 20 0 78.4g 5.0g 798m S 121.9 2.1 5:30.25 ray::Worker
24460 ec2-user 20 0 78.6g 5.2g 730m S 119.9 2.2 5:20.65 ray::Worker
24457 ec2-user 20 0 78.6g 5.1g 752m S 112.1 2.1 5:21.96 ray::Worker
24417 ec2-user 20 0 145g 113m 32m S 25.6 0.0 0:32.22 raylet
24390 ec2-user 20 0 838m 42m 9960 S 11.8 0.0 0:28.25 gcs_server
24386 ec2-user 20 0 354m 132m 7652 S 3.9 0.1 0:03.09 redis-server
24477 ec2-user 20 0 72.9g 61m 28m S 3.9 0.0 0:05.03 ray::IDLE
8958 root 20 0 6524 96 0 S 2.0 0.0 0:03.62 rngd
24243 ec2-user 20 0 5891m 172m 86m S 2.0 0.1 0:15.80 python
24330 ec2-user 20 0 94.6g 183m 100m S 2.0 0.1 0:18.47 python -m spacy
24419 ec2-user 20 0 224m 54m 24m S 2.0 0.0 0:14.73 /home/ec2-user/
24462 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:05.00 ray::IDLE
top - 20:26:16 up 24 min, 0 users, load average: 9.45, 5.35, 2.89
Tasks: 398 total, 4 running, 259 sleeping, 0 stopped, 0 zombie
Cpu(s): 31.2%us, 2.6%sy, 0.0%ni, 65.4%id, 0.0%wa, 0.0%hi, 0.4%si, 0.5%st
Mem: 251745828k total, 43732944k used, 208012884k free, 212780k buffers
Swap: 0k total, 0k used, 0k free, 4266300k cached
GPU:
Thu May 20 20:27:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC. |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 31C P0 37W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 31C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 32C P0 40W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 33C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Experiment 4 (ray with GPU, 8 workers): !python -m spacy ray train config.cfg --n-workers 8 --output ./output --gpu-id 0
Here is the output:
Starting time: 20:20:36
ℹ Using GPU: 0
2021-05-20 20:20:38,393 INFO resource_spec.py:231 -- Starting Ray with 157.86 GiB memory available for workers and up to 71.67 GiB for objects. You can adjust these settings with ray.init(memory=
1- ray does not seem to be working for me. It started and successfully completed a few iterations, as shown in experiment 3, but then it failed. When I use a higher number of vCPUs with ray in a setup similar to experiment 3, for example 20 cores, it crashes right from the start with the following error.
Starting time: 22:29:04
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 22:29:06,082 INFO resource_spec.py:231 -- Starting Ray with 157.76 GiB memory available for workers and up to 71.62 GiB for objects. You can adjust these settings with ray.init(memory=
Output of re-execution of experiment 3 (ray without GPU) with only 4 cores:
Starting time: 21:01:57
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 21:01:59,277 INFO resource_spec.py:231 -- Starting Ray with 157.86 GiB memory available for workers and up to 71.65 GiB for objects. You can adjust these settings with ray.init(memory=
CPU:
top - 22:30:43 up 2:19, 0 users, load average: 2.62, 2.58, 2.74
Tasks: 404 total, 1 running, 262 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.6%us, 1.0%sy, 0.0%ni, 90.7%id, 0.1%wa, 0.0%hi, 0.1%si, 0.4%st
Mem: 251745828k total, 68348844k used, 183396984k free, 1134280k buffers
Swap: 0k total, 0k used, 0k free, 4776260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37307 ec2-user 20 0 99.5g 24g 1.3g S 167.2 10.4 149:22.53 ray::Worker
37306 ec2-user 20 0 81.0g 8.0g 1.3g S 133.8 3.3 123:08.39 ray::Worker
37298 ec2-user 20 0 85.7g 12g 1.3g S 17.7 5.4 51:21.52 ray::Worker
37328 ec2-user 20 0 86.2g 13g 1.3g S 13.8 5.5 51:03.83 ray::Worker
37217 ec2-user 20 0 838m 40m 9728 S 9.8 0.0 5:21.60 gcs_server
37233 ec2-user 20 0 145g 222m 117m S 3.9 0.1 5:00.49 raylet
37235 ec2-user 20 0 224m 54m 24m S 3.9 0.0 2:25.46 /home/ec2-user/
9229 ec2-user 20 0 616m 50m 13m S 2.0 0.0 0:00.76 python
37053 ec2-user 20 0 5891m 172m 86m S 2.0 0.1 1:56.94 python
37148 ec2-user 20 0 94.6g 183m 101m S 2.0 0.1 2:19.44 python -m spacy
37205 ec2-user 20 0 177m 11m 7836 S 2.0 0.0 0:47.88 redis-server
37276 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:45.10 ray::IDLE
37277 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.12 ray::IDLE
37278 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.32 ray::IDLE
37281 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.32 ray::IDLE
37285 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.29 ray::IDLE
37287 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:45.49 ray::IDLE
top - 22:30:46 up 2:19, 0 users, load average: 2.62, 2.58, 2.74
Tasks: 404 total, 1 running, 262 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.9%us, 1.6%sy, 0.0%ni, 90.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.1%st
Mem: 251745828k total, 69199252k used, 182546576k free, 1134280k buffers
GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 30C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 31C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 31C P0 38W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 31C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2- It looks like there are issues in spaCy v3 regarding using the GPU, whether or not ray is used. I got 2 different kinds of error, one with ray and one without ray, as described in experiments #3 and #4. I understand that the issue may be on my end, not spaCy's. But I successfully ran the training on GPU using spaCy v2, as described at the beginning of this thread. Any ideas? At this point, I cannot even get one GPU working :)
3- I re-executed experiment 1 to see if the results are reproducible. I am happy to say that I got the exact same result, with a slightly different execution time, which is understandable.
4- When I use ray, it "appears" that the word2vec initialization is ignored. The reason I think that is by comparing the tok2vec loss in experiment 1 and experiment 3. When I use ray in experiment 3, the loss starts at a significantly higher value. Does ray have a different initialization mechanism?
5- When I used 4 cores in the re-execution of experiment 3, the model took a lot longer to train than experiment 1 using one core - neither experiment used the GPU, as I have not succeeded in using the GPU in spaCy v3 yet. As shown in the results of experiment 1, with one core the training took 1.84 hours. However, with 4 cores, it took 3.21 hours. This is a counter-intuitive outcome, even if we only consider the same number of iterations as the single-core execution in experiment 1 - I checked on my end and it took twice as long for the same number of iterations! This gave me the idea that maybe the issue is with ray. So I re-executed experiment 3 (ray without GPU) with one core only, and it took a similar time (under 2 hours) to experiment 1, which does not use ray - the accuracies were also different, but I get that part.
In conclusion, something is fishy when using ray, as the more cores I use, the slower the execution gets!! I checked ray using 2 cores and it took just over 2 hours! :) Are there any known issues with ray that are being worked on for the v3.1 release?
6- If you teach me how to use -g 0, I am happy to re-run the 2 experiments with GPU and see if multiple GPUs will be used. Though at this point, even one GPU runs out of memory!
Please let me know if there are any experiments that you're interested in. I am happy to assist you.
Thanks! -Jules
Any updates on this?
Hi @delucca, not yet, unfortunately.
Hello, I am training my NER model using the following code:
Start of Code
End of Code
The problem:
The issue is that each iteration takes about 30 minutes - I have 8,000 training records, which include very long texts, and 6 labels.
So I was hoping to reduce it by using more GPU cores, but it seems that only one core is being used - when I execute print(util.gpu) in the code above, only the first core returns a non-zero value.
Question 1: Is there any way I could use more GPU cores in the training process to make it faster? I would appreciate any leads.
Edit: After some more research, it seems that spacy-ray is intended to enable parallel training. But I cannot find documentation on using Ray with nlp.update; all I find is about using "python -m spacy ray train config.cfg --n-workers 2." Question 2: Does Ray enable parallel processing using GPUs, or is it only for CPU cores? Question 3: How could I integrate Ray into the Python code I have using nlp.update, as opposed to using "python -m spacy ray train config.cfg --n-workers 2"?
Thank you!
Environment
All of the code above is in one conda_python3 notebook on AWS Sagemaker using ml.p3.2xlarge EC2 instance. Python Version Used: 3 spaCy Version Used: 3.0.6