Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Still getting "cuda runtime error (2) : out of memory" after changing device to "cpu" in the config file #62

Closed: cheniison closed this issue 4 years ago

cheniison commented 4 years ago

After changing the device property in the config file to "cpu", I ran the command: python train.py conf/train.json

The following error is still raised:

Use dataset to generate dict.
Size of doc_label dict is 3
Size of doc_token dict is 2629
Size of doc_char dict is 2629
Size of doc_token_ngram dict is 0
Size of doc_keyword dict is 0
Size of doc_topic dict is 0
Shrink dict over.
Size of doc_label dict is 3
Size of doc_token dict is 2396
Size of doc_char dict is 2396
Size of doc_token_ngram dict is 0
Size of doc_keyword dict is 0
Size of doc_topic dict is 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 245, in <module>
    train(config)
  File "train.py", line 212, in train
    trainer.train(train_data_loader, model, optimizer, "Train", epoch)
  File "train.py", line 101, in train
    ModeType.TRAIN)
  File "train.py", line 117, in run
    for batch in data_loader:
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 178, in _pin_memory_loop
    batch = pin_memory_batch(batch)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 243, in pin_memory_batch
    return {k: pin_memory_batch(sample) for k, sample in batch.items()}
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 243, in <dictcomp>
    return {k: pin_memory_batch(sample) for k, sample in batch.items()}
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 239, in pin_memory_batch
    return batch.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCCachingHostAllocator.cpp:265

I am using a Chinese multi-label dataset that I generated myself, where each character is one token. No other settings were changed; part of the config is shown below:

{ "task_info":{ "label_type": "multi_label", "hierarchical": false, "hierar_taxonomy": "data/fdqb.taxonomy", "hierar_penalty": 0.000001 }, "device": "cpu", "model_name": "TextCNN", "checkpoint_dir": "checkpoint_dir_rcv1", "model_dir": "trained_model_rcv1", "data": { "train_json_files": [ "data/fdqb_train.json" ], "validate_json_files": [ "data/fdqb_dev.json" ], "test_json_files": [ "data/fdqb_test.json" ] ...

coderbyr commented 4 years ago

Please set visible_device_list in the config to empty; by default it reads the first GPU, which is why the CUDA error still appears even with device set to "cpu".
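Under that suggestion, the relevant lines of conf/train.json would look like the sketch below (whether the empty value should be a string or a list depends on the config schema in your checkout; an empty string is assumed here):

{
  ...
  "device": "cpu",
  "visible_device_list": "",
  ...
}

With visible_device_list left empty and device set to "cpu", no GPU should be selected and the data loader should no longer attempt the CUDA pinned-memory copy that produced the error above.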