Closed. sagarsj42 closed this issue 3 years ago.
Hmm, it looks like some of the data is incorrect. Can you share the other commands you used and your CUDA version? (E.g., are you installing the correct requirements? Did you extract the features and copy them into the folders?)
I just checked and it works fine for me. Below is what the output should look like:
task: [{'task': 'HM', 'num_choice': 1, 'annotations_jsonpath_train': './data/hm/train.jsonl', 'annotations_jsonpath_dev_seen': './data/hm/dev_seenlong.jsonl', 'annotations_jsonpath_traindev': './data/hm/traindev.jsonl', 'annotations_jsonpath_test_unseen': './data/hm/test_unseenlong.jsonl', 'annotations_jsonpath_test_seen': './data/hm/test_seenlong.jsonl', 'feature_lmdb_path': './data/hm/HM_img.tsv', 'gt_feature_lmdb_path': './data/hm/HM_gt_img.tsv', 'unisex_names_table': './data/vcr/unisex_names_table.csv', 'Proprocessor': 'PreprocessorBasic', 'tokenizer_name': 'FullTokenizer', 'fusion_method': 'mul', 'dropout_rate': 0.1, 'max_seq_len': 128, 'use_gt_fea': True, 'shufflekeep_across_task': True, 'shuffle_every_epoch': True, 'task_weight': 1.0, 'task_prefix': 'hm'}]
/opt/conda/lib/python3.7/site-packages/paddle/fluid/clip.py:779: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
theoretical memory usage:
(18209.21138906479, 19076.31669330597, 'MB')
args.is_distributed: False
W0311 16:59:19.158905 3609 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 60, Driver API Version: 11.0, Runtime API Version: 9.0
W0311 16:59:19.171633 3609 device_context.cc:260] device: 0, cuDNN Version: 7.6.
Load pretraining parameters from ./data/erniesmall/params.
SPLIT: train
Start to load Faster-RCNN detected objects from ./data/hm/HM_img.tsv
Loaded 8596 images in file ./data/hm/HM_img.tsv in 142 seconds.
Start to load Faster-RCNN detected objects from ./data/hm/HM_gt_img.tsv
Loaded 8596 images in file ./data/hm/HM_gt_img.tsv in 104 seconds.
Load 8596 data from split(s) ./data/hm/train.jsonl.
use gt featurre
LEN: 8596
shuffle epoch 0
feed_queue size 30
epoch: 0, progress: 0/0, step: 10, loss: 0.651680, acc: 0.750000
steps: 10
save_steps: 1250
20210311 17:03:40 current learning_rate:0.00000018
used_time: 0.2308497428894043
feed_queue size 30
epoch: 0, progress: 0/0, step: 20, loss: 0.668125, acc: 0.375000
steps: 20
save_steps: 1250
20210311 17:03:42 current learning_rate:0.00000038
used_time: 0.19247126579284668
feed_queue size 30
epoch: 0, progress: 0/0, step: 30, loss: 0.637425, acc: 0.625000
steps: 30
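To compare a run against the reference output above, it can help to pull the step, loss, and accuracy out of the progress lines. The following is only a minimal sketch: the regular expression is written against the line format shown in this log, and the helper name parse_progress is hypothetical, not part of the repository.

```python
import re

# Matches progress lines of the form shown in the log above, e.g.
#   epoch: 0, progress: 0/0, step: 10, loss: 0.651680, acc: 0.750000
PROGRESS_RE = re.compile(
    r"epoch: (\d+), progress: \d+/\d+, step: (\d+), loss: ([\d.]+), acc: ([\d.]+)"
)


def parse_progress(log_text):
    """Return a list of (epoch, step, loss, acc) tuples found in log_text."""
    rows = []
    for match in PROGRESS_RE.finditer(log_text):
        epoch, step, loss, acc = match.groups()
        rows.append((int(epoch), int(step), float(loss), float(acc)))
    return rows


# Lines copied from the reference output above.
sample = """\
epoch: 0, progress: 0/0, step: 10, loss: 0.651680, acc: 0.750000
steps: 10
epoch: 0, progress: 0/0, step: 20, loss: 0.668125, acc: 0.375000
steps: 20
epoch: 0, progress: 0/0, step: 30, loss: 0.637425, acc: 0.625000
"""

for row in parse_progress(sample):
    print(row)
```

If a run prints no lines that match this pattern at all, the failure is happening before the training loop starts.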
Thanks, I reset my environment. The only change I had to make was to my opencv-python version, which differed from the one in requirements.txt (3.4.2.17); that version could not be installed in my previous conda environment with Python 3.8.8. It worked in a fresh environment with Python 3.7, and this issue was resolved.
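A version mismatch like this one can be spotted by comparing the pinned versions with what is installed. The sketch below is illustrative only: parse_pins and find_mismatches are hypothetical helper names, only simple "name==version" lines are handled, and the installed version in the example is a made-up placeholder.

```python
def parse_pins(requirements_text):
    """Collect exact pins ("name==version") from requirements-style text."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every pin not met."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# The pin is the one quoted in this thread; the installed version below is
# a hypothetical placeholder, not taken from the thread.
pins = parse_pins("opencv-python==3.4.2.17\n")
installed = {"opencv-python": "0.0.0"}
print(find_mismatches(pins, installed))
```

In a real environment the installed mapping could be filled from the output of pip freeze, which uses the same "name==version" format.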
However, I am now encountering a new error related to cuDNN. We don't have cuDNN installed on our servers, and we don't have permission to install it. Is there a way to circumvent this requirement for training ERNIE models?
I am pasting a portion of my current error log for reference. (We use CUDA version 10.2.89)
----------- Configuration Arguments -----------
batch_size: 8
checkpoints: output_hm
combine: False
decay_steps: 13308;19962
do_test: False
do_train: True
do_val: False
epoch: 100
ernie_config_path: ./data/ernielarge/ernie_vil.large.json
exp: experiment
feature_size: 2048
fusion_method: sum
hierarchical_allreduce_inter_nranks: 8
init_checkpoint: ./data/ernielarge/params
is_distributed: False
learning_rate: 1e-05
lr_decay_dict_file:
lr_decay_ratio: 0.1
lr_scheduler: manual_warmup_decay
max_img_len: 100
max_seq_len: 128
nccl_comm_num: 1
num_features: 50
num_train_steps: 5000
output_file:
result_file: ./res_tmp
save_steps: 1250
skip_steps: 10
split: train
stop_steps: 2500
subtrain: False
task_group_json: ./conf/hm/task_hm.json
task_name: hm
test_filelist:
test_split: test
train_filelist:
use_cuda: True
use_fast_executor: True
use_fuse: False
use_gpu: True
use_hierarchical_allreduce: False
valid_filelist:
validation_steps: 20000
verbose: False
vocab_path: ./data/ernielarge/vocab.txt
warmup_steps: 500
weight_decay: 0.01
------------------------------------------------
Preparing...
finetuning tasks start
attention_probs_dropout_prob: 0.1
class_attr_size: 401
class_size: 1601
co_hidden_size: 1024
co_intermediate_size: 4096
co_num_attention_heads: 16
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 1024
initializer_range: 0.02
max_position_embeddings: 512
num_attention_heads: 16
num_hidden_layers: 24
sent_type_vocab_size: 4
t_biattention_id: [18, 19, 20, 21, 22, 23]
task_type_vocab_size: 16
type_vocab_size: 2
v_biattention_id: [0, 1, 2, 3, 4, 5]
v_hidden_size: 1024
v_intermediate_size: 4096
v_num_attention_heads: 16
vocab_size: 30522
------------------------------------------------
task: [{'task': 'HM', 'num_choice': 1, 'annotations_jsonpath_train': './data/hm/train.jsonl', 'annotations_jsonpath_dev_seen': './data/hm/dev_seenlong.jsonl', 'annotations_jsonpath_traindev': './data/hm/traindev.jsonl', 'annotations_jsonpath_test_unseen': './data/hm/test_unseenlong.jsonl', 'annotations_jsonpath_test_seen': './data/hm/test_seenlong.jsonl', 'feature_lmdb_path': './data/hm/HM_img.tsv', 'gt_feature_lmdb_path': './data/hm/HM_gt_img.tsv', 'unisex_names_table': './data/vcr/unisex_names_table.csv', 'Proprocessor': 'PreprocessorBasic', 'tokenizer_name': 'FullTokenizer', 'fusion_method': 'mul', 'dropout_rate': 0.1, 'max_seq_len': 128, 'use_gt_fea': True, 'shufflekeep_across_task': True, 'shuffle_every_epoch': True, 'task_weight': 1.0, 'task_prefix': 'hm'}]
2021-03-12 00:22:26,808-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/clip.py:779: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
theoretical memory usage:
(39989.27551259995, 41893.52672748566, 'MB')
args.is_distributed: False
W0312 00:22:34.698298 36551 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 9.0
W0312 00:22:34.699123 36551 dynamic_loader.cc:120] Can not find library: libcudnn.so. The process maybe hang. Please try to add the lib path to LD_LIBRARY_PATH.
W0312 00:22:34.699163 36551 dynamic_loader.cc:179] Failed to find dynamic library: libcudnn.so ( libcudnn.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled.
/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "finetune.py", line 548, in <module>
main(args)
File "finetune.py", line 414, in main
exe.run(startup_prog)
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1154, in _run_impl
use_program_cache=use_program_cache)
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1229, in _run_program
fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::dynload::EnforceCUDNNLoaded(char const*)
3 paddle::platform::CUDADeviceContext::CUDADeviceContext(paddle::platform::CUDAPlace)
4 std::_Function_handler<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > (), std::reference_wrapper<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()> > >::_M_invoke(std::_Any_data const&)
5 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::__future_base::_Result_base::_Deleter>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > >::_M_invoke(std::_Any_data const&)
6 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
7 std::__future_base::_Deferred_state<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >::_M_run_deferred()
8 paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
9 paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
10 paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
11 paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
12 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)
----------------------
Error Message Summary:
----------------------
Error: Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion at (/paddle/paddle/fluid/platform/dynload/cudnn.cc:63)
Hmm, I'm not sure whether PaddlePaddle can run on GPU without cuDNN.
What you could try is installing the unpinned package with pip install paddlepaddle-gpu instead of paddlepaddle-gpu==1.8.3.post97.
Worst case there's also a CPU version, see https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/install_Ubuntu_en.html
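Before switching packages, it may be worth confirming whether a cuDNN library is visible at all, since the log above reports that libcudnn.so could not be found and suggests adding its directory to LD_LIBRARY_PATH. The sketch below only scans the directories listed in LD_LIBRARY_PATH; the dynamic loader also consults other locations (such as the ldconfig cache), so an empty result here is a hint rather than proof. The function name find_shared_library is hypothetical.

```python
import os


def find_shared_library(name, search_path):
    """Return paths of files whose names start with `name` in each directory
    of `search_path` (a string separated by os.pathsep, like LD_LIBRARY_PATH)."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and directories that do not exist
        for entry in sorted(os.listdir(directory)):
            if entry.startswith(name):
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    found = find_shared_library(
        "libcudnn.so", os.environ.get("LD_LIBRARY_PATH", "")
    )
    print(found or "libcudnn.so not found on LD_LIBRARY_PATH")
```

If cuDNN exists somewhere readable on the machine (for example in another user's CUDA toolkit directory), exporting that directory in LD_LIBRARY_PATH, as the log message suggests, does not require install permissions.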
Thanks for your help. The issue was resolved later by using a system with cuDNN. You may close the thread.
I'm trying to train ERNIE models by following the specified procedure. When I start the finetuning code with any of the train commands, I get an error from the py_reader() function. For instance, I'm trying to train ERNIE-Large using the command bash bash/training/ES/hm_ES36.sh. The following is a part of my console output: