Muennighoff / vilio

🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
https://arxiv.org/abs/2012.07788
MIT License

Issue in training ERNIE models: ValueError from py_reader() function #2

Closed. sagarsj42 closed this issue 3 years ago.

sagarsj42 commented 3 years ago

I'm trying to train the ERNIE models by following the specified procedure. When I start the finetune code by running any of the train commands, I get an error from the py_reader() function. For instance, when training ERNIE-Large with the command bash bash/training/ES/hm_ES36.sh, the following is part of my console output:

------------------------------------------------
task:  [{'task': 'HM', 'num_choice': 1, 'annotations_jsonpath_train': './data/hm/train.jsonl', 'annotations_jsonpath_dev_seen': './data/hm/dev_seenlong.jsonl', 'annotations_jsonpath_traindev': './data/hm/traindev.jsonl', 'annotations_jsonpath_test_unseen': './data/hm/test_unseenlong.jsonl', 'annotations_jsonpath_test_seen': './data/hm/test_seenlong.jsonl', 'feature_lmdb_path': './data/hm/HM_img.tsv', 'gt_feature_lmdb_path': './data/hm/HM_gt_img.tsv', 'unisex_names_table': './data/vcr/unisex_names_table.csv', 'Proprocessor': 'PreprocessorBasic', 'tokenizer_name': 'FullTokenizer', 'fusion_method': 'mul', 'dropout_rate': 0.1, 'max_seq_len': 128, 'use_gt_fea': True, 'shufflekeep_across_task': True, 'shuffle_every_epoch': True, 'task_weight': 1.0, 'task_prefix': 'hm'}]
/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.8/site-packages/paddle/fluid/layers/io.py:720: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  logging.warn(
2021-03-11 19:20:57,267 - WARNING - paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
Traceback (most recent call last):
  File "finetune.py", line 548, in <module>
    main(args)
  File "finetune.py", line 352, in main
    test_pyreader, model_outputs  = model_name(
  File "finetune.py", line 85, in create_vcr_model
    pyreader = fluid.layers.py_reader(
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.8/site-packages/paddle/fluid/layers/io.py", line 723, in py_reader
    return _py_reader(
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.8/site-packages/paddle/fluid/layers/io.py", line 444, in _py_reader
    startup_blk.append_op(
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.8/site-packages/paddle/fluid/framework.py", line 3010, in append_op
    _dygraph_tracer().trace_op(type,
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.8/site-packages/paddle/fluid/dygraph/tracer.py", line 43, in trace_op
    self.trace(type, inputs, outputs, attrs,
ValueError: (InvalidArgument) Python object is not type of St10shared_ptrIN6paddle10imperative7VarBaseEE (at /paddle/paddle/fluid/pybind/imperative.cc:221)
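For reference, the warning a few lines up refers to the older fluid.layers.py_reader API; below is a minimal sketch of that pattern and of the fluid.io.DataLoader.from_generator replacement the warning suggests. The shapes and names here are purely hypothetical, not the actual finetune.py arguments.

import paddle.fluid as fluid

# Older pattern (the API the traceback shows finetune.py calling):
pyreader = fluid.layers.py_reader(
    capacity=30,
    shapes=[[-1, 128, 1], [-1, 1]],   # hypothetical: token ids and labels
    dtypes=['int64', 'int64'],
    lod_levels=[0, 0],
    name='hm_reader',
    use_double_buffer=True)

# Replacement suggested by the warning:
src_ids = fluid.data(name='src_ids', shape=[-1, 128, 1], dtype='int64')
labels = fluid.data(name='labels', shape=[-1, 1], dtype='int64')
loader = fluid.io.DataLoader.from_generator(
    feed_list=[src_ids, labels], capacity=30, iterable=True)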
Muennighoff commented 3 years ago

Hmm, it looks like some of the data is incorrect. Can you share the other commands you use and your CUDA version? (i.e. are you installing the correct requirements? Did you extract the features and copy them into the folders?)
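As a quick sanity check of the environment (stock Paddle 1.x calls, nothing specific to this repo), something like the following should confirm the installed version and that the GPU build can see CUDA:

import paddle
import paddle.fluid as fluid

print(paddle.__version__)              # should match the version pinned in requirements
print(fluid.is_compiled_with_cuda())   # True for a paddlepaddle-gpu build
fluid.install_check.run_check()        # runs a tiny program to verify the install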

I just checked and it works fine for me; below is what it should look like:


task:  [{'task': 'HM', 'num_choice': 1, 'annotations_jsonpath_train': './data/hm/train.jsonl', 'annotations_jsonpath_dev_seen': './data/hm/dev_seenlong.jsonl', 'annotations_jsonpath_traindev': './data/hm/traindev.jsonl', 'annotations_jsonpath_test_unseen': './data/hm/test_unseenlong.jsonl', 'annotations_jsonpath_test_seen': './data/hm/test_seenlong.jsonl', 'feature_lmdb_path': './data/hm/HM_img.tsv', 'gt_feature_lmdb_path': './data/hm/HM_gt_img.tsv', 'unisex_names_table': './data/vcr/unisex_names_table.csv', 'Proprocessor': 'PreprocessorBasic', 'tokenizer_name': 'FullTokenizer', 'fusion_method': 'mul', 'dropout_rate': 0.1, 'max_seq_len': 128, 'use_gt_fea': True, 'shufflekeep_across_task': True, 'shuffle_every_epoch': True, 'task_weight': 1.0, 'task_prefix': 'hm'}]
/opt/conda/lib/python3.7/site-packages/paddle/fluid/clip.py:779: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
  warnings.warn("Caution! 'set_gradient_clip' is not recommended "
theoretical memory usage: 
(18209.21138906479, 19076.31669330597, 'MB')
args.is_distributed: False
W0311 16:59:19.158905  3609 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 60, Driver API Version: 11.0, Runtime API Version: 9.0
W0311 16:59:19.171633  3609 device_context.cc:260] device: 0, cuDNN Version: 7.6.
Load pretraining parameters from ./data/erniesmall/params.
SPLIT: train
Start to load Faster-RCNN detected objects from ./data/hm/HM_img.tsv
Loaded 8596 images in file ./data/hm/HM_img.tsv in 142 seconds.
Start to load Faster-RCNN detected objects from ./data/hm/HM_gt_img.tsv
Loaded 8596 images in file ./data/hm/HM_gt_img.tsv in 104 seconds.
Load 8596 data from split(s) ./data/hm/train.jsonl.
use gt featurre
LEN:  8596
shuffle epoch 0
feed_queue size 30
epoch: 0, progress: 0/0, step: 10, loss: 0.651680, acc: 0.750000
steps: 10
save_steps: 1250
20210311 17:03:40 current learning_rate:0.00000018
used_time: 0.2308497428894043
feed_queue size 30
epoch: 0, progress: 0/0, step: 20, loss: 0.668125, acc: 0.375000
steps: 20
save_steps: 1250
20210311 17:03:42 current learning_rate:0.00000038
used_time: 0.19247126579284668
feed_queue size 30
epoch: 0, progress: 0/0, step: 30, loss: 0.637425, acc: 0.625000
steps: 30
sagarsj42 commented 3 years ago

Thanks, I reset my environment. The only change I had to make was that my opencv-python version differed from the one in requirements.txt (3.4.2.17), which could not be installed in my previous conda environment with Python 3.8.8. It worked in a fresh environment with Python 3.7, and this issue was resolved.
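For what it's worth, a quick way to confirm the interpreter and opencv-python versions inside the new environment (illustrative only; the pinned versions come from requirements.txt):

import sys
import cv2

print(sys.version.split()[0])   # expect 3.7.x
print(cv2.__version__)          # expect 3.4.2 for opencv-python==3.4.2.17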

However, I am now encountering a new error related to cuDNN. We don't have cuDNN installed on our servers and don't have permission to install it. Is there a way to circumvent this requirement for training the ERNIE models?

I am pasting a portion of my current error log below for reference. (We use CUDA version 10.2.89.)

-----------  Configuration Arguments -----------
batch_size: 8
checkpoints: output_hm
combine: False
decay_steps: 13308;19962
do_test: False
do_train: True
do_val: False
epoch: 100
ernie_config_path: ./data/ernielarge/ernie_vil.large.json
exp: experiment
feature_size: 2048
fusion_method: sum
hierarchical_allreduce_inter_nranks: 8
init_checkpoint: ./data/ernielarge/params
is_distributed: False
learning_rate: 1e-05
lr_decay_dict_file: 
lr_decay_ratio: 0.1
lr_scheduler: manual_warmup_decay
max_img_len: 100
max_seq_len: 128
nccl_comm_num: 1
num_features: 50
num_train_steps: 5000
output_file: 
result_file: ./res_tmp
save_steps: 1250
skip_steps: 10
split: train
stop_steps: 2500
subtrain: False
task_group_json: ./conf/hm/task_hm.json
task_name: hm
test_filelist: 
test_split: test
train_filelist: 
use_cuda: True
use_fast_executor: True
use_fuse: False
use_gpu: True
use_hierarchical_allreduce: False
valid_filelist: 
validation_steps: 20000
verbose: False
vocab_path: ./data/ernielarge/vocab.txt
warmup_steps: 500
weight_decay: 0.01
------------------------------------------------
Preparing...
finetuning tasks start
attention_probs_dropout_prob: 0.1
class_attr_size: 401
class_size: 1601
co_hidden_size: 1024
co_intermediate_size: 4096
co_num_attention_heads: 16
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 1024
initializer_range: 0.02
max_position_embeddings: 512
num_attention_heads: 16
num_hidden_layers: 24
sent_type_vocab_size: 4
t_biattention_id: [18, 19, 20, 21, 22, 23]
task_type_vocab_size: 16
type_vocab_size: 2
v_biattention_id: [0, 1, 2, 3, 4, 5]
v_hidden_size: 1024
v_intermediate_size: 4096
v_num_attention_heads: 16
vocab_size: 30522
------------------------------------------------
task:  [{'task': 'HM', 'num_choice': 1, 'annotations_jsonpath_train': './data/hm/train.jsonl', 'annotations_jsonpath_dev_seen': './data/hm/dev_seenlong.jsonl', 'annotations_jsonpath_traindev': './data/hm/traindev.jsonl', 'annotations_jsonpath_test_unseen': './data/hm/test_unseenlong.jsonl', 'annotations_jsonpath_test_seen': './data/hm/test_seenlong.jsonl', 'feature_lmdb_path': './data/hm/HM_img.tsv', 'gt_feature_lmdb_path': './data/hm/HM_gt_img.tsv', 'unisex_names_table': './data/vcr/unisex_names_table.csv', 'Proprocessor': 'PreprocessorBasic', 'tokenizer_name': 'FullTokenizer', 'fusion_method': 'mul', 'dropout_rate': 0.1, 'max_seq_len': 128, 'use_gt_fea': True, 'shufflekeep_across_task': True, 'shuffle_every_epoch': True, 'task_weight': 1.0, 'task_prefix': 'hm'}]
2021-03-12 00:22:26,808-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/clip.py:779: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
  warnings.warn("Caution! 'set_gradient_clip' is not recommended "
theoretical memory usage: 
(39989.27551259995, 41893.52672748566, 'MB')
args.is_distributed: False
W0312 00:22:34.698298 36551 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 9.0
W0312 00:22:34.699123 36551 dynamic_loader.cc:120] Can not find library: libcudnn.so. The process maybe hang. Please try to add the lib path to LD_LIBRARY_PATH.
W0312 00:22:34.699163 36551 dynamic_loader.cc:179] Failed to find dynamic library: libcudnn.so ( libcudnn.so: cannot open shared object file: No such file or directory ) 
 Please specify its path correctly using following ways: 
 Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS. 
 For instance, issue command: export LD_LIBRARY_PATH=... 
 Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled.
/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "finetune.py", line 548, in <module>
    main(args)
  File "finetune.py", line 414, in main
    exe.run(startup_prog)
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1154, in _run_impl
    use_program_cache=use_program_cache)
  File "/home2/sagarsj42/anaconda3/envs/vilio_ernie/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1229, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::dynload::EnforceCUDNNLoaded(char const*)
3   paddle::platform::CUDADeviceContext::CUDADeviceContext(paddle::platform::CUDAPlace)
4   std::_Function_handler<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > (), std::reference_wrapper<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()> > >::_M_invoke(std::_Any_data const&)
5   std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::__future_base::_Result_base::_Deleter>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > >::_M_invoke(std::_Any_data const&)
6   std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
7   std::__future_base::_Deferred_state<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >::_M_run_deferred()
8   paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
9   paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
10  paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
11  paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
12  paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
13  paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

----------------------
Error Message Summary:
----------------------
Error: Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion at (/paddle/paddle/fluid/platform/dynload/cudnn.cc:63)
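As a side note, here is a minimal check, independent of Paddle, of whether the dynamic loader can find libcudnn.so on the current LD_LIBRARY_PATH (the library the error above complains about):

import ctypes

try:
    ctypes.CDLL('libcudnn.so')
    print('libcudnn.so found')
except OSError as err:
    print('libcudnn.so not found:', err)
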
Muennighoff commented 3 years ago

Hmm, I'm not sure if PaddlePaddle can run on GPU without cuDNN. What you could try is installing plain paddlepaddle-gpu via pip install paddlepaddle-gpu instead of paddlepaddle-gpu==1.8.3.post97. Worst case, there's also a CPU version, see https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/install_Ubuntu_en.html
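For completeness, the usual fluid place-selection pattern looks roughly like the sketch below. Whether finetune.py wires up use_cuda/use_gpu exactly this way is an assumption, but running on CPU would need something along these lines plus a CPU build of PaddlePaddle:

import paddle.fluid as fluid

use_cuda = fluid.is_compiled_with_cuda()   # False for the CPU-only wheel
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())   # the startup run that failed above when cuDNN was missing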

sagarsj42 commented 3 years ago

Thanks for your help, the issue was resolved after switching to a system with cuDNN. You may close the thread.