Closed youan1 closed 6 years ago
This looks like a V2 job. You could try printing the gradients and parameters in event_handler and watch how they change in the log; that gives you some basis for tuning the hyperparameters and the model.
It is a V2 job. So if the gradients get too large, the only current fix is to adjust hyperparameters, right? Which hyperparameters can be adjusted without hurting training speed or accuracy? Reducing batch_size would slow things down.
Same question again: could the Paddle team update the code so that when a gradient exceeds a threshold it gets clipped?
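For reference, the clipping being asked for here can be sketched in a few lines of numpy. This is only an illustration of the idea (element-wise truncation at a threshold, plus the global-norm variant many frameworks use), not Paddle's implementation:

```python
import numpy as np

def clip_by_value(grad, threshold):
    # Element-wise: any component whose magnitude exceeds the
    # threshold is truncated to +/- threshold.
    return np.clip(grad, -threshold, threshold)

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together so their combined L2 norm never
    # exceeds max_norm; gradient directions are preserved.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads]

g = np.array([0.5, -200.0, 3.0])
print(clip_by_value(g, 30.0))          # the -200.0 component is truncated to -30.0
print(clip_by_global_norm([g], 1.0))   # the whole vector is rescaled to norm ~1
```

Value clipping changes the gradient direction when only some components are large; norm clipping keeps the direction but shrinks everything, which is why the two behave differently as hyperparameters.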
Following the solution above, I added error_clipping_threshold, but training is still unstable. With the same network structure it sometimes succeeds and sometimes fails, and the failure is still a floating point overflow. The only thing that changes between runs is the number of machine nodes. Is there any other way to guarantee the model trains stably?
When adding error_clipping_threshold, both where you apply it and the threshold value matter a lot; try tuning them accordingly.
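Concretely, in the v2 configuration style used later in this thread, the threshold is set through ExtraLayerAttribute and attached to a layer via layer_attr. A fragment along those lines, attaching it to the LSTM layer where the crash occurs (the value 10.0 is just a placeholder to tune, not a recommendation):

```python
# Clip the error (gradient) flowing back through this layer during backprop.
layer_attr = ExtraLayerAttribute(
    error_clipping_threshold=10.0,  # placeholder value; tune per layer
    drop_rate=0.5)

lstm_0 = paddle.layer.lstmemory(
    input=hidden_0,
    act=paddle.activation.Relu(),
    gate_act=paddle.activation.Sigmoid(),
    state_act=paddle.activation.Sigmoid(),
    bias_attr=std_0,
    param_attr=lstm_para_attr,
    layer_attr=layer_attr)
```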
I added it to every layer, and the threshold is already fairly small, 30, which has already hurt AUC. And it is still unstable: sometimes training gets through, sometimes it does not.
The instability may come from the hyperparameter settings; for example, you could try reducing the learning rate.
Reducing the learning rate lowers AUC. Is there any other way?
python train.py
I1107 18:23:17.444944 11700 Util.cpp:166] commandline: --use_gpu=False --trainer_count=12
W1107 18:23:17.444993 11700 CpuId.h:112] PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine and could speed up CPU computations via CMAKE .. -DWITH_AVX=ON
I1107 18:23:17.540668 11700 GradientMachine.cpp:85] Initing parameters..
I1107 18:23:17.684885 11700 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 19.568821, {'__sum_evaluator_0__': 0.5740799903869629}
Test with Pass 0, Batch 0, {'__sum_evaluator_0__': 0.5616281628608704}
Pass 0, Batch 2, Cost 60.015236, {'__sum_evaluator_0__': 0.1767834722995758}
Thread [140145163896576] Forwarding __lstmemory_1__,
*** Aborted at 1510050215 (unix time) try "date -d @1510050215" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7f7641980eae) received by PID 11700 (TID 0x7f7616b60700) from PID 1100484270; stack trace: ***
@ 0x7f767a079160 (unknown)
@ 0x7f7641980eae paddle::LstmCompute::backwardOneSequence<>()
@ 0x7f76419811fd paddle::LstmCompute::backwardBatch<>()
@ 0x7f764197dd06 paddle::LstmLayer::backwardBatch()
@ 0x7f764197e39e paddle::LstmLayer::backward()
@ 0x7f7641a088d1 paddle::NeuralNetwork::backward()
@ 0x7f7641a12fd2 paddle::TrainerThread::backward()
@ 0x7f7641a1316d paddle::TrainerThread::computeThread()
@ 0x7f766a59f8a0 execute_native_thread_routine
@ 0x7f767a0711c3 start_thread
@ 0x7f767969912d __clone
@ 0x0 (unknown)
Floating point exception (core dumped)
Why does paddle keep hitting floating point exceptions...
It is fairly normal for a numerical program to hit overflow, especially in sequence-level computation like an LSTM. From the log above, the cost appears to be increasing. I don't know the exact configuration of your LSTM layers (activations, initialization, etc.); the usual first step is some simple hyperparameter tuning to see whether things improve.
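One thing worth checking in that spirit: the config in this thread uses Relu as the LSTM output activation. Unlike Tanh, ReLU is unbounded, so recurrent values can grow geometrically until float32 overflows. The toy recurrence below is not Paddle's LSTM kernel, just an illustration of the bounded-vs-unbounded difference under repeated application:

```python
import numpy as np

def run_recurrence(act, steps=300):
    # Toy recurrence h_t = act(w * h_{t-1} + x) in float32.
    # With a recurrent weight > 1, an unbounded activation lets |h|
    # grow geometrically; a saturating one keeps it bounded.
    h = np.float32(1.0)
    w = np.float32(1.5)
    with np.errstate(over='ignore'):  # let the overflow become inf quietly
        for _ in range(steps):
            h = act(w * h + np.float32(0.1))
    return h

relu = lambda x: np.maximum(x, np.float32(0.0))
tanh = np.tanh

print(run_recurrence(tanh))   # stays bounded in [-1, 1]
print(run_recurrence(relu))   # overflows float32 and becomes inf
```

In the real network the same effect shows up in the backward pass first (gradients scale with activations), which matches the SIGFPE landing in LstmCompute::backwardOneSequence.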
@lcy-seso I adjusted the batch size, trained for 8k batches, and it crashed again. Here is the full code, adapted from the official Semantic Role Labeling demo. The changes:
Could the Paddle team take a look?
# -*- coding: utf-8 -*-
import math, os
import numpy as np
from paddle.trainer_config_helpers import *
import paddle.v2 as paddle
import paddle.v2.evaluator as evaluator
from data_utils import load_vocab, get_char_ids, get_tag_ids

# dict
word_dict = load_vocab('./dict/char.dict', 'gb18030', True, True)
label_dict = load_vocab('./dict/tag.dict', 'gb18030', False, False)
black_dict = load_vocab('./dict/black.dict', 'gb18030', False, False)


def my_data_reader(file_path):
    def reader():
        with open(file_path, 'r') as fdata:
            char_ids, tag_ids = [], []
            for line in fdata:
                line = line.decode('gb18030', 'ignore').strip()
                # ....
                yield char_ids, tag_ids

    return reader


word_dict_len = len(word_dict)
label_dict_len = len(label_dict)

word_dim = 32
hidden_dim = 512
default_std = 1 / math.sqrt(hidden_dim) / 3.0
mix_hidden_lr = 1e-3


def d_type(size):
    return paddle.data_type.integer_value_sequence(size)


def db_lstm():
    # 8 features
    word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))

    emb_para = paddle.attr.Param(name='emb', initial_std=0)
    std_0 = paddle.attr.Param(initial_std=0.)
    std_default = paddle.attr.Param(initial_std=default_std)

    word_input = [word]
    emb_layers = [
        paddle.layer.embedding(size=word_dim, input=x, param_attr=emb_para)
        for x in word_input
    ]

    hidden_0 = paddle.layer.mixed(
        size=hidden_dim,
        bias_attr=std_default,
        input=[
            paddle.layer.full_matrix_projection(
                input=emb, param_attr=std_default) for emb in emb_layers
        ])

    lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
    layer_attr = ExtraLayerAttribute(drop_rate=0.5)
    hidden_para_attr = paddle.attr.Param(
        initial_std=default_std, learning_rate=mix_hidden_lr)

    lstm_0 = paddle.layer.lstmemory(
        input=hidden_0,
        act=paddle.activation.Relu(),
        gate_act=paddle.activation.Sigmoid(),
        state_act=paddle.activation.Sigmoid(),
        bias_attr=std_0,
        param_attr=lstm_para_attr,
        layer_attr=layer_attr)

    # stack L-LSTM and R-LSTM with direct edges
    input_tmp = [hidden_0, lstm_0]
    depth = 2
    for i in range(1, depth):
        mix_hidden = paddle.layer.mixed(
            size=hidden_dim,
            bias_attr=std_default,
            input=[
                paddle.layer.full_matrix_projection(
                    input=input_tmp[0], param_attr=hidden_para_attr),
                paddle.layer.full_matrix_projection(
                    input=input_tmp[1], param_attr=lstm_para_attr)
            ])
        lstm = paddle.layer.lstmemory(
            input=mix_hidden,
            act=paddle.activation.Relu(),
            gate_act=paddle.activation.Sigmoid(),
            state_act=paddle.activation.Sigmoid(),
            reverse=((i % 2) == 1),
            bias_attr=std_0,
            param_attr=lstm_para_attr,
            layer_attr=layer_attr)
        input_tmp = [mix_hidden, lstm]

    feature_out = paddle.layer.mixed(
        size=label_dict_len,
        bias_attr=std_default,
        input=[
            paddle.layer.full_matrix_projection(
                input=input_tmp[0], param_attr=hidden_para_attr),
            paddle.layer.full_matrix_projection(
                input=input_tmp[1], param_attr=lstm_para_attr)
        ], )
    return feature_out


def main():
    paddle.init(use_gpu=False, trainer_count=48)

    # define network topology
    feature_out = db_lstm()
    target = paddle.layer.data(name='target', type=d_type(label_dict_len))
    crf_cost = paddle.layer.crf(
        size=label_dict_len,
        input=feature_out,
        label=target,
        param_attr=paddle.attr.Param(
            name='crfw', initial_std=default_std, learning_rate=mix_hidden_lr))
    crf_dec = paddle.layer.crf_decoding(
        size=label_dict_len,
        input=feature_out,
        label=target,
        param_attr=paddle.attr.Param(name='crfw'))
    evaluator.sum(input=crf_dec)

    #inference_topology = paddle.topology.Topology(layers=crf_dec)
    #with open("inference_topology.pkl", 'wb') as f:
    #    inference_topology.serialize_for_inference(f)

    # create parameters
    parameters = paddle.parameters.create(crf_cost)

    # create optimizer
    optimizer = paddle.optimizer.Momentum(
        momentum=0,
        learning_rate=2e-2,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4),
        model_average=paddle.optimizer.ModelAverage(
            average_window=0.5, max_average_window=10000), )

    trainer = paddle.trainer.SGD(
        cost=crf_cost,
        parameters=parameters,
        update_equation=optimizer,
        extra_layers=crf_dec)

    reader = paddle.batch(
        paddle.reader.shuffle(
            my_data_reader('./data/data.train'), buf_size=8192),
        batch_size=32)
    test_reader = paddle.batch(
        paddle.reader.shuffle(
            my_data_reader('./data/data.dev'), buf_size=8192),
        batch_size=32)

    feeding = {
        'word_data': 0,
        'target': 1
    }

    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 2 == 0:
                print "Pass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
            if event.batch_id % 1000 == 0:
                result = trainer.test(reader=test_reader, feeding=feeding)
                print "\nTest with Pass %d, Batch %d, %s" % (
                    event.pass_id, event.batch_id, result.metrics)
        if isinstance(event, paddle.event.EndPass):
            # save parameters
            with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                trainer.save_parameter_to_tar(f)
            result = trainer.test(reader=test_reader, feeding=feeding)
            print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)

    trainer.train(
        reader=reader,
        event_handler=event_handler,
        num_passes=80,
        feeding=feeding)


if __name__ == '__main__':
    main()
As in the title: training reports the error below. Past solutions fixed it by reducing batch_size, but we have training-speed requirements and cannot make batch_size too small. Is there another way?
Also, we tried changing the activation function to BRelu, but the same problem still occurs.
Thu Sep 7 18:55:22 2017[1,36]: Aborted at 1504781722 (unix time) try "date -d @1504781722" if you are using GNU date
Thu Sep 7 18:55:22 2017[1,36]:PC: @ 0x0 (unknown)
Thu Sep 7 18:55:22 2017[1,36]: SIGFPE (@0x7f77fd251a41) received by PID 51092 (TID 0x7f78034a5700) from PID 18446744073661651521; stack trace:
Thu Sep 7 18:55:22 2017[1,36]: @ 0x7f780307c160 (unknown)
Thu Sep 7 18:55:22 2017[1,36]: @ 0x7f77fd251a41 mkl_blas_avx_sgemm_kernel_0
Thu Sep 7 18:55:24 2017[1,36]:./train.sh: line 239: 51092 Floating point exceptionpython27-gcc482/bin/python conf/trainer_config.conf
Thu Sep 7 18:55:24 2017[1,36]:+ '[' 136 -ne 0 ']'
Thu Sep 7 18:55:24 2017[1,36]:+ kill_pserver2_exit
Thu Sep 7 18:55:24 2017[1,36]:+ ps aux
Thu Sep 7 18:55:24 2017[1,36]:+ grep paddle_pserver2
Thu Sep 7 18:55:24 2017[1,36]:+ grep paddle_cluster_job
Thu Sep 7 18:55:24 2017[1,36]:+ grep -v grep
Thu Sep 7 18:55:24 2017[1,36]:+ cut -c10-14
Thu Sep 7 18:55:24 2017[1,36]:+ xargs kill -9
Thu Sep 7 18:55:24 2017[1,36]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Sep 7 18:55:24 2017[1,36]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Sep 7 18:55:24 2017[1,36]:[./common.sh : 399] [kill_pserver2_exit]
Thu Sep 7 18:55:24 2017[1,36]:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Thu Sep 7 18:55:24 2017[1,36]:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
Thu Sep 7 18:55:24 2017[1,36]:+ get_stack
Thu Sep 7 18:55:24 2017[1,36]:+ set +x
Thu Sep 7 18:55:24 2017[1,36]: