alibaba / x-deeplearning

An industrial deep learning framework for high-dimension sparse data
Apache License 2.0
4.26k stars 1.03k forks source link

local模式测试ps_mark_op出core #92

Closed simon1024 closed 4 years ago

simon1024 commented 5 years ago

使用local模式测试ps_mark_op, 在程序结束释放内存时出core 。core并非稳定复现,但是90%以上的情况都出,堆栈可能不同。另外在程序运行的中间也可能出问题,提示:corrupted size vs. prev_size。

分布式场景没有测过,不确定有没有类似问题。麻烦帮忙看看。

代码如下(使用的deepctr例子的样本数据):

def train(options):
    var = xdl.Variable(name="w", dtype=DataType.float, shape=[100,8], vtype = VarType.Hash, initializer=xdl.Ones())

    reader = init_reader("r1", options.input_paths, options.epochs,
                         options.batch_size,options.thread_num)
    batch = reader.read()
    mark_op = xdl.ps_mark_op(
        var_name = "w",
        ids = batch['sparse0'].ids,
        pattern = "g", 
        i = 10) 

    sess = xdl.TrainSession()
    local_step = 0
    while not sess.should_stop():
        for i in range(options.steps):
            ret = sess.run([batch['sparse0'].ids,mark_op])
            print("=========step:%s"%i)
        break   

    print("====================finish=====================")

core堆栈:

(gdb) bt
#0  0x00007f202ddeff45 in _int_free (av=0x7f1e8c000020, p=, have_lock=0) at malloc.c:4005
#1  0x00007f202ddf453c in __GI___libc_free (mem=) at malloc.c:2968
#2  0x00007f1f92990ed4 in ps::Tensor::UnRef (this=0x7f1e8c014f30) at /my-xdl/xdl/xdl/ps-plus/ps-plus/common/tensor.cc:110
#3  0x00007f1f92990a0a in ps::Tensor::~Tensor (this=0x7f1e8c014f30, __in_chrg=) at /my-xdl/xdl/xdl/ps-plus/ps-plus/common/tensor.cc:56
#4  0x00007f1f926e363e in std::default_delete::operator() (this=0x7f1e8c0149d0, __ptr=0x7f1e8c014f30) at /usr/include/c++/5/bits/unique_ptr.h:76
#5  0x00007f1f926e2f13 in std::unique_ptr >::~unique_ptr (this=0x7f1e8c0149d0, __in_chrg=)
    at /usr/include/c++/5/bits/unique_ptr.h:236
#6  0x00007f1f926e47e6 in ps::server::Variable::Slot::~Slot (this=0x7f1e8c0149d0, __in_chrg=) at /my-xdl/xdl/xdl/ps-plus/ps-plus/server/variable.h:35
#7  0x00007f1f926e4806 in std::pair::~pair (this=0x7f1e8c0149c8, __in_chrg=)
    at /usr/include/c++/5/bits/stl_pair.h:96
#8  0x00007f1f926e4832 in __gnu_cxx::new_allocator >::destroy > (this=0x7fff18c12eef, __p=0x7f1e8c0149c8) at /usr/include/c++/5/ext/new_allocator.h:124
#9  0x00007f1f926e4557 in std::allocator_traits > >::destroy > (__a=..., __p=0x7f1e8c0149c8) at /usr/include/c++/5/bits/alloc_traits.h:542
#10 0x00007f1f926e3fd0 in std::__detail::_Hashtable_alloc, true> > >::_M_deallocate_node (this=0x7f1e8c014900, __n=0x7f1e8c0149c0) at /usr/include/c++/5/bits/hashtable_policy.h:1973
#11 0x00007f1f926e3c5a in std::__detail::_Hashtable_alloc, true> > >::_M_deallocate_nodes (this=0x7f1e8c014900, __n=0x0) at /usr/include/c++/5/bits/hashtable_policy.h:1986
#12 0x00007f1f926e3732 in std::_Hashtable, std::allocator >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear (this=0x7f1e8c014900) at /usr/include/c++/5/bits/hashtable.h:1914
#13 0x00007f1f926e3010 in std::_Hashtable, std::allocator >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::~_Hashtable (this=0x7f1e8c014900, __in_chrg=)
    at /usr/include/c++/5/bits/hashtable.h:1225
#14 0x00007f1f926e2a06 in std::unordered_map, std::equal_to, std::allocator > >::~unordered_map (this=0x7f1e8c014900, __in_chrg=) at /usr/include/c++/5/bits/unordered_map.h:98
#15 0x00007f1f926e57de in ps::server::Variable::~Variable (this=0x7f1e8c014870, __in_chrg=) at /my-xdl/xdl/xdl/ps-plus/ps-plus/server/variable.h:29
#16 0x00007f1f926e5842 in std::default_delete::operator() (this=0x7f1e8c009f40, __ptr=0x7f1e8c014870) at /usr/include/c++/5/bits/unique_ptr.h:76
#17 0x00007f1f926e5665 in std::unique_ptr >::~unique_ptr (this=0x7f1e8c009f40, __in_chrg=)
    at /usr/include/c++/5/bits/unique_ptr.h:236
#18 0x00007f1f926e5438 in std::pair > >::~pair (this=0x7f1e8c009f38, 
    __in_chrg=) at /usr/include/c++/5/bits/stl_pair.h:96
#19 0x00007f1f926e5464 in __gnu_cxx::new_allocator > > >::destroy > > > (this=0x7fff18c130bf, __p=0x7f1e8c009f38)
    at /usr/include/c++/5/ext/new_allocator.h:124
#20 0x00007f1f926e523b in std::allocator_traits > > > >::destroy > > > (__a=..., __p=0x7f1e8c009f38)
    at /usr/include/c++/5/bits/alloc_traits.h:542
#21 0x00007f1f926e4eca in std::__detail::_Hashtable_alloc > >, true> > >::_M_deallocate_node (this=0xbfbb60, __n=0x7f1e8c009f30) at /usr/include/c++/5/bits/hashtable_policy.h:1973
#22 0x00007f1f926e4c06 in std::__detail::_Hashtable_alloc > >, true> > >::_M_deallocate_nodes (this=0xbfbb60, __n=0x0) at /usr/include/c++/5/bits/hashtable_policy.h:1986
#23 0x00007f1f926e4954 in std::_Hashtable > >, std::allocator > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear (this=0xbfbb60) at /usr/include/c++/5/bits/hashtable.h:1914
#24 0x00007f1f926e4664 in std::_Hashtable > >, std::allocator > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::~_Hashtable (this=0xbfbb60, __in_chrg=) at /usr/include/c++/5/bits/hashtable.h:1225
#25 0x00007f1f926e429e in std::unordered_map >, std::hash, std::equal_to, std::allocator > > > >::~unordered_map (
    this=0xbfbb60, __in_chrg=) at /usr/include/c++/5/bits/unordered_map.h:98
songyue1104 commented 5 years ago

local模式跑ps_mark_op 和ps_filter_op确实会有问题,这个我们记一下

simon1024 commented 5 years ago

local模式跑ps_mark_op 和ps_filter_op确实会有问题,这个我们记一下

在分布式模式下也跑了下,也core了。非必现,在mark的ids数量大于20000时出的概率很高。 堆栈如下:

(gdb) bt
#0  0x0000000000ce191d in ps::server::udf::ScalarIntegerLogger::SimpleRun (this=0x7f4a4365e100, ctx=0x7f4a449f5080, slices=..., slot_name="g", pval=@0x7f4a4340e1c8: 723)
    at /xdl/xdl/ps-plus/ps-plus/server/udf/scalar_integer_logger.cc:35
#1  0x0000000000ce238b in ps::server::udf::simple_udf_helper::SimpleRunHelper, ps::server::Slices, std::string, long> (udf=0x7f4a4365e100, ctx=0x7f4a449f5080, datas=std::vector of length 3, capacity 3 = {...}) at /xdl/xdl/ps-plus/ps-plus/server/udf/simple_udf_helper.h:160
#2  0x0000000000ce2111 in ps::server::udf::SimpleUdf::Run (this=0x7f4a4365e100, ctx=0x7f4a449f5080)
    at /xdl/xdl/ps-plus/ps-plus/server/udf/simple_udf.h:56
#3  0x0000000000ca6f41 in ps::server::UdfChain::Process (this=0x7f4a4365e280, ctx=0x7f4a449f5080) at /xdl/xdl/ps-plus/ps-plus/server/udf_manager.cc:103
#4  0x0000000000c819a1 in ps::server::Server::RunUdfChain (this=0x7f4b6cc514c0, ver=-2511000256829024200, udf=9212264516796315641, variable_name="emb", 
    inputs=std::vector of length 5, capacity 5 = {...}, ctx=0x7f4a449f5080) at /xdl/xdl/ps-plus/ps-plus/server/server.cc:68
#5  0x0000000000c8b19f in ps::server::ServerService::Process (this=0x7ffec9eb6950, inputs=std::vector of length 8, capacity 8 = {...}, outputs=0x7f4a436212a8)
    at /xdl/xdl/ps-plus/ps-plus/server/server_service.cc:195
#6  0x0000000000c8993e in ps::server::ServerService:: >&, std::vector >*, ps::service::seastar::DoneClosure*)>::operator()(const std::vector > &, std::vector > *, ps::service::seastar::DoneClosure *) const (__closure=0x7f4a449f5300, inputs=std::vector of length 8, capacity 8 = {...}, outputs=0x7f4a436212a8, done=0x7f4a4361d4f0)
    at /xdl/xdl/ps-plus/ps-plus/server/server_service.cc:88
(gdb) f 0
#0  0x0000000000ce191d in ps::server::udf::ScalarIntegerLogger::SimpleRun (this=0x7f4a4365e100, ctx=0x7f4a449f5080, slices=..., slot_name="g", pval=@0x7f4a4340e1c8: 723)
    at /xdl/xdl/ps-plus/ps-plus/server/udf/scalar_integer_logger.cc:35
35            data[slice] = val;
(gdb) p slices
$1 = (const ps::server::Slices &) @0x7f4a4365e5c8: {slice_size = 1, slice_id = std::vector of length 26314, capacity 26314 = {8198, 1009760, 324, 18446744073709551614, 
    230800, 527713, 10705, 18446744073709551614, 4618, 4909, 108, 1009756, 1310, 1009755, 18446744073709551614, 46354, 785, 782, 10695, 527709, 18446744073709551614, 
    246666, 18446744073709551614, 776, 1181, 1009753, 219274, 652, 972, 1175, 213908, 1009748, 388135, 691637, 6390, 18446744073709551614, 18446744073709551614, 
    18446744073709551614, 1588, 4426, 213904, 18446744073709551614, 18446744073709551614, 1009743, 218924, 24033, 1009742, 213877, 2397, 213187, 8663, 502548, 
    18446744073709551614, 746, 742, 220995, 5194, 1009628, 736, 222279, 213872, 213929, 24025, 948, 18446744073709551614, 733, 527686, 39607, 945, 228695, 55, 527934, 
    1196, 235324, 41060, 213918, 1009736, 213917, 18446744073709551614, 1009734, 10658, 523907, 16797, 18446744073709551614, 18446744073709551614, 1009731, 703, 523385, 
    523384, 692, 18446744073709551614, 689, 1009728, 213667, 929, 17, 60917, 26672, 18446744073709551614, 684, 683, 18446744073709551614, 233410, 213900, 3662, 676, 
    213429, 267739, 246634, 328, 523375, 212940, 26660, 810, 527910, 527908, 1203, 803, 3735, 1009723, 219233, 1131, 227841, 3730, 1009722, 5420, 1675, 798, 431, 1193, 
    430, 691625, 791, 1120, 8122, 18446744073709551614, 1009717, 523771, 1113, 4602, 26305, 18446744073709551614, 778, 21768, 18446744073709551614, 775, 212926, 
    18446744073709551614, 1099, 222246, 523764, 18446744073709551614, 763, 388107, 758, 527661, 751, 216499, 219525, 6463, 1506, 738, 46335, 18446744073709551614, 1066, 
    865, 213834, 213491, 1009606, 241560, 1009710, 18446744073709551614, 1009709, 3205, 18446744073709551614, 732, 523862, 18446744073709551614, 18446744073709551614, 
    213449, 1815, 724, 8093, 527888, 1189, 18446744073709551614, 213914, 710, 1009706, 18446744073709551614, 705, 843, 18446744073709551614, 213827, 10648, 1009602, 
    41045, 223411, 697, 527648...}, dim_part = 1, variable = 0x7f4a435b9560, writable = false}

用来查询slot的slice id有很多是18446744073709551614, 导致访问tensor里的数据越界。我自己看了一遍build_hash_slice,应该只会产生正常的id,-1和-2, 不知道这个数是怎么出来的。

guoxinyang commented 5 years ago

18446744073709551614 这个数就是-2,只不过被转成size_t之后显示成这样了,我提了一个pull request,可以fix。麻烦你待会重新pull编译下@simon1024

simon1024 commented 5 years ago

18446744073709551614 这个数就是-2,只不过被转成size_t之后显示成这样了,我提了一个pull request,可以fix。麻烦你待会重新pull编译下@simon1024

赞, 这么快就fix了,一会我试试

simon1024 commented 5 years ago

18446744073709551614 这个数就是-2,只不过被转成size_t之后显示成这样了,我提了一个pull request,可以fix。麻烦你待会重新pull编译下@simon1024

测试通过