apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0
1.99k stars 312 forks source link

Bug(tcmalloc):ALL nodes coredump when doing Bulkload #2006

Open ninsmiracle opened 6 months ago

ninsmiracle commented 6 months ago

Bug Report

  1. What did you do? Doing bulkload (download sst file stage) with any action which need to restart ONE node,may cause ALL nodes coredump. image

  2. What did you see ? There are three kind of coredump in different nodes Type one:

    (gdb) #0  0x00007fc9093669ef in signalHandler(int, siginfo*, void*) ()
    from /opt/soft/openjdk1.8.0/jre/lib/amd64/server/libjvm.so
    #1  <signal handler called>
    #2  0x00007fc9072ea793 in dsn::utils::filesystem::get_normalized_path (
    path=..., npath=...)
    at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:116
    #3  0x00007fc9072ebb4d in dsn::utils::filesystem::path_combine (path1=...,
    path2=...)
    at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:618
    #4  0x00007fc9085696e7 in dsn::replication::replica_bulk_loader::download_sst_file (this=0x5a947a40, remote_dir=..., local_dir=...,
    file_index=<optimized out>, fs=0x275f74d0)
    at /home/work/temp/pegasus/src/rdsn/src/replica/bulk_load/replica_bulk_loader.cpp:460
    #5  0x00007fc9086b7811 in dsn::task::exec_internal (
    this=this@entry=0xa95362d0)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task.cpp:176
    #6  0x00007fc9086ccec2 in dsn::task_worker::loop (this=0x24e31e0)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:224
    #7  0x00007fc9086cd040 in dsn::task_worker::run_internal (this=0x24e31e0)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:204
    #8  0x00007fc90734ba1f in execute_native_thread_routine ()
    from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
    #9  0x00007fc905156dc5 in start_thread () from /lib64/libpthread.so.0
    #10 0x00007fc90365573d in clone () from /lib64/libc.so.6
    (gdb) quit

Type two:

#0  0x00007f8d106371d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) #0  0x00007f8d106371d7 in raise () from /lib64/libc.so.6
#1  0x00007f8d106388c8 in abort () from /lib64/libc.so.6
#2  0x00007f8d12435dbb in tcmalloc::Log (mode=mode@entry=tcmalloc::kCrash,
    filename=filename@entry=0x7f8d1244c14e "src/tcmalloc.cc",
    line=line@entry=332, a=..., b=..., c=..., d=...)
    at src/internal_logging.cc:118
#3  0x00007f8d124281d9 in (anonymous namespace)::InvalidFree (
    ptr=<optimized out>) at src/tcmalloc.cc:332
#4  0x00007f8d16adaef3 in _dl_update_slotinfo ()
   from /lib64/ld-linux-x86-64.so.2
#5  0x00007f8d16aca136 in update_get_addr () from /lib64/ld-linux-x86-64.so.2
#6  0x00007f8d1438e71c in dsn::utils::filesystem::get_normalized_path (
    path=..., npath=...)
    at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:116
#7  0x00007f8d1438fb4d in dsn::utils::filesystem::path_combine (path1=...,
    path2=...)
    at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:618
#8  0x00007f8d1560d6e7 in dsn::replication::replica_bulk_loader::download_sst_file (this=0x35ba0c0, remote_dir=..., local_dir=...,
    file_index=<optimized out>, fs=0x8090b1500)
    at /home/work/temp/pegasus/src/rdsn/src/replica/bulk_load/replica_bulk_loader.cpp:460
#9  0x00007f8d1575b811 in dsn::task::exec_internal (
    this=this@entry=0x8d3964f00)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task.cpp:176
#10 0x00007f8d15770ec2 in dsn::task_worker::loop (this=0x1f32fd0)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:224
#11 0x00007f8d15771040 in dsn::task_worker::run_internal (this=0x1f32fd0)
    at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:204
#12 0x00007f8d143efa1f in execute_native_thread_routine ()
   from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
#13 0x00007f8d121fadc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8d106f973d in clone () from /lib64/libc.so.6
(gdb) quit

Type three:

Program terminated with signal 6, Aborted.
#0  0x00007fad388fa1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) #0  0x00007fad388fa1d7 in raise () from /lib64/libc.so.6
#1  0x00007fad388fb8c8 in abort () from /lib64/libc.so.6
#2  0x00007fad3925ca8d in __gnu_cxx::__verbose_terminate_handler() ()
   from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#3  0x00007fad3925abe6 in ?? ()
   from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#4  0x00007fad3925ac13 in std::terminate() ()
   from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#5  0x00007fad3c6b2a4e in execute_native_thread_routine ()
   from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
#6  0x00007fad3a4bddc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fad389bc73d in clone () from /lib64/libc.so.6
(gdb) quit
  1. What version of Pegasus are you using? v2.4
ruojieranyishen commented 6 months ago

The aforementioned phenomenon is one of issues triggered by bulkload download.

Phenomenon

  1. Doing bulkload (download sst file stage) with any action which need to restart ONE node,may cause ALL nodes coredump.
  2. bulkload file missing also cause many nodes coredump
  3. Phenomenon 1 and 2 both cause tcmalloc report large alloc 2560917504

Reason

After execute the clear_bulk_load_states function, download_sst_file tasks still remain, which causes above phenomenon.

Case 1 With Phenomenon 1

operation:we restart one node.

ballot increase,function clear_bulk_load_states_if_needed() clear 88.5 replica _metadata.files at 15:17:56.753

D2024-05-20 15:17:56.753 (1716189476753079718 146668) replica.replica13.0404000d0000005d: replica_config.cpp:819:update_local_configuration(): 88.5@10.142.98.52:27101: update ballot to init file from 3 to 4 OK 
D2024-05-20 15:17:56.753 (1716189476753147052 146668) replica.replica13.0404000d0000005d:clear_bulk_load_states_if_needed(): [88.5@10.142.98.52:27101] prepare to clear bulk load states, current status = replication::bulk_load_status::BLS_DOWNLOADING
D2024-05-20 15:17:56.753 (1716189476753464144 146668) replica.replica13.0404000d0000005d: replica_config.cpp:1045:update_local_configuration(): 88.5@10.142.98.52:27101: status change replication::partition_status::PS_INACTIVE @ 3 => replication::partition_status::PS_PRIMARY @ 4, pre(1, 0), app(0, 0), duration = 3 ms, replica_configuration(pid=88.5, ballot=4, primary=10.142.98.52:27101, status=3, learner_signature=0, pop_all=0, split_sync_to_child=0)

But at 15:17:56.873, the 88.5 replicais still downloading sst file, cause core.

D2024-05-20 15:17:56.873 (1716189476873362400 146626) replica.default7.04010007000000ca: block_service_manager.cpp:181:download_file(): download file(/home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load/33.sst) succeed, file_size = 65930882, md5 = 7a4d3da9250f52b4e31095c1d7042c2f D2024-05-20 15:17:58.348 (1716189478348326864 146626) replica.default7.04010007000000ca: replica_bulk_loader.cpp:479:download_sst_file(): [88.5@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p32_10G/5 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load,f_meta.name 33.sst

Case 2 With Phenomenon 2

operation:app ingest_p4_10G partition 1,bulkload file missing 88.sst,89.sst,90.sst,93.sst

[general]
app_name           : ingest_p4_10G
app_id             : 95           
partition_count    : 4            
max_replica_count  : 3            

[replicas]
pidx  ballot  replica_count  primary                              secondaries                                                              
0     8       3/3            c3-hadoop-pegasus-tst-st01.bj:27101  [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st05.bj:27101]  
1     7       3/3            c3-hadoop-pegasus-tst-st03.bj:27101  [c3-hadoop-pegasus-tst-st01.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]  
2     8       3/3            c3-hadoop-pegasus-tst-st04.bj:27101  [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]  
3     3       3/3            c3-hadoop-pegasus-tst-st01.bj:27101  [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st04.bj:27101]  

primary replica

primary replica failed to download file(88.sst) ,and stop downloading all sst file.

log.1.txt:E2024-05-22 14:28:11.231 (1716359291231595252 102084) replica.default1.040100090000072b: replica_bulk_loader.cpp:520:download_sst_file(): [95.1@10.142.102.47:27101] failed to download file(88.sst), error = ERR_CORRUPTION

But meta says continue downloading.

D2024-05-22 14:28:18.983 (1716359298983653491 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING

primary replica reports download progress to meta.

D2024-05-22 14:28:18.983 (1716359298983689828 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:879:report_group_download_progress(): [95.1@10.142.102.47:27101] primary = 10.142.102.47:27101, download progress = 89%, status = ERR_CORRUPTION
D2024-05-22 14:28:18.983 (1716359298983703147 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.98.52:27101, download progress = 88%, status=ERR_OK
D2024-05-22 14:28:18.983 (1716359298983714700 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.97.9:27101, download progress = 88%, status=ERR_OK

meta says stop downloading,and clear _metadata.files. However, all download tasks were not terminated successfully.

D2024-05-22 14:28:28.988 (1716359308988487559 102121) replica.replica2.04008ebc00010f46: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING

At 14:28:29, download_sst_file task still exists, access _metadata.files, causing core.

D2024-05-22 14:28:29.529 (1716359309529341231 102089) replica.default6.04010000000007b5: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.102.47:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd1/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 92.sst
F2024-05-22 14:28:29.536 (1716359309536349665 102089) replica.default6.04010000000007b5: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086

secondary replica

Secondary receives primary replica message to cancel the bulkload task and clear _metadata.files, but does not terminate all download tasks, cause core dump.

Other replicas generate core dumps due to this reason, cause many replica server core dump.

D2024-05-22 14:28:28.992 (1716359308992917139 159129) replica.replica2.04006d6800010139: replica_bulk_loader.cpp:183:on_group_bulk_load(): [95.1@10.142.98.52:27101] receive group_bulk_load request, primary address = 10.142.102.47:27101, ballot = 7, **meta bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING**
D2024-05-22 14:28:30.384 (1716359310384983585 159094) replica.default5.040100080000056a: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 20
F2024-05-22 14:28:30.452 (1716359310452229248 159094) replica.default5.040100080000056a: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086

tcmalloc report large alloc

_metadata.files is cleared, cause f_meta.name length in download_sst_file function is very long.

const file_meta &f_meta = _metadata.files[file_index];
const std::string &file_name = utils::filesystem::path_combine(local_dir, f_meta.name);
log.1.txt:F2024-05-20 17:22:49.621 (1716196969621641630 170503) replica.default11.0401000b000000de: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 410828079
log.2.txt:F2024-05-20 17:23:38.595 (1716197018595730772 192879) replica.default10.040100040000002e: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 532715376
log.1.txt:F2024-05-20 17:22:50.77 (1716196970077285996 164438) replica.default11.0401000b000000c6: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 383022703