This issue is one of several triggered during the bulk load download stage. The typical symptom is a huge allocation warning such as:
large alloc 2560917504
After the clear_bulk_load_states function executes, download_sst_file tasks still remain, and their access to the cleared states causes the crash.
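The race can be reproduced outside Pegasus. Below is a minimal standalone sketch (hypothetical types, not the actual Pegasus source) of the pattern: a background download task holds a reference into a std::vector while another thread clears it, which is undefined behavior.

```cpp
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

struct file_meta {
    std::string name;
    int64_t size = 0;
};

struct bulk_load_metadata {
    std::vector<file_meta> files;
};

int main() {
    bulk_load_metadata _metadata;
    _metadata.files.push_back({"33.sst", 65930882});

    std::thread downloader([&] {
        // download_sst_file-style access: takes a reference into the vector.
        const file_meta &f_meta = _metadata.files[0];
        // If the other thread clears the vector first, f_meta dangles and
        // f_meta.name.size() can return garbage (hundreds of millions).
        volatile size_t len = f_meta.name.size(); // undefined behavior after clear()
        (void)len;
    });

    // clear_bulk_load_states-style cleanup, with no synchronization against
    // the in-flight download task above.
    _metadata.files.clear();

    downloader.join();
}
```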
Operation: we restart one node.
The ballot increases, and clear_bulk_load_states_if_needed() clears _metadata.files of replica 88.5 at 15:17:56.753:
D2024-05-20 15:17:56.753 (1716189476753079718 146668) replica.replica13.0404000d0000005d: replica_config.cpp:819:update_local_configuration(): 88.5@10.142.98.52:27101: update ballot to init file from 3 to 4 OK
D2024-05-20 15:17:56.753 (1716189476753147052 146668) replica.replica13.0404000d0000005d:clear_bulk_load_states_if_needed(): [88.5@10.142.98.52:27101] prepare to clear bulk load states, current status = replication::bulk_load_status::BLS_DOWNLOADING
D2024-05-20 15:17:56.753 (1716189476753464144 146668) replica.replica13.0404000d0000005d: replica_config.cpp:1045:update_local_configuration(): 88.5@10.142.98.52:27101: status change replication::partition_status::PS_INACTIVE @ 3 => replication::partition_status::PS_PRIMARY @ 4, pre(1, 0), app(0, 0), duration = 3 ms, replica_configuration(pid=88.5, ballot=4, primary=10.142.98.52:27101, status=3, learner_signature=0, pop_all=0, split_sync_to_child=0)
But at 15:17:56.873, replica 88.5 is still downloading SST files, which causes the core dump:
D2024-05-20 15:17:56.873 (1716189476873362400 146626) replica.default7.04010007000000ca: block_service_manager.cpp:181:download_file(): download file(/home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load/33.sst) succeed, file_size = 65930882, md5 = 7a4d3da9250f52b4e31095c1d7042c2f
D2024-05-20 15:17:58.348 (1716189478348326864 146626) replica.default7.04010007000000ca: replica_bulk_loader.cpp:479:download_sst_file(): [88.5@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p32_10G/5 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load,f_meta.name 33.sst
Operation: bulk load app ingest_p4_10G; for partition 1, the bulk load files 88.sst, 89.sst, 90.sst, and 93.sst are missing.
[general]
app_name : ingest_p4_10G
app_id : 95
partition_count : 4
max_replica_count : 3
[replicas]
pidx ballot replica_count primary secondaries
0 8 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st05.bj:27101]
1 7 3/3 c3-hadoop-pegasus-tst-st03.bj:27101 [c3-hadoop-pegasus-tst-st01.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
2 8 3/3 c3-hadoop-pegasus-tst-st04.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
3 3 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st04.bj:27101]
The primary replica fails to download file 88.sst and stops downloading all SST files:
log.1.txt:E2024-05-22 14:28:11.231 (1716359291231595252 102084) replica.default1.040100090000072b: replica_bulk_loader.cpp:520:download_sst_file(): [95.1@10.142.102.47:27101] failed to download file(88.sst), error = ERR_CORRUPTION
But the meta server still tells it to continue downloading:
D2024-05-22 14:28:18.983 (1716359298983653491 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
The primary replica reports the group download progress to the meta server:
D2024-05-22 14:28:18.983 (1716359298983689828 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:879:report_group_download_progress(): [95.1@10.142.102.47:27101] primary = 10.142.102.47:27101, download progress = 89%, status = ERR_CORRUPTION
D2024-05-22 14:28:18.983 (1716359298983703147 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.98.52:27101, download progress = 88%, status=ERR_OK
D2024-05-22 14:28:18.983 (1716359298983714700 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.97.9:27101, download progress = 88%, status=ERR_OK
The meta server then says to stop downloading and clear _metadata.files. However, not all download tasks are terminated successfully:
D2024-05-22 14:28:28.988 (1716359308988487559 102121) replica.replica2.04008ebc00010f46: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
At 14:28:29, a download_sst_file task still exists and accesses _metadata.files, causing a core dump:
D2024-05-22 14:28:29.529 (1716359309529341231 102089) replica.default6.04010000000007b5: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.102.47:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd1/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 92.sst
F2024-05-22 14:28:29.536 (1716359309536349665 102089) replica.default6.04010000000007b5: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
The secondary receives the primary replica's message to cancel the bulk load task and clears _metadata.files, but it likewise does not terminate all download tasks, so it core dumps too. Other replicas crash for the same reason, so many replica servers core dump:
D2024-05-22 14:28:28.992 (1716359308992917139 159129) replica.replica2.04006d6800010139: replica_bulk_loader.cpp:183:on_group_bulk_load(): [95.1@10.142.98.52:27101] receive group_bulk_load request, primary address = 10.142.102.47:27101, ballot = 7, **meta bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING**
D2024-05-22 14:28:30.384 (1716359310384983585 159094) replica.default5.040100080000056a: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 20
F2024-05-22 14:28:30.452 (1716359310452229248 159094) replica.default5.040100080000056a: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
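The common thread in both scenarios is that cleanup runs while download tasks are still in flight. A minimal sketch of the fix direction (hypothetical helper names, not the actual patch): signal cancellation and wait for every in-flight download task before freeing the bulk load states.

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <vector>

class replica_bulk_loader_sketch {
public:
    void start_download(size_t file_index) {
        _download_tasks.emplace_back([this, file_index] { download_sst_file(file_index); });
    }

    void clear_bulk_load_states() {
        _cancelled.store(true);            // 1. signal every download task to stop
        for (auto &t : _download_tasks) {  // 2. wait until each task has exited
            if (t.joinable())
                t.join();
        }
        _download_tasks.clear();
        _metadata_files.clear();           // 3. only now is freeing the states safe
    }

private:
    void download_sst_file(size_t file_index) {
        if (_cancelled.load())             // tasks re-check the flag before
            return;                        // touching _metadata_files
        // Reading _metadata_files[file_index] is safe here, because
        // clear_bulk_load_states() joins this task before clearing.
        if (file_index < _metadata_files.size()) {
            volatile size_t len = _metadata_files[file_index].size();
            (void)len;
        }
    }

    std::atomic<bool> _cancelled{false};
    std::vector<std::thread> _download_tasks;
    std::vector<std::string> _metadata_files;
};

int main() {
    replica_bulk_loader_sketch loader;
    loader.start_download(0);
    loader.clear_bulk_load_states(); // no task can outlive the states it reads
}
```

Pegasus schedules downloads through its task framework rather than raw threads, so the real fix would need to cancel or wait on those tasks, but the ordering requirement is the same: no task may outlive the states it references.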
_metadata.files has been cleared, so the f_meta.name that the download_sst_file function reads is a dangling reference, and its length is garbage (absurdly large):
// replica_bulk_loader.cpp, download_sst_file(): f_meta is a reference into
// _metadata.files; once that vector is cleared on another thread, the
// reference dangles and f_meta.name is garbage.
const file_meta &f_meta = _metadata.files[file_index];
const std::string &file_name = utils::filesystem::path_combine(local_dir, f_meta.name);
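A defensive pattern that would avoid the dangling reference (a standalone sketch with hypothetical names, not the actual patch): copy the file_meta out under a lock and re-check that the states still exist before using it.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <vector>

struct file_meta {
    std::string name;
};

std::mutex g_lock;              // stands in for the replica's internal lock
std::vector<file_meta> g_files; // stands in for _metadata.files

// Returns a deep copy of the metadata, or nullopt if the states were cleared
// (e.g. the bulk load was cancelled) between scheduling and execution.
std::optional<file_meta> snapshot_file_meta(size_t file_index) {
    std::lock_guard<std::mutex> guard(g_lock);
    if (file_index >= g_files.size())
        return std::nullopt;    // caller must bail out instead of dereferencing
    return g_files[file_index]; // the copy stays valid after the lock is gone
}

int main() {
    g_files.push_back({"33.sst"});
    auto meta = snapshot_file_meta(0);
    if (meta) {
        // Build the local path from meta->name, never from a reference
        // into g_files, which another thread may clear at any time.
        std::string file_name = "/local_dir/" + meta->name;
        (void)file_name;
    }
}
```

In the crash logs below, the reporter's added debug output records the garbage lengths directly ("chao chu" is pinyin for 超出, "exceeds", i.e. the path length exceeds 4096):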
log.1.txt:F2024-05-20 17:22:49.621 (1716196969621641630 170503) replica.default11.0401000b000000de: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 410828079
log.2.txt:F2024-05-20 17:23:38.595 (1716197018595730772 192879) replica.default10.040100040000002e: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 532715376
log.1.txt:F2024-05-20 17:22:50.77 (1716196970077285996 164438) replica.default11.0401000b000000c6: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 383022703
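The assertion's bound is presumably derived from Linux's PATH_MAX of 4096 minus a small reserve (4086 = 4096 − 10). A sketch (hypothetical code, not dsn's actual filesystem.cpp) of the failing check:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the length check in get_normalized_path.
void check_path_length(const std::string &path) {
    const size_t kMaxLen = 4086; // presumably PATH_MAX (4096) minus a reserve
    if (path.size() > kMaxLen) {
        // With a dangling f_meta.name, path.size() can be garbage such as
        // 410828079, which is exactly what the logs above record.
        std::fprintf(stderr, "path exceeds 4096, len = %zu\n", path.size());
        std::abort(); // the assertion failure that produces the core dump
    }
}

int main() {
    check_path_length("/home/work/ssd1/pegasus/bulk_load/92.sst"); // a sane path passes
}
```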
Bug Report
What did you do? Running bulk load (in the download-SST-file stage) together with any action that requires restarting ONE node may cause ALL nodes to core dump. ![image](https://github.com/apache/incubator-pegasus/assets/110282526/75792367-fe15-4407-9ab1-c2d4571aa1c0)
What did you see? There are three kinds of core dump on different nodes:
Type one: (screenshot not preserved)
Type two: (screenshot not preserved)
Type three: (screenshot not preserved)