Open ninsmiracle opened 6 months ago
The phenomenon above is one of the issues triggered by the bulk load download stage:

large alloc 2560917504

After the clear_bulk_load_states function runs, download_sst_file tasks still remain, which causes the phenomenon above.
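As a rough illustration of the hazard (hypothetical names, not the actual replica_bulk_loader classes): clear_bulk_load_states drops the metadata while asynchronous download tasks are still in flight. Below is a minimal sketch of a defensive ordering that avoids it, assuming the tasks can be flagged as cancelled and waited on before the state is cleared.

```cpp
#include <atomic>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Minimal sketch with hypothetical types, not the real Pegasus code.
// Idea: never clear the shared file metadata until every in-flight
// download task has been cancelled and joined, so no task can read a
// dangling file_meta afterwards.
struct file_meta {
    std::string name;
};

class bulk_loader_sketch {
public:
    void start_download(std::size_t file_index) {
        _tasks.emplace_back(std::async(std::launch::async, [this, file_index] {
            if (_cancelled.load() || file_index >= _files.size()) {
                return; // cancelled, or metadata no longer covers this index
            }
            // ... download _files[file_index].name here ...
        }));
    }

    void clear_bulk_load_states() {
        _cancelled.store(true);          // 1. ask running tasks to stop
        for (auto &t : _tasks) t.wait(); // 2. wait for in-flight tasks
        _tasks.clear();
        _files.clear();                  // 3. only now drop the metadata
    }

private:
    std::atomic<bool> _cancelled{false};
    std::vector<std::future<void>> _tasks;
    std::vector<file_meta> _files;
};
```

In the failure described in this issue, step 2 is effectively missing: the state is cleared while download_sst_file tasks are still running.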
Operation: we restart one node.
The ballot increases, and clear_bulk_load_states_if_needed() clears replica 88.5's _metadata.files at 15:17:56.753.
D2024-05-20 15:17:56.753 (1716189476753079718 146668) replica.replica13.0404000d0000005d: replica_config.cpp:819:update_local_configuration(): 88.5@10.142.98.52:27101: update ballot to init file from 3 to 4 OK
D2024-05-20 15:17:56.753 (1716189476753147052 146668) replica.replica13.0404000d0000005d:clear_bulk_load_states_if_needed(): [88.5@10.142.98.52:27101] prepare to clear bulk load states, current status = replication::bulk_load_status::BLS_DOWNLOADING
D2024-05-20 15:17:56.753 (1716189476753464144 146668) replica.replica13.0404000d0000005d: replica_config.cpp:1045:update_local_configuration(): 88.5@10.142.98.52:27101: status change replication::partition_status::PS_INACTIVE @ 3 => replication::partition_status::PS_PRIMARY @ 4, pre(1, 0), app(0, 0), duration = 3 ms, replica_configuration(pid=88.5, ballot=4, primary=10.142.98.52:27101, status=3, learner_signature=0, pop_all=0, split_sync_to_child=0)
But at 15:17:56.873, replica 88.5 is still downloading SST files, which causes a core dump.
D2024-05-20 15:17:56.873 (1716189476873362400 146626) replica.default7.04010007000000ca: block_service_manager.cpp:181:download_file(): download file(/home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load/33.sst) succeed, file_size = 65930882, md5 = 7a4d3da9250f52b4e31095c1d7042c2f
D2024-05-20 15:17:58.348 (1716189478348326864 146626) replica.default7.04010007000000ca: replica_bulk_loader.cpp:479:download_sst_file(): [88.5@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p32_10G/5 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load,f_meta.name 33.sst
Operation: for app ingest_p4_10G partition 1, the bulk load files 88.sst, 89.sst, 90.sst and 93.sst are missing.
[general]
app_name : ingest_p4_10G
app_id : 95
partition_count : 4
max_replica_count : 3
[replicas]
pidx ballot replica_count primary secondaries
0 8 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st05.bj:27101]
1 7 3/3 c3-hadoop-pegasus-tst-st03.bj:27101 [c3-hadoop-pegasus-tst-st01.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
2 8 3/3 c3-hadoop-pegasus-tst-st04.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
3 3 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st04.bj:27101]
The primary replica fails to download file 88.sst and stops downloading all SST files.
log.1.txt:E2024-05-22 14:28:11.231 (1716359291231595252 102084) replica.default1.040100090000072b: replica_bulk_loader.cpp:520:download_sst_file(): [95.1@10.142.102.47:27101] failed to download file(88.sst), error = ERR_CORRUPTION
But the meta server tells it to continue downloading.
D2024-05-22 14:28:18.983 (1716359298983653491 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
The primary replica reports the download progress to the meta server.
D2024-05-22 14:28:18.983 (1716359298983689828 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:879:report_group_download_progress(): [95.1@10.142.102.47:27101] primary = 10.142.102.47:27101, download progress = 89%, status = ERR_CORRUPTION
D2024-05-22 14:28:18.983 (1716359298983703147 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.98.52:27101, download progress = 88%, status=ERR_OK
D2024-05-22 14:28:18.983 (1716359298983714700 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [95.1@10.142.102.47:27101] secondary = 10.142.97.9:27101, download progress = 88%, status=ERR_OK
The meta server then tells it to stop downloading (BLS_FAILED), and _metadata.files is cleared. However, not all download tasks are terminated successfully.
D2024-05-22 14:28:28.988 (1716359308988487559 102121) replica.replica2.04008ebc00010f46: replica_bulk_loader.cpp:71:on_bulk_load(): [95.1@10.142.102.47:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
At 14:28:29, a download_sst_file task still exists and accesses _metadata.files, causing a core dump.
D2024-05-22 14:28:29.529 (1716359309529341231 102089) replica.default6.04010000000007b5: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.102.47:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd1/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 92.sst
F2024-05-22 14:28:29.536 (1716359309536349665 102089) replica.default6.04010000000007b5: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
The secondary receives the primary replica's message to cancel the bulk load task and clears _metadata.files, but does not terminate all download tasks, which causes a core dump. Other replicas crash for the same reason, so many replica servers core dump.
D2024-05-22 14:28:28.992 (1716359308992917139 159129) replica.replica2.04006d6800010139: replica_bulk_loader.cpp:183:on_group_bulk_load(): [95.1@10.142.98.52:27101] receive group_bulk_load request, primary address = 10.142.102.47:27101, ballot = 7, **meta bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING**
D2024-05-22 14:28:30.384 (1716359310384983585 159094) replica.default5.040100080000056a: replica_bulk_loader.cpp:479:download_sst_file(): [95.0@10.142.98.52:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 20
F2024-05-22 14:28:30.452 (1716359310452229248 159094) replica.default5.040100080000056a: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
Because _metadata.files has been cleared, the f_meta reference in the download_sst_file function is dangling, so the length of f_meta.name read there becomes extremely large.
const file_meta &f_meta = _metadata.files[file_index];
const std::string &file_name = utils::filesystem::path_combine(local_dir, f_meta.name);
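For context, here is a standalone sketch (hypothetical data, not the real Pegasus code) of why the two lines above become dangerous once _metadata.files has been cleared: vector::operator[] does no bounds checking, so the saved file_index refers past the end of an empty vector, f_meta is a dangling reference, and whatever garbage is read as f_meta.name can report an enormous length, which is what trips the len <= 4086 assertion in get_normalized_path.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Standalone sketch with hypothetical data. After clear(), any index that a
// still-running task captured earlier is out of range; vector::operator[]
// would hand back a dangling reference without complaint, so the only safe
// options are to re-check the bounds or, better, never clear while tasks run.
struct file_meta {
    std::string name;
};

int main() {
    std::vector<file_meta> files = {{"88.sst"}, {"89.sst"}, {"90.sst"}};
    const std::size_t file_index = 2; // captured by a download task before the clear

    files.clear(); // what clearing the bulk load states effectively does

    if (file_index >= files.size()) {
        std::cout << "index " << file_index
                  << " is no longer valid, files.size() = " << files.size() << '\n';
        return 0;
    }
    // Without the check above, this would be undefined behavior: f_meta would
    // reference destroyed storage and f_meta.name could decode to any length.
    const file_meta &f_meta = files[file_index];
    std::cout << f_meta.name << '\n';
    return 0;
}
```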
log.1.txt:F2024-05-20 17:22:49.621 (1716196969621641630 170503) replica.default11.0401000b000000de: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 410828079
log.2.txt:F2024-05-20 17:23:38.595 (1716197018595730772 192879) replica.default10.040100040000002e: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 532715376
log.1.txt:F2024-05-20 17:22:50.77 (1716196970077285996 164438) replica.default11.0401000b000000c6: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 383022703
Bug Report
What did you do? Ran a bulk load (during the download SST file stage) together with any action that requires restarting ONE node; this may cause ALL nodes to core dump.
What did you see? There are three kinds of core dumps on different nodes.
Type one:
Type two:
Type three: