aliyun / plugsched

Live upgrade Linux kernel scheduler subsystem
BSD 3-Clause "New" or "Revised" License

src: clean and rebuild the dying tasks #182

Closed: Dengerwei closed this 1 year ago

Dengerwei commented 1 year ago

When a task is dying, there is a gap between its removal from the init_task list and the release of the task. During this window the task cannot be found by walking the init_task list, so the rebuilder cannot handle it, including allocating and releasing its memory.

sched_cpu_dying suggests an approach: use pick_next_task to find these tasks. Because the kernel already relies on this pattern there, it should be safe.

Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>

anolis-bot commented 1 year ago

@Dengerwei, a new test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/40245

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :x: FAIL |

Sorry, your test job failed. Please get the details in the link.

ampresent commented 1 year ago

/retest

anolis-bot commented 1 year ago

@ampresent, the test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/40453

ampresent commented 1 year ago

LGTM, and we need a test case that covers this bug ASAP.

anolis-bot commented 1 year ago

@ampresent, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :x: FAIL |

Sorry, your test job failed. Please get the details in the link.

ampresent commented 1 year ago

Oh, no. It says,

/root/scheduler/kernel/sched/mod//sched_rebuild.c:108:11: error: too few arguments to function class->pick_next_task
    next = class->pick_next_task(rq);

Please fix it.

Dengerwei commented 1 year ago

> Oh, no. It says,
>
> /root/scheduler/kernel/sched/mod//sched_rebuild.c:108:11: error: too few arguments to function class->pick_next_task
>     next = class->pick_next_task(rq);
>
> please fix it

OK, thanks.

anolis-bot commented 1 year ago

@Dengerwei, the code has been updated, so a new test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/41969

Dengerwei commented 1 year ago

/retest

anolis-bot commented 1 year ago

@Dengerwei, the test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/41975

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :x: FAIL |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

Dengerwei commented 1 year ago

/retest

anolis-bot commented 1 year ago

@Dengerwei, the code has been updated, so a new test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42053

anolis-bot commented 1 year ago

@Dengerwei, the test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42055

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

anolis-bot commented 1 year ago

@Dengerwei, the code has been updated, so a new test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42077

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :x: FAIL |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

Dengerwei commented 1 year ago

/retest

anolis-bot commented 1 year ago

@Dengerwei, the test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42139

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :x: FAIL |
| mem_pressure_test | :white_check_mark: SUCCESS |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :x: FAIL |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

Dengerwei commented 1 year ago

/retest

anolis-bot commented 1 year ago

@Dengerwei, the test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42214

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :x: FAIL |
| mem_pressure_test | :white_check_mark: SUCCESS |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.

ampresent commented 1 year ago

There are many warnings in dmesg. Please fix them:

[  750.858487] Hi, scheduler mod is installing!
[  750.859671] scheduler: total initialization time is        1182625 ns
[  750.859672] scheduler module is loading
[  750.859798] ------------[ cut here ]------------
[  750.859799] rq->clock_update_flags < RQCF_ACT_SKIP
[  750.859833] WARNING: CPU: 2 PID: 20 at /root/scheduler/kernel/sched/mod//sched.h:1185 update_curr+0x27a/0x2d0 [scheduler]
[  750.859834] Modules linked in: scheduler(OE+) xt_conntrack(E) ipt_MASQUERADE(E) nft_counter(E) xt_comment(E) nft_compat(E) nft_chain_nat_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nf_tables(E) nfnetlink(E) veth(E) bridge(E) stp(E) llc(E) overlay(E) fuse(E) binfmt_misc(E) mousedev(E) intel_rapl_msr(E) intel_rapl_common(E) nfit(E) intel_powerclamp(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) i2c_piix4(E) aesni_intel(E) glue_helper(E) psmouse(E) pcspkr(E) pvpanic(E) sunrpc(E) sch_fq_codel(E) cirrus(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) crc32c_intel(E) ttm(E) serio_raw(E) drm(E) i2c_core(E) floppy(E) ipmi_devintf(E) ipmi_msghandler(E) [last unloaded: scheduler]
[  750.859852] CPU: 2 PID: 20 Comm: migration/2 Kdump: loaded Tainted: G           OE     4.19.91-26.6.an8.x86_64 #1
[  750.859852] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
[  750.859860] RIP: 0010:update_curr+0x27a/0x2d0 [scheduler]
[  750.859861] Code: 85 f0 01 00 00 e9 33 ff ff ff 80 3d 3d 07 05 00 00 0f 85 c8 fd ff ff 48 c7 c7 30 8d 83 c0 c6 05 29 07 05 00 01 e8 1c 8d 89 ef <0f> 0b e9 ae fd ff ff 80 3d 16 07 05 00 00 0f 85 23 fe ff ff 48 c7
[  750.859862] RSP: 0000:ffffac94831ffd40 EFLAGS: 00010086
[  750.859863] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
[  750.859863] RDX: 0000000000000026 RSI: ffffffffb1b14c86 RDI: 0000000000000046
[  750.859864] RBP: ffff8b80f478a400 R08: 000000ed0bd5796b R09: ffffac94831ffce0
[  750.859865] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8b81002a2640
[  750.859865] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8b80f6b02080
[  750.859866] FS:  0000000000000000(0000) GS:ffff8b8100280000(0000) knlGS:0000000000000000
[  750.859867] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  750.859867] CR2: 00000000000000f1 CR3: 000000077220a003 CR4: 00000000007706e0
[  750.859869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  750.859870] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  750.859870] PKRU: 55555554
[  750.859871] Call Trace:
[  750.859879]  enqueue_entity+0x426/0xc10 [scheduler]
[  750.859886]  enqueue_task_fair+0xc7/0xc50 [scheduler]
[  750.859890]  ? cpudl_clear+0x2c/0xb0
[  750.859892]  ? _raw_spin_lock+0x5/0x20
[  750.859899]  ? __enable_runtime.part.7+0x7a/0xc0 [scheduler]
[  750.859900]  ? _raw_spin_lock+0x5/0x20
[  750.859906]  ? update_runtime_enabled+0x41/0x80 [scheduler]
[  750.859914]  rebuild_sched_state+0x180/0x273 [scheduler]
[  750.859922]  __sync_sched_install+0xe3/0x1230 [scheduler]
[  750.859924]  multi_cpu_stop+0x6f/0xf0
[  750.859927]  ? cpu_stop_queue_work+0x90/0x90
[  750.859928]  cpu_stopper_thread+0x45/0xf0
[  750.859931]  ? sort_range+0x20/0x20
[  750.859932]  smpboot_thread_fn+0xc5/0x160
[  750.859933]  kthread+0x112/0x130
[  750.859934]  ? kthread_park+0x80/0x80
[  750.859935]  ret_from_fork+0x1f/0x40
[  750.859937] ---[ end trace e73683493518680b ]---
[  750.859993] scheduler load: current cpu number is                8
[  750.859994] scheduler load: current thread number is           241
[  750.859995] scheduler load: stop machine time is            274412 ns
[  750.859995] scheduler load: stop handler time is            112624 ns
[  750.859996] scheduler load: stack check time is              54924 ns
[  750.859996] scheduler load: all the time is                 320552 ns

anolis-bot commented 1 year ago

@Dengerwei, the code has been updated, so a new test job has been submitted. Please be patient. The test job URL: https://tone.openanolis.cn/ws/2r1b4c7z/test_result/42495

anolis-bot commented 1 year ago

@Dengerwei, the CI test has completed. Please check the results:

| Test Case | Test Result |
| --- | --- |
| schedule_test (x86_64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :white_check_mark: SUCCESS |
| mem_pressure_test | :white_check_mark: SUCCESS |
| schedule_test (aarch64) | :white_check_mark: SUCCESS |
| public_var_test | :white_check_mark: SUCCESS |
| var_uniformity_test | :white_check_mark: SUCCESS |
| cpu_throttle_test | :white_check_mark: SUCCESS |
| domain_rebuild_test | :white_check_mark: SUCCESS |
| sched_syscall_test | :x: FAIL |
| mem_pressure_test | :white_check_mark: SUCCESS |

Sorry, your test job failed. Please get the details in the link.