Closed gmazurki closed 3 months ago
Hi @gmazurki, this issue reproduces in nSIM. I'll look into what the problem might be.
For loading/unloading modules, please check that CONFIG_MODULE_UNLOAD=y is set in the Linux config. I managed to load and unload the scftorture module several times before it froze.
Yes, the option CONFIG_MODULE_UNLOAD is set to y in my case.
@gmazurki may I ask you to test this temporary fix on hardware?
In this function: https://elixir.bootlin.com/linux/v5.16.20/source/kernel/time/timekeeping_internal.h#L30
please add the following "if" check:
static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
	if (now < last)
		return 0;
	return (now - last) & mask;
}
Since I use nSIM rather than hardware, I would like to confirm that we are both seeing the same problem.
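For readers who want to see what the proposed check changes, here is a minimal user-space sketch. It is not kernel code: `uint64_t` stands in for the kernel's `u64`, and the names `clocksource_delta_guarded` and `clocksource_delta_plain` are hypothetical, chosen only for this illustration. The unguarded variant is included purely for comparison.

```c
#include <stdint.h>

/* Guarded variant, mirroring the temporary fix proposed in this thread.
 * uint64_t is a user-space stand-in for the kernel's u64. */
static inline uint64_t clocksource_delta_guarded(uint64_t now, uint64_t last,
                                                 uint64_t mask)
{
    /* A reading that appears to go backwards is reported as "no time elapsed". */
    if (now < last)
        return 0;
    return (now - last) & mask;
}

/* Unguarded variant for comparison: the subtraction wraps modulo 2^64
 * and the result is then truncated to the counter width by the mask. */
static inline uint64_t clocksource_delta_plain(uint64_t now, uint64_t last,
                                               uint64_t mask)
{
    return (now - last) & mask;
}
```

One consequence worth keeping in mind: with a counter narrower than 64 bits, a legitimate wraparound also has now < last, so the guarded variant returns 0 where the unguarded one returns the real elapsed count. That fits the description of the change as a temporary fix for narrowing down the problem.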
With the code change you proposed I see the same result: the hang message and the inability to unload the module. I think it would make sense to reproduce it on your FPGA with a multicore cluster.
Understood. The symptoms when running in nSIM are exactly the same as you described above, but the root cause must be different then. OK, I'll try to get a HAPS system. Thanks.
Just in case, nSIM is able to run SMP configurations with cluster simulation. I use HS58x3 config.
> nSIM is able to run SMP configurations with cluster simulation
Is that possible with the free version of nSIM? I do have a multicore configuration (tcf), but only one core starts and the rest time out, so Linux runs on a single core only.
For SMP configs you need to use nSIM + MetaWare debugger anyway. Standalone nsimdrv can't run multicore configs. I have never used the free nSIM version. I suggest raising this question by e-mail and adding me to the thread.
Can you try this config change?
# CONFIG_ARC_HAS_LLSC is not set
CONFIG_ARC_HAS_ATLD=y
I changed the config as you suggested. In this configuration the scftorture module works as expected - no "hang" messages and the module can be loaded/unloaded.
However, there is a problem with booting: the kernel hangs in the middle of booting and continues only if OpenOCD is attached over JTAG. Maybe it sends a resume command?
What does that prove? What is the valid configuration for HS58? LLSC or ATLD?
I tried once again with a clean build (I built everything from scratch). Now Linux boots properly without any of the strange behavior described previously. The scftorture module works fine as well.
I think we can close this issue, but it would be worth explaining why this change was needed. Neither option has a description in the kernel menuconfig.
EDIT:
I found relevant commits with a bit of explanation:
https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/da7891b5f2e42e85c5ded4a1588d4d0f2598dfd0
https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/bc57bd4fa52b54bb61bd9040e01910341777da23
"LLSC could be not desirable because of llock/scond livelock issue"
You can use only ATLD and disable LLSC; that is an absolutely valid configuration. The commits you found describe exactly our case. We will enable LLSC as soon as it is valid to use, but for now I will disable this option in the kernel configuration.
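To illustrate the general difference between the two styles of atomic update being discussed, here is a rough user-space analogy in C11. It is not the ARC kernel implementation: the function names are hypothetical, and it assumes that ATLD performs the update as a single read-modify-write instruction while LLSC relies on a retry loop around llock/scond. How a compiler lowers these C11 calls depends on the target, so treat this only as a sketch of the two shapes.

```c
#include <stdatomic.h>

/* Retry-loop shape (analogous to an LL/SC sequence): read the value,
 * compute the new one, try to publish it, and retry if another CPU
 * modified the variable in between. */
static int add_return_retry_loop(atomic_int *v, int i)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* On failure 'old' is refreshed with the current value and the loop
     * tries again; under heavy contention the number of retries has no
     * fixed bound. */
    while (!atomic_compare_exchange_weak_explicit(v, &old, old + i,
                                                  memory_order_seq_cst,
                                                  memory_order_relaxed))
        ;
    return old + i;
}

/* Single read-modify-write shape: one atomic operation, no software
 * retry loop. */
static int add_return_single_rmw(atomic_int *v, int i)
{
    return atomic_fetch_add_explicit(v, i, memory_order_seq_cst) + i;
}
```

Both functions return the same result when uncontended. The difference only matters when several cores hammer the same variable, which is the kind of load a torture test generates.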
The hang during booting was most likely caused by my incremental build, in which not everything was rebuilt. With a clean build we don't see this issue anymore.
@gmazurki Do you mean even with LLSC=y, no issue after clean build? Anyway ATLD=y LLSC=no is the right config to use now.
I was referring to the ATLD configuration. Initially we experienced the boot issue with ATLD turned on, but with a clean build it works OK. So we have switched to ATLD now.
EDIT: unfortunately the hang during Linux boot still shows up occasionally and is build-specific, i.e. for one buildroot build it shows up (~20% repro rate) but for another build it does not happen at all. I'll raise a separate issue for this specific problem.
I see a somewhat similar issue with the rcutorture module. After loading rcutorture and waiting a bit, a task hangs and the following message is printed:
# modprobe rcutorture
rcu-torture:--- Start of test: nreaders=2 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 stall_cpu_block=0 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0 read_exit_delay=13 read_exit_burst=16 nocbs_nthreads=0 nocbs_toggle=1000
...
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:221 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 221 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x286/0x630
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0xe8/0x120
ret_from_fork+0x14/0x18
This problem is not solved by using ATLD. I'm not sure if this bug should be reopened or I should raise a separate issue.
The scftorture kernel module causes CPU core(s) to hang while executing the torture test.
The module cannot be unloaded: attempting to do so causes the rmmod command to hang, and the console becomes unresponsive.
Usually it is not possible to execute any command on the console. It looks like the CPUs are busy running the torture test, so no CPU cycles are given to other processes.
I checked on a different platform to see what we should expect, and on a Raspberry Pi 4 the module does not generate "BUG: soft lockup" messages and can be loaded/unloaded successfully many times.
We have a 3xHS58 system.
Note that the rcutorture and locktorture modules behave correctly and don't print any worrying messages. They can be loaded/unloaded at any time.
I looked at some registers to check where the core is spinning; the location in the listing is address 813092a8: