filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

In PC1, all three producer threads were bound to core group 0 #5076

Closed longmaosen closed 3 years ago

longmaosen commented 3 years ago

I run 3 PC1 workers on the same machine with FIL_PROOFS_USE_MULTICORE_SDR=1 set, started as taskset -c 0,1,2,3 lotus-worker run, taskset -c 4,5,6,7 lotus-worker run, and taskset -c 8,9,10,11 lotus-worker run. I expected every worker to stay on its own cpuset (worker 1: 0,1,2,3; worker 2: 4,5,6,7; worker 3: 8,9,10,11) and to seal one layer in about 20 minutes. Instead, all three workers bind to core group 0 (cpuset 0,1,2,3), and a layer takes 40 minutes on average.
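For reference, a minimal sketch of the setup described above, plus a way to check the affinity mask the kernel actually applied to each worker (the background `&` launches and the `pgrep` pattern are illustrative; `taskset -cp` prints a process's current affinity list):

```sh
# Enable multicore SDR for PC1 (from the report above).
export FIL_PROOFS_USE_MULTICORE_SDR=1

# Start three PC1 workers, each pinned to its own 4-core cpuset.
taskset -c 0,1,2,3   lotus-worker run &
taskset -c 4,5,6,7   lotus-worker run &
taskset -c 8,9,10,11 lotus-worker run &

# Check which affinity list each worker actually ends up with;
# per the report, all three converge on core group 0 (0-3).
for pid in $(pgrep -f 'lotus-worker run'); do
    taskset -cp "$pid"
done
```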

```
2020-11-29T23:28:10.477 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase1
2020-11-29T23:28:10.477 INFO storage_proofs_porep::stacked::vanilla::graph > using parent_cache[2048 / 1073741824]
2020-11-29T23:28:10.477 INFO storage_proofs_porep::stacked::vanilla::cache > parent cache: opening /data/cpfs/PROOFS_PARENT/v28-sdr-parent-21981246c370f9d76c7a77ab273d94bde0ceb4e938292334960bce05585dc117.cache, verify enabled: false
2020-11-29T23:28:10.477 INFO storage_proofs_porep::stacked::vanilla::proof > multi core replication
2020-11-29T23:28:10.477 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > create labels
2020-11-29T23:28:10.542 DEBUG storage_proofs_porep::stacked::vanilla::cores > Cores: 128, Shared Caches: 32, cores per cache (group_size): 4
2020-11-29T23:28:10.542 DEBUG storage_proofs_porep::stacked::vanilla::cores > checked out core group 0
2020-11-29T23:28:10.542 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in main thread
2020-11-29T23:28:10.542 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 0
2020-11-29T23:28:10.542 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 0
2020-11-29T23:28:10.559 INFO storage_proofs_porep::stacked::vanilla::memory_handling > initializing cache
```
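The `Cores: 128, Shared Caches: 32, cores per cache (group_size): 4` line reflects the shared-cache topology the prover groups cores by. A quick way to see the same grouping on Linux, as a sketch (the sysfs layout is standard, though the cache index holding L3 can vary by kernel/CPU):

```sh
# Each distinct line is one L3 cache's CPU list; on this machine they
# come out as groups of 4, which is what the prover calls core groups.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    cat "$cpu/cache/index3/shared_cpu_list"
done | sort -u
```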

```
2020-11-29T23:28:57.189 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Layer 1
2020-11-29T23:28:57.190 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Creating labels for layer 1
```


jennijuju commented 3 years ago

What's the hardware you are using?

maxmalong commented 3 years ago

PC1 worker: AMD EPYC 7H12 CPU; 2048 GiB RAM; lotus version 1.2.1
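Since the binding granularity follows the cache layout rather than just the core count, it may help to report the topology itself; a sketch with standard tools (`lscpu` from util-linux, `lstopo` from hwloc):

```sh
# Summarize sockets, cores, threads, and L3 cache sizes.
lscpu | grep -E 'Model name|Socket|Core|Thread|L3'

# hwloc's view of the machine: cores grouped under each shared L3,
# which is the grouping the multicore SDR log above reports.
lstopo --of console --no-io
```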

qiusugang commented 3 years ago

I have the same problem. It seems the PC1 worker does not honor the CPUs specified with taskset!

kimimhong commented 3 years ago

I have the same problem on an EPYC 7F32. When only one process runs, it lands on cores 0-3 and proceeds with multicore SDR, but when six processes run, they all still go to cores 0-3; the same thing happens even when taskset is used to assign cores.
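For illustration, the kind of per-process pinning being attempted here, assuming 4 cores per worker (the core ranges and worker count are illustrative):

```sh
# Six workers, each pinned to its own block of 4 cores (0-3 ... 20-23);
# per the reports above, they all still end up computing on 0-3.
for i in 0 1 2 3 4 5; do
    first=$((i * 4))
    taskset -c "${first}-$((first + 3))" lotus-worker run &
done
```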

cwhiggins commented 3 years ago

EPYC 7272, 512 GiB RAM. I just started experimenting with multicore yesterday. I see a 20% drop in time, from a little over 5 hours to 4 hours flat, for up to two PC1 tasks on the same worker. If I add another worker and give it a PC1 task, the system slows down.
All workers were started using taskset to specify CPU affinity. PID 8080 uses three cores (0-2) that it was not set to use, causing the slowdown. Looking at hwloc-ps -p showed this:

```
7059 PU:12 PU:13 lotus-worker                                     // add piece worker
7100 PU:0 PU:1 PU:2 PU:3 PU:4 PU:5 lotus-worker                   // PC1 worker #1
7144 PU:14 PU:15 lotus-worker                                     // PC2 worker
8080 PU:0 PU:1 PU:2 PU:6 PU:7 PU:8 PU:9 PU:10 PU:11 lotus-worker  // PC1 worker #2
```

So running two PC1 tasks on the first worker goes smoothly; when a task is added to the second worker, things slow down because it then tries to use cores that are already busy (0-2) and that should not even be assigned to that PID.
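One way to cross-check this, as a sketch: compare the kernel-side affinity list with where the worker's threads actually run (PID 8080 is the worker from the listing above; `psr` is the core a thread last ran on):

```sh
# Affinity list currently applied to the worker's main thread.
taskset -cp 8080

# Per-thread view: threads reporting psr 0-2 here, despite the taskset
# used at launch, mean the prover re-bound them itself.
ps -Lo tid,psr,comm -p 8080
```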

lotus version:
```
Daemon:  1.4.0+git.e9989d0e4+api1.0.0
Local: lotus version 1.4.0+git.e9989d0e4
```
github-actions[bot] commented 3 years ago

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 24 hours.

github-actions[bot] commented 3 years ago

This issue was closed because it is missing author input.