filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

scheduler logic bug #8624

Closed dafsic closed 2 years ago

dafsic commented 2 years ago

Lotus Version

lotus version 1.15.2+mainnet+git.d1eeb7ca3.dirty
lotus-miner version 1.15.2+mainnet+git.d1eeb7ca3.dirty

Describe the Bug

00000000 53799 4fb43665 worker-185 C2 assigned(1) 15m19.4s
00000000 53804 4fb43665 worker-185 PC2 prepared 14m30.9s
6f1b0aad 53803 4fb43665 worker-185 PC2 running 14m30.9s
00000000 53798 54346696 worker-186 C2 prepared 16m16.5s
a0f8b7fa 53797 54346696 worker-186 C2 running 16m23.6s
00000000 53815 54346696 worker-186 PC1 prepared 24.1s
8063bf87 53810 54346696 worker-186 PC1 running 1h21m52.5s
02064e90 53811 54346696 worker-186 PC1 running 1h5m51.2s
8125858b 53809 54346696 worker-186 PC1 running 2h32m22.3s
ce327ba8 53808 54346696 worker-186 PC1 running 2h57m24.8s
e8f83bf5 53807 54346696 worker-186 PC1 running 2h57m24.8s
5eba0687 53813 54346696 worker-186 PC1 running 30m10s
58aaa875 53812 54346696 worker-186 PC1 running 50m55.5s
f8bc5cc4 53814 54346696 worker-186 PC1 running 8m0.9s

All PC1 tasks are on worker-186. Some of them should be on worker-185, right?

These machines have exactly the same hardware configuration.

Logging Information

00000000 53799 4fb43665 worker-185 C2 assigned(1) 15m19.4s
00000000 53804 4fb43665 worker-185 PC2 prepared 14m30.9s
6f1b0aad 53803 4fb43665 worker-185 PC2 running 14m30.9s
00000000 53798 54346696 worker-186 C2 prepared 16m16.5s
a0f8b7fa 53797 54346696 worker-186 C2 running 16m23.6s
00000000 53815 54346696 worker-186 PC1 prepared 24.1s
8063bf87 53810 54346696 worker-186 PC1 running 1h21m52.5s
02064e90 53811 54346696 worker-186 PC1 running 1h5m51.2s
8125858b 53809 54346696 worker-186 PC1 running 2h32m22.3s
ce327ba8 53808 54346696 worker-186 PC1 running 2h57m24.8s
e8f83bf5 53807 54346696 worker-186 PC1 running 2h57m24.8s
5eba0687 53813 54346696 worker-186 PC1 running 30m10s
58aaa875 53812 54346696 worker-186 PC1 running 50m55.5s
f8bc5cc4 53814 54346696 worker-186 PC1 running 8m0.9s

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...' ...

Reiers commented 2 years ago

Hi @dafsic

Thanks for the report. It looks like you are running Lotus with custom code or local modifications (the version string ends in .dirty).

Please upgrade to stock Lotus and rebuild with make clean all.

I'm unable to reproduce the issue here. Did you upgrade all workers? Did you turn off dealmaking during the upgrade? Did you upgrade with sectors in the pipeline?

If the issue persists, leave a comment here with new logs and repro steps.

Thank you!

dafsic commented 2 years ago

Thanks for the reply.

dafsic commented 2 years ago

I found the cause. When a new AP or P1 (PC1) task arrives, the scheduler compares the workers' resource utilization and assigns the task to the worker with the lowest value. But when a worker is running a P2 or C2 task, its utilization counts as 100% even if it has no other tasks, because its GPU is in use (each worker has only one GPU). As a result, a worker whose GPU is idle keeps receiving new AP or P1 tasks, even if it is already running multiple P1 tasks. When assigning AP or P1 tasks, the scheduler should compare only CPU utilization.
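
To make the effect concrete, here is a minimal Go sketch of the behavior described above. It is not the Lotus scheduler code: the workerLoad type, the two scoring functions, pickWorker, and the worker names and CPU counts are all hypothetical and exist only to illustrate the difference between the two comparisons.

```go
package main

import "fmt"

// workerLoad is a hypothetical, simplified view of one worker's load.
// It is not a Lotus type.
type workerLoad struct {
	name     string
	cpuUsed  int  // CPU threads currently claimed by tasks
	cpuTotal int  // CPU threads available on the worker
	gpuBusy  bool // true while a P2 or C2 task holds the single GPU
}

// cpuUtilization scores a worker by CPU usage only (the proposed comparison).
func cpuUtilization(w workerLoad) float64 {
	if w.cpuTotal == 0 {
		return 1.0 // treat a worker without CPU info as fully loaded
	}
	return float64(w.cpuUsed) / float64(w.cpuTotal)
}

// combinedUtilization models the reported behavior (an assumption):
// a busy GPU makes the whole worker count as 100% utilized.
func combinedUtilization(w workerLoad) float64 {
	if w.gpuBusy {
		return 1.0
	}
	return cpuUtilization(w)
}

// pickWorker returns the name of the worker with the lowest score,
// or "" if the slice is empty.
func pickWorker(ws []workerLoad, score func(workerLoad) float64) string {
	if len(ws) == 0 {
		return ""
	}
	best := ws[0]
	for _, w := range ws[1:] {
		if score(w) < score(best) {
			best = w
		}
	}
	return best.name
}

func main() {
	workers := []workerLoad{
		// Only one GPU task running, CPU almost idle.
		{name: "worker-a", cpuUsed: 1, cpuTotal: 16, gpuBusy: true},
		// Already running several P1 tasks, GPU idle.
		{name: "worker-b", cpuUsed: 8, cpuTotal: 16, gpuBusy: false},
	}
	fmt.Println("combined:", pickWorker(workers, combinedUtilization))
	fmt.Println("cpu-only:", pickWorker(workers, cpuUtilization))
}
```

With these made-up numbers, the combined score sends the next P1 task to worker-b, which is already busy with P1 work, while the CPU-only score sends it to worker-a.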

dafsic commented 2 years ago

@Reiers I hope the scheduler can be enhanced in this way.