NebulousLabs / Sia

Blockchain-based marketplace for file storage. Project has moved to GitLab: https://gitlab.com/NebulousLabs/Sia
https://sia.tech
MIT License

Sia host became very slow after unlocking wallet in v1.3.3 #3111

Open starius opened 6 years ago

starius commented 6 years ago

BUG REPORT

Stack Trace or error message

I have a host with a lot of data uploaded (several TB). I updated to v1.3.3 a few days ago. Now, 10-30 minutes after I unlock the wallet, all siac commands become very slow (sometimes only siac host, sometimes both siac and siac host). They run for more than 10 minutes. SiaHub thinks that the host is down.

Dump of goroutines when it is stuck: https://gist.githubusercontent.com/starius/dcce9bf43197eab55f2a180d8c7b4d3c/raw/4e9275abdb15d7d7bc999bce40e8a627f450c8a2/gistfile1.txt

I think there is a deadlock somewhere in the code.

Is it safe to downgrade to the previous version?

Expected Behavior

Everything works without hanging.

How to reproduce it (as minimally and precisely as possible)

It happened only on one of my hosts. Others work well.

Environment

starius commented 6 years ago

After a few hours it fixed itself somehow. Now only 283 goroutines are running.

In transactionpool.log I see a lot of lines like these:

accept.go:335: [DEBUG] Beginning broadcast of transaction set
accept.go:340: [DEBUG] Transaction set broadcast has failed: transaction set contains only duplicate transactions

starius commented 6 years ago

I recorded mutex profiling using runtime.SetMutexProfileFraction(1) and got the following profile.

[Image: profile001]

P.S. It is worth enabling this mode in the profile/ directory as well.
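For anyone who wants to collect a similar profile, here is a minimal sketch of how mutex profiling can be switched on with runtime.SetMutexProfileFraction(1) and read back through runtime/pprof. The helper names captureMutexProfile and contend are made up for this illustration and are not part of Sia; the contention is artificial.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// captureMutexProfile (hypothetical helper) enables mutex profiling, runs
// work, and returns the "mutex" profile in pprof's text format (debug=1).
func captureMutexProfile(work func()) (string, error) {
	// A fraction of 1 asks the runtime to report every contention event.
	// The call returns the previous setting, which is restored on exit.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	work()

	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// contend (hypothetical workload) makes several goroutines fight over one
// mutex so that the profile has something to record.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(5 * time.Millisecond) // hold the lock for a while
			mu.Unlock()
		}()
	}
	wg.Wait()
}

func main() {
	out, err := captureMutexProfile(contend)
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Println(out)
}
```

In a long-running daemon the profile would more likely be written to a file or served over HTTP and then inspected with go tool pprof; the text dump here only keeps the sketch short.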

starius commented 6 years ago

I see RPCRenewContract is involved in the problematic branch in the profile.

starius commented 6 years ago

I think I found the root cause.

managedRPCRenewContract calls managedFinalizeContract, which calls managedAddStorageObligation, which locks h.mu and, while holding the mutex, calls AddSectorBatch, which goes through all sectors of the contract and updates their counts.

For a big contract, updating all the counts can take a while, and all other functions that need h.mu (basically everything related to the host) are blocked in the meantime. My suggestion is to avoid holding h.mu while the counts are updated. The count updates can be done in the background in an idempotent way (in case the server crashes). Or just use a tracing GC instead of refcounts.
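A rough sketch of the locking pattern I have in mind, not the actual host code. The host type, its fields, and addObligation are all hypothetical stand-ins; the only point is that the host-wide mutex is released before the per-sector loop, which runs under a separate lock. Crash safety and idempotency are deliberately left out here.

```go
package main

import (
	"fmt"
	"sync"
)

// host is a hypothetical stand-in for the real host type.
type host struct {
	mu          sync.Mutex          // plays the role of h.mu
	obligations map[string][]string // contract ID -> sector IDs

	sectorMu sync.Mutex     // separate lock for sector reference counts
	refs     map[string]int // sector ID -> reference count
}

func newHost() *host {
	return &host{
		obligations: make(map[string][]string),
		refs:        make(map[string]int),
	}
}

// addObligation records the obligation under h.mu, then releases h.mu
// before running the loop whose cost grows with the contract size.
func (h *host) addObligation(id string, sectors []string) {
	h.mu.Lock()
	h.obligations[id] = sectors // short critical section
	h.mu.Unlock()

	// Slow part: only callers that touch the counts wait on sectorMu;
	// everything else that needs h.mu can proceed.
	h.sectorMu.Lock()
	defer h.sectorMu.Unlock()
	for _, s := range sectors {
		h.refs[s]++
	}
}

func main() {
	h := newHost()
	h.addObligation("contract-1", []string{"a", "b"})
	h.addObligation("contract-2", []string{"a"})
	fmt.Println("refs for sector a:", h.refs["a"])
}
```

The trade-off is that, for a short window, the obligation exists while its counts are not yet updated, so the real code would need to tolerate or repair that state after a crash.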

DavidVorick commented 6 years ago

Yes, this is a known issue actually. There are a few potential fixes; the one I would really like to see involves redoing the way the host stores sectors so that you can renew them all in constant time. The other thing you can do is write a big WAL entry indicating which sectors need to be updated, and then update them later without blocking the whole time. It's still a big scalability issue, but at least it does not cause severe blocking.
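As a very rough illustration of the second idea (an in-memory sketch only, not Sia's WAL and not how the host is implemented): one entry lists the sectors to update, and the updates are then applied in small batches so the lock is held only briefly each time. The names sectorStore, pendingUpdate, logRenewal, and applyBatch are invented for this example, and a real version would persist the entry and its progress to disk.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingUpdate is a hypothetical log entry: the sectors of one renewed
// contract plus how many of them have already been processed.
type pendingUpdate struct {
	contractID string
	sectors    []string
	applied    int // progress marker, so a replay resumes instead of repeating
}

// sectorStore is a hypothetical stand-in for the host's sector bookkeeping.
type sectorStore struct {
	mu      sync.Mutex
	refs    map[string]int
	pending []*pendingUpdate
}

func newSectorStore() *sectorStore {
	return &sectorStore{refs: make(map[string]int)}
}

// logRenewal only appends one entry; the lock is held for a constant amount
// of work regardless of how many sectors the contract has.
func (s *sectorStore) logRenewal(id string, sectors []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending = append(s.pending, &pendingUpdate{contractID: id, sectors: sectors})
}

// applyBatch updates at most n sectors of the oldest entry and reports
// whether any entries remain. Each call holds the lock only for one batch.
func (s *sectorStore) applyBatch(n int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.pending) == 0 {
		return false
	}
	e := s.pending[0]
	end := e.applied + n
	if end > len(e.sectors) {
		end = len(e.sectors)
	}
	for _, sec := range e.sectors[e.applied:end] {
		s.refs[sec]++
	}
	e.applied = end
	if e.applied == len(e.sectors) {
		s.pending = s.pending[1:] // entry fully applied, drop it
	}
	return len(s.pending) > 0
}

func main() {
	s := newSectorStore()
	s.logRenewal("contract-1", []string{"a", "b", "c"})
	for s.applyBatch(2) {
		// other lock users can run between batches
	}
	fmt.Println("refs for sector c:", s.refs["c"])
}
```

The total work is still proportional to the number of sectors, which matches the remark above that this remains a scalability issue; it only avoids one long uninterrupted lock hold.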

volvox-globator commented 6 years ago

Same issue here. I had to kill (SIGKILL) the siad daemon after 12+ hours of excessive I/O load. I've been delisted from siahub.info and wasn't able to accept new contracts. I've been forced to downgrade to 1.3.2. I hope this will be fixed soon, as 1.3.3 is completely unusable for me. Or is there some workaround to avoid those locks?

starius commented 6 years ago

Did the downgrade to 1.3.2 help?

volvox-globator commented 6 years ago

I downgraded just a while ago, will see. I have a similar setup to yours: running a host on 64-bit Debian Linux, hosting 2+ TB and about 900 contracts. It seems everything is all right now, but I have to wait for new contract calls to be sure. I can provide more information if you want.

volvox-globator commented 6 years ago

Well, downgrading didn't solve anything; same behavior with 1.3.2 after a few hours.

starius commented 6 years ago

I think a workaround would be to reject a renewal if the contract size is more than X. But I think the same issue happens when a contract is finished as well.

volvox-globator commented 6 years ago

After some time it returned to normal. I suspect there was some huge contract. I'm considering moving the Sia-generated files to an SSD to speed up the process in the future.

EvilRedHorse commented 6 years ago

I am having the same error hosting with version 1.3.3; increasing ulimit nofile to 10000+ does not help. First there is a slowdown on RPC, then Sia-UI crashes after being unable to communicate over the RPC/API port 9980. During the slowdown, the wallet module returns a result, while the host module returns nothing. I opened a ticket, issue #3141, after finding a similar nofile problem from the past.