Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[BUG] Farmer faulted 2 times #500

Closed etemiz closed 3 years ago

etemiz commented 3 years ago

Describe the bug: Farmer has been unstable and closes abruptly.

To Reproduce Steps to reproduce the behavior:

  1. npm run electron &
  2. cat ~/.chia/.../log/debug.log

Expected behavior: Run indefinitely.

Screenshots

Screenshot from 2020-10-29 19-37-44 Screenshot from 2020-10-30 14-18-27


hoffmang9 commented 3 years ago

I want to confirm two things:

  1. This is Beta 17?
  2. You're using both a remote and local harvester?
etemiz commented 3 years ago

Yes beta 17. Yes, I have 1 local harvester, 5 remote harvesters and 1 local farmer.

mmulsow commented 3 years ago

I hit a similar issue with my farmer crashing unexpectedly. I have two nodes that are each running their own full node, farmer, and harvester. The full node and harvester continued to run but the farmer on both machines died after several days of running fine.

The farmer seems to have died a while back and I didn't immediately notice. All messages from the farmer disappeared after the crash:

$ grep -c farmer .chia/beta-1.0b17/log/debug.log.*
.chia/beta-1.0b17/log/debug.log.1:0
.chia/beta-1.0b17/log/debug.log.2:0
.chia/beta-1.0b17/log/debug.log.3:0
.chia/beta-1.0b17/log/debug.log.4:14575
.chia/beta-1.0b17/log/debug.log.5:15057
.chia/beta-1.0b17/log/debug.log.6:11195

And here are the last mentions of the string farmer in debug.log.4 on the first machine:

05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('119.253.42.49', 60355)
05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('108.211.109.173', 8444)
05:43:48.511 farmer farmer_server              : INFO     -> header_signature to peer ('127.0.0.1', 8444)
05:43:48.512 farmer farmer_server              : INFO     <- respond_signature from peer ('127.0.0.1', 45136)
05:43:48.671 full_node full_node_server        : INFO     <- header_signature from peer ('127.0.0.1', 38466)
05:43:48.673 full_node full_node_server        : INFO     <- respond_unfinished_block from peer ('24.186.76.230', 8444)
05:43:48.676 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.6.133.56', 8444)
05:43:48.677 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.149.90.80', 8444)
05:43:48.678 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('173.28.194.176', 55229)
05:43:48.678 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('12.181.169.147', 39878)
05:43:50.118 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('12.181.169.147', 39878)
05:43:50.119 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('108.211.109.173', 8444)
05:43:50.120 full_node full_node_server        : INFO     <- ping from peer ('108.211.109.173', 8444)
05:43:50.121 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('78.183.106.140', 52854)
05:43:50.122 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('85.6.133.56', 8444)
05:43:50.123 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('24.186.76.230', 8444)
05:43:50.124 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('119.253.42.49', 60355)
05:43:50.125 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('108.211.109.173', 8444)
05:43:50.126 full_node full_node_server        : INFO     -> pong to peer ('108.211.109.173', 8444)
05:43:51.576 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:51.581 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.6.133.56', 8444)
05:43:51.582 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.149.90.80', 8444)
05:43:51.582 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('119.253.42.49', 60355)
05:43:51.583 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('78.183.106.140', 52854)
05:43:51.584 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('173.28.194.176', 55229)
05:43:51.586 full_node full_node_server        : INFO     -> request_unfinished_block to peer ('174.89.217.214', 8444)
05:43:52.074 harvester harvester_server        : INFO     Received EOF from ('127.0.0.1', 8447), closing connection.
05:43:52.078 daemon src.daemon.server          : INFO     Websocket exception. Closing websocket with chia_farmer code = 1006 (connection closed abnormally [internal]), no reason Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 827, in transfer_data
    message = await self.read_message()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 895, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 971, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 1047, in read_frame
    frame = await Frame.read(
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/framing.py", line 105, in read
    data = await reader(2)
  File "/usr/lib/python3.8/asyncio/streams.py", line 721, in readexactly
    raise exceptions.IncompleteReadError(incomplete, n)
asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of 2 expected bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/src/daemon/server.py", line 132, in safe_handle
    async for message in websocket:
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 439, in __aiter__
    yield await self.recv()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 509, in recv
    await self.ensure_open()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 803, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: code = 1006 (connection closed abnormally [internal]), no reason

05:43:52.280 full_node src.full_node.full_node : INFO     Block is slow, expected 10717 seconds, waiting
05:43:52.280 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:52.281 full_node src.full_node.full_node : INFO     VDF will finish too late 47324 seconds, so don't propagate
05:43:52.282 full_node src.full_node.full_node : INFO     VDF will finish too late 9565 seconds, so don't propagate
05:43:52.283 full_node full_node_server        : INFO     Received EOF from ('127.0.0.1', 38466), closing connection.

Same error message on the second machine, but it looks like it happened a couple of hours earlier there, at 03:33:42.197, with New best unfinished block at height 16125.
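The IncompleteReadError in the daemon traceback is just asyncio's way of reporting that the farmer's end of the websocket vanished mid-frame; it can be reproduced in isolation with plain asyncio (no Chia code involved):

```python
import asyncio

async def main():
    reader = asyncio.StreamReader()
    reader.feed_eof()                    # simulate the peer (the farmer) dying
    try:
        await reader.readexactly(2)      # websockets expects a 2-byte frame header
    except asyncio.IncompleteReadError as e:
        print(f"{len(e.partial)} bytes read on a total of {e.expected} expected bytes")

asyncio.run(main())
```

This prints the same "0 bytes read on a total of 2 expected bytes" message seen in the log, which suggests the daemon's exception is a symptom of the farmer process dying, not the cause.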

mmulsow commented 3 years ago

Found this in dmesg on both machines:

Machine 1:

[309239.876342] show_signal_msg: 18 callbacks suppressed
[309239.876345] chia_farmer[45874]: segfault at 0 ip 00007ffb795f3293 sp 00007ffd5307ab00 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7ffb795a1000+106000]
[309239.876354] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128625] chia_farmer[4407]: segfault at 0 ip 00007f9903ec58b8 sp 00007ffcabd63190 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f9903e73000+106000]
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00
ToovaTech commented 3 years ago

I've had the same farming failure described in the issue. I had 3 machines on b15: Machine A ran the full stack, Machine B ran the full stack, and Machine C (a Pi) ran just a harvester pointed at Machine B. This setup ran perfectly for a long time. I upgraded all 3 machines to b17 and, wanting to test the full stack on the Pi, ran the full stack there; the Pi's config, however, was still pointing to Machine B. Since then I've had several farming failures. I've shut down the Pi farmer, now have 2 machines running the full stack, and will monitor for failures.

ghost commented 3 years ago

Same issue with my setup:

  * 1 farmer (running the GUI)
  * Multiple harvesters
  * All on B17

dmesg output from the farmer. These are the last few entries:

[182826.347682] chia_farmer[496215]: segfault at 0 ip 00007f7580a03293 sp 00007fff05031bc0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f75809b1000+106000]
[182826.347693] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[193835.893856] chia_farmer[719983]: segfault at 0 ip 00007fba71efb293 sp 00007ffcd6c09250 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fba71ea9000+106000]
[193835.893863] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[212038.981592] perf: interrupt took too long (4901 > 4900), lowering kernel.perf_event_max_sample_rate to 40750
[239036.993418] chia_farmer[939247]: segfault at 0 ip 00007faafcbb3293 sp 00007ffff114d5d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7faafcb61000+106000]
[239036.993424] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

ghost commented 3 years ago

I ran "chia start farmer" on the latest run, instead of using the GUI. Same issue.

Not sure if it has any value, but these are the chia processes still running & defunct

(venv) user1@farmer01:~$ ps -ef | grep chia
user1 1162542       1  0 13:15 pts/2 00:00:06 /home/user1/chia-blockchain/venv/bin/python3.8 /home/user1/chia-blockchain/venv/bin/chia run_daemon
user1 1162665 1162542  0 13:15 pts/2 00:01:02 chia_harvester
user1 1162666 1162542 10 13:15 pts/2 00:28:08 [chia_farmer] <defunct>
user1 1162667 1162542  6 13:15 pts/2 00:18:13 chia_full_node
user1 1162668 1162542  0 13:15 pts/2 00:00:43 chia_wallet
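The [chia_farmer] <defunct> entry means the farmer process has exited but the daemon has not yet reaped it. A minimal, Chia-independent illustration of how a defunct (zombie) entry arises on Linux:

```python
import os
import time

# Hypothetical illustration (not Chia code): a child exits, and until the
# parent calls wait() it lingers in the process table as <defunct>.
pid = os.fork()
if pid == 0:
    os._exit(0)                      # child dies immediately
time.sleep(0.2)                      # parent has not reaped it yet
with open(f"/proc/{pid}/stat") as f:
    state = f.read().split(")")[-1].split()[0]
print(state)                         # Z -> zombie/defunct, as ps shows it
os.waitpid(pid, 0)                   # reaping removes the <defunct> entry
```

So the daemon still holds a stale child slot for the crashed farmer; the interesting event is whatever killed the child, which the dmesg segfaults below point at.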

wjblanke commented 3 years ago

nm -D -n blspy.cpython-38-x86_64-linux-gnu.so with offset 0x52293 says it's in

00000000000520f0 T _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj

gdb blspy.cpython-38-x86_64-linux-gnu.so

Dump of assembler code for function _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj (DeriveChildSk):

0x00000000000520f0 <+0>: push %r15
0x00000000000520f2 <+2>: push %r14
0x00000000000520f4 <+4>: push %r13
0x00000000000520f6 <+6>: push %r12
0x00000000000520f8 <+8>: mov %edx,%r12d
0x00000000000520fb <+11>: push %rbp
0x00000000000520fc <+12>: mov %rsi,%rbp
0x00000000000520ff <+15>: bswap %r12d
0x0000000000052102 <+18>: push %rbx
0x0000000000052103 <+19>: sub $0x1a8,%rsp
0x000000000005210a <+26>: mov 0x2b4dd7(%rip),%rbx # 0x306ee8
0x0000000000052111 <+33>: mov %rdi,0x38(%rsp)
0x0000000000052116 <+38>: mov $0x20,%edi
0x000000000005211b <+43>: callq (%rbx)
0x000000000005211d <+45>: mov $0x4,%edi
0x0000000000052122 <+50>: mov %rax,0x40(%rsp)
0x0000000000052127 <+55>: callq (%rbx)
0x0000000000052129 <+57>: mov $0x20,%edi
0x000000000005212e <+62>: mov %rax,%r15
0x0000000000052131 <+65>: mov %rax,(%rsp)
0x0000000000052135 <+69>: callq (%rbx)
0x0000000000052137 <+71>: mov $0x20,%edi
0x000000000005213c <+76>: mov %rax,%r13
0x000000000005213f <+79>: callq (%rbx)
0x0000000000052141 <+81>: mov $0x1fe0,%edi
0x0000000000052146 <+86>: mov %rax,%r14
0x0000000000052149 <+89>: mov %rax,0x10(%rsp)
0x000000000005214e <+94>: callq (%rbx)
0x0000000000052150 <+96>: mov $0x1fe0,%edi
0x0000000000052155 <+101>: mov %rax,0x18(%rsp)
0x000000000005215a <+106>: callq (%rbx)
0x000000000005215c <+108>: mov %r12d,(%r15)
0x000000000005215f <+111>: mov %r13,%rsi
0x0000000000052162 <+114>: mov %rbp,%rdi
0x0000000000052165 <+117>: mov %rax,0x20(%rsp)
0x000000000005216a <+122>: callq 0x25850 _ZNK3bls10PrivateKey9SerializeEPh@plt
0x000000000005216f <+127>: lea 0x10(%r14),%rax
0x0000000000052173 <+131>: cmp %rax,%r13
0x0000000000052176 <+134>: jae 0x52185 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+149>
0x0000000000052178 <+136>: lea 0x10(%r13),%rax
0x000000000005217c <+140>: cmp %rax,%r14
0x000000000005217f <+143>: jb 0x52a88 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2456>
0x0000000000052185 <+149>: movdqu 0x0(%r13),%xmm1
0x000000000005218b <+155>: pcmpeqd %xmm0,%xmm0
0x000000000005218f <+159>: mov 0x10(%rsp),%rax
0x0000000000052194 <+164>: pxor %xmm0,%xmm1
0x0000000000052198 <+168>: movups %xmm1,(%rax)
0x000000000005219b <+171>: movdqu 0x10(%r13),%xmm3
0x00000000000521a1 <+177>: pxor %xmm3,%xmm0
0x00000000000521a5 <+181>: movups %xmm0,0x10(%rax)
0x00000000000521a9 <+185>: mov $0x20,%edi
0x00000000000521ae <+190>: mov $0x1,%ebp
0x00000000000521b3 <+195>: callq (%rbx)
0x00000000000521b5 <+197>: mov (%rsp),%rcx
0x00000000000521b9 <+201>: mov $0x20,%edx
0x00000000000521be <+206>: mov %r13,%rsi
0x00000000000521c1 <+209>: mov $0x4,%r8d
0x00000000000521c7 <+215>: mov %rax,%rdi
0x00000000000521ca <+218>: mov %rax,%r14
0x00000000000521cd <+221>: callq 0x26310 md_hmac@plt
0x00000000000521d2 <+226>: mov $0x20,%edi
0x00000000000521d7 <+231>: callq (%rbx)
0x00000000000521d9 <+233>: mov $0x1,%edi
0x00000000000521de <+238>: mov %rax,%r15
0x00000000000521e1 <+241>: callq (%rbx)
0x00000000000521e3 <+243>: mov $0x21,%edi
0x00000000000521e8 <+248>: mov %rax,0x30(%rsp)
0x00000000000521ed <+253>: callq (%rbx)
0x00000000000521ef <+255>: mov %r13,0x50(%rsp)
0x00000000000521f4 <+260>: mov %rax,%r12
0x00000000000521f7 <+263>: xor %eax,%eax
0x00000000000521f9 <+265>: mov %r14,0x28(%rsp)
0x00000000000521fe <+270>: mov %rbp,%r14
0x0000000000052201 <+273>: mov %r12,%r13
0x0000000000052204 <+276>: mov %rax,%rdx
0x0000000000052207 <+279>: mov %r15,%r12
0x000000000005220a <+282>: jmp 0x52245 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+341>
0x000000000005220c <+284>: nopl 0x0(%rax)
0x0000000000052210 <+288>: and $0x4,%r15d
0x0000000000052214 <+292>: jne 0x52a70 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2432>
0x000000000005221a <+298>: test %edx,%edx
0x000000000005221c <+300>: je 0x5222f <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+319>
0x000000000005221e <+302>: movzbl (%r12),%ecx
0x0000000000052223 <+307>: mov %cl,0x0(%rbp)
0x0000000000052226 <+310>: test $0x2,%dl
0x0000000000052229 <+313>: jne 0x52ac8 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2520>
0x000000000005222f <+319>: cmp $0xff,%r14
0x0000000000052236 <+326>: je 0x52315 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+549>
0x000000000005223c <+332>: mov 0x48(%rsp),%rdx
0x0000000000052241 <+337>: add $0x1,%r14
0x0000000000052245 <+341>: mov $0x1fe0,%r15d
0x000000000005224b <+347>: mov $0x20,%ecx
0x0000000000052250 <+352>: sub %rdx,%r15
0x0000000000052253 <+355>: cmp $0x20,%r15
0x0000000000052257 <+359>: cmova %rcx,%r15
0x000000000005225b <+363>: lea (%r15,%rdx,1),%rax
0x000000000005225f <+367>: mov %rax,0x48(%rsp)
0x0000000000052264 <+372>: mov 0x18(%rsp),%rax
0x0000000000052269 <+377>: lea (%rax,%rdx,1),%rbp
0x000000000005226d <+381>: cmp $0x1,%r14
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968>
0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi
0x0000000000052284 <+404>: mov %r12,%rdi
0x0000000000052287 <+407>: movdqu (%r12),%xmm2
0x000000000005228d <+413>: mov $0x20,%r8d
SegFault
0x0000000000052293 <+419>: movups %xmm2,0x0(%r13)
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13)
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13)
***** md_hmac
0x00000000000522a8 <+440>: callq 0x26310 md_hmac@plt
0x00000000000522ad <+445>: mov %r15d,%edx
0x00000000000522b0 <+448>: cmp $0x8,%r15d
0x00000000000522b4 <+452>: jb 0x52210 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+288>
0x00000000000522ba <+458>: mov (%r12),%rdx

Guessed possible call stack, based on the md_hmac call:

DeriveChildSk → ParentSkToLamportPK → IKMToLamportSk → ExtractExpand → Extract or Expand

static void Extract(uint8_t* prk_output, const uint8_t* salt, const size_t saltLen, const uint8_t* ikm, const size_t ikm_len) {
    // assert(saltLen == 4);  // Used for EIP2333 key derivation
    // assert(ikm_len == 32);  // Used for EIP2333 key derivation
    // Hash256 used as the hash function (sha256)
    // PRK Output is 32 bytes (HashLen)
    md_hmac(prk_output, ikm, ikm_len, salt, saltLen);
}

static void Expand(uint8_t* okm, size_t L, const uint8_t* prk, const uint8_t* info, const size_t infoLen) {
    assert(L <= 255 * HASH_LEN); // L <= 255 * HashLen
    assert(infoLen >= 0);
    size_t N = ceil((double)L / HASH_LEN);
    size_t bytesWritten = 0;

    uint8_t* T = Util::SecAlloc<uint8_t>(HASH_LEN);
    uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);
    uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

    assert(N >= 1 && N <= 255);

    for (size_t i = 1; i <= N; i++) {
        if (i == 1) {
            memcpy(hmacInput1, info, infoLen);
            hmacInput1[infoLen] = i;
            md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
        } else {
            memcpy(hmacInput, T, HASH_LEN);
            memcpy(hmacInput + HASH_LEN, info, infoLen);
            hmacInput[HASH_LEN + infoLen] = i;
            md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
        }
        size_t to_write = L - bytesWritten;
        if (to_write > HASH_LEN) {
            to_write = HASH_LEN;
        }
        assert (to_write > 0 && to_write <= HASH_LEN);
        memcpy(okm + bytesWritten, T, to_write);
        bytesWritten += to_write;
    }
    Util::SecFree(T);
    Util::SecFree(hmacInput1);
    Util::SecFree(hmacInput);
    assert(bytesWritten == L);
}

static void ExtractExpand(uint8_t* output, size_t outputLen,
                          const uint8_t* key, size_t keyLen,
                          const uint8_t* salt, size_t saltLen,
                          const uint8_t* info, size_t infoLen) {
    uint8_t* prk = Util::SecAlloc<uint8_t>(HASH_LEN);
    HKDF256::Extract(prk, salt, saltLen, key, keyLen);
    HKDF256::Expand(output, outputLen, prk, info, infoLen);
    Util::SecFree(prk);
}

Could that N double + ceil be causing mischief? Or maybe a sodium_malloc issue?
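The double-plus-ceil suspicion can at least be checked numerically for the parameters used here (assuming L = 32 * 255 and HASH_LEN = 32 from the EIP-2333 call quoted below): 8160/32 is exactly representable as a double, so ceil should give exactly 255 with no floating-point slop.

```python
import math

HASH_LEN = 32
L = 32 * 255                  # outputLen passed for the Lamport SK
N = math.ceil(L / HASH_LEN)   # the (double) division + ceil from Expand
print(N)  # 255: exactly the upper bound of the assert in Expand
```

So at least for this call site the arithmetic is exact, which points away from the ceil and toward the allocations.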

mov %r13,%rsi means movups %xmm2,0x0(%r13) is crashing on a reference to the second parameter to md_hmac, which is ikm (passed in as key) or hmacInput1/hmacInput (which could have problems if N were wrong).

Maybe memcpy is having issues with the zero length info from here

    const uint8_t info[1] = {0};
    HKDF256::ExtractExpand(outputLamportSk, 32 * 255, ikm, ikmLen, salt, saltLen, info, 0);

All just guesses
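The HKDF logic above can be modelled in a few lines of Python to sanity-check the sizes involved (a model for reasoning only, not the Chia implementation): with infoLen = 0, hmacInput1 is 1 byte, hmacInput is 33 bytes, and N lands exactly on the 255 bound.

```python
import hashlib
import hmac
import math

HASH_LEN = 32

def expand(okm_len, prk, info):
    # Pure-Python model of HKDF256::Expand above (RFC 5869 Expand with
    # SHA-256); used only to check sizes, not Chia's C++ code.
    n = math.ceil(okm_len / HASH_LEN)
    assert 1 <= n <= 255
    okm, t = b"", b""
    for i in range(1, n + 1):
        # Round 1 hashes info || i (the 1-byte hmacInput1 when info is empty);
        # later rounds hash T || info || i (the 33-byte hmacInput).
        t = hmac.new(prk, t + info + bytes([i]), hashlib.sha256).digest()
        okm += t
    return okm[:okm_len]

out = expand(32 * 255, b"\x00" * 32, b"")  # infoLen == 0, as in the crash path
print(len(out))  # 8160
```

The model runs fine with a zero-length info, which again suggests the index arithmetic is sound and the failure is in the allocations themselves.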

wjblanke commented 3 years ago

0x000000000005226d <+381>: cmp $0x1,%r14 // if (i == 1)
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968>

The branch was not taken, so i != 1 and we are in the else branch:

} else {
    memcpy(hmacInput, T, HASH_LEN);
    memcpy(hmacInput + HASH_LEN, info, infoLen);
    hmacInput[HASH_LEN + infoLen] = i;
    md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
}

0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx // prk (md_hmac param - RCX)
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi // hmacInput (md_hmac param - RSI)
0x0000000000052284 <+404>: mov %r12,%rdi // T (md_hmac param - RDI)
0x0000000000052287 <+407>: movdqu (%r12),%xmm2 // Move first 16 bytes of T to xmm2
0x000000000005228d <+413>: mov $0x20,%r8d // 32 (HASH_LEN md_hmac param - R8D)
SegFault
0x0000000000052293 <+419>: movups %xmm2,0x0(%r13) // Write first 16 bytes of T to hmacInput
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3 // Move second 16 bytes of T to xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13) // hmacInput[HASH_LEN + infoLen] = i;
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13) // Write second 16 bytes of T to hmacInput
md_hmac
0x00000000000522a8 <+440>: callq 0x26310 md_hmac@plt

It isn't happy copying T to hmacInput and fails on the first write. Since hmacInput allocates 33 bytes using sodium_malloc, the memory will be unaligned. Memcpy shouldn't care about that though (it is using unaligned movups). infoLen = 0 has been optimized out.

uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

hmmm. this must be failing. maybe we are running into this:

https://github.com/jedisct1/libsodium/issues/901

it might be worth trying setrlimit
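A sketch of the setrlimit idea (assumption: the crashes come from sodium_malloc's mlock'd allocations exhausting RLIMIT_MEMLOCK; an unprivileged process is always allowed to raise its soft limit up to the hard limit):

```python
import resource

# Raise the locked-memory soft limit to the hard limit before starting the
# farmer, giving sodium_malloc more headroom for mlock'd allocations.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_MEMLOCK))
```

This only buys headroom; if the allocations genuinely leak, the limit is eventually hit again.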

wjblanke commented 3 years ago

OK a different spot from above. This crashed at 0x528B8

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128625] chia_farmer[4407]: segfault at 0 ip 00007f9903ec58b8 sp 00007ffcabd63190 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f9903e73000+106000]
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00

The disassembly above branched to <+1968> when i == 1:

if (i == 1) {
    memcpy(hmacInput1, info, infoLen);  // optimized out
    hmacInput1[infoLen] = i;
    md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
}

0x00000000000528a0 <+1968>: mov 0x30(%rsp),%rax
0x00000000000528a5 <+1973>: mov 0x28(%rsp),%rcx
0x00000000000528aa <+1978>: mov $0x1,%edx
0x00000000000528af <+1983>: mov %r12,%rdi
0x00000000000528b2 <+1986>: mov $0x20,%r8d
**** SegFault
0x00000000000528b8 <+1992>: movb $0x1,(%rax) // hmacInput1[infoLen] = i;
0x00000000000528bb <+1995>: mov %rax,%rsi
0x00000000000528be <+1998>: callq 0x26310 md_hmac@plt

This is the i == 1 case. In this case it is trying to write into hmacInput1, somewhat the same. This fails?

uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);
arvidn commented 3 years ago

Here are a few observations, just from lightly browsing the code at: https://github.com/Chia-Network/bls-signatures/blob/d1e8f892d1941ff38da08a85cf17fa2e40f4ea2a/src/hkdf.hpp#L46

wjblanke commented 3 years ago

I am now thinking this is caused by a sodium_malloc leak. Stock Ubuntu will only tolerate 15000 of these sodium leaks and it looks like BNWrapper may have a leak.
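As a rough cross-check of that figure (assumptions: Ubuntu's default RLIMIT_MEMLOCK of 64 MiB, and roughly one locked 4 KiB page per leaked sodium_malloc allocation, ignoring guard-page overhead):

```python
# Each sodium_malloc locks at least one 4 KiB page; with a 64 MiB
# RLIMIT_MEMLOCK the budget runs out after roughly this many leaks.
limit_bytes = 64 * 1024 * 1024
bytes_per_leak = 4096
print(limit_bytes // bytes_per_leak)  # 16384, the same order as ~15000
```

That the back-of-envelope number lands in the same ballpark as the observed tolerance is consistent with a per-call mlock leak.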

hoffmang9 commented 3 years ago

@wjblanke has a potential fix for the crashing farmer. Can folks try the farm-crash-test https://github.com/Chia-Network/chia-blockchain/tree/farm-crash-test branch?

`git fetch; git checkout farm-crash-test; git pull; sh install.sh` etc...

I can make a Windows installer available too if anyone has seen this problem on Windows.

ghost commented 3 years ago

I ran the farm-crash-test branch on the farmer overnight; it still crashed (ran from 20:50 to 05:49). I only updated the farmer, do I need to get the harvesters on the same branch?

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585993] chia_farmer[3537]: segfault at 0 ip 00007f8e5e207293 sp 00007ffca4505760 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f8e5e1b5000+106000]
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784870] chia_farmer[239625]: segfault at 0 ip 00007fbd423f1503 sp 00007fff8a01b2d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fbd4239f000+106000]
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

hoffmang9 commented 3 years ago

You actually need to update the bls library (blspy is its Python name). The probable fix is in version 0.2.5 - https://pypi.org/project/blspy/#history

I modified setup.py in the branch above to pull that in when you run install.sh. If you wait about 20 minutes I'll have one more possible fix in version 0.2.6 of blspy.
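To confirm which blspy a venv actually picked up after install.sh, standard pip metadata can be queried (generic Python tooling, nothing Chia-specific):

```python
import importlib.metadata as md  # Python 3.8+

def pkg_version(name):
    # Returns the installed version string, or None if not installed.
    try:
        return md.version(name)
    except md.PackageNotFoundError:
        return None

print(pkg_version("blspy"))  # e.g. "0.2.5" after pulling the fixed branch
```

This avoids reports of crashes from venvs that silently kept the old wheel.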

hoffmang9 commented 3 years ago

Ok, I've updated farm-crash-test branch with the newest blspy of 0.2.6. Please pull and run that one.

ghost commented 3 years ago

I have been running the new version since 11:30 this morning (6.5 hours ago) with no crash yet. I will let it run through the night and report back tomorrow morning.

ghost commented 3 years ago

Unfortunately it crashed again. Let me know if I need to post any logs/dmesg output.

ToovaTech commented 3 years ago

Plots winding down now, will try out farm-crash-test today. One farmer did fail on regular b17 again, this time with just two plotters running full stack.

etemiz commented 3 years ago

beta-1.0b18.dev19 (blspy 0.2.5?) crashed after 24 hours of running.

wjblanke commented 3 years ago

Yes, please post the dmesg crash info.

ghost commented 3 years ago

This is the dmesg of my latest crash:

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585993] chia_farmer[3537]: segfault at 0 ip 00007f8e5e207293 sp 00007ffca4505760 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f8e5e1b5000+106000]
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784870] chia_farmer[239625]: segfault at 0 ip 00007fbd423f1503 sp 00007fff8a01b2d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fbd4239f000+106000]
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[225548.021245] perf: interrupt took too long (4951 > 4948), lowering kernel.perf_event_max_sample_rate to 40250
[258089.360167] chia_farmer[472880]: segfault at 0 ip 00007f61f9cf3dc3 sp 00007fffd21f0dc0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f61f9ca1000+107000]
[258089.360173] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

wjblanke commented 3 years ago

Hmm, same spot. We've found another leak; more info soon. I wonder why it picks this particular spot (well, and one other we know about) to crash, though.

eFishCent commented 3 years ago

I also get the same kind of crash with chia start harvester:

[141362.172271] chia_harvester[89830]: segfault at 0 ip 00007f4e20523293 sp 00007ffde54b62d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f4e204d1000+106000]
[141362.172281] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
hoffmang9 commented 3 years ago

I've just pushed a potentially fixed blspy. Please git pull the farm-crash-test branch and run it as you can.

ghost commented 3 years ago

Apologies for the delay in reporting back; I ended up reinstalling Linux on three of the harvesters to sort out the Python issues (these were upgrades from Ubuntu 19.x to 20.04).

I have had the farm-crash-test branch running on my farmer and all the harvesters for the last 9 hours, still going without any errors (it typically crashed about 6 hours in). I will keep an eye on it during our day.

ToovaTech commented 3 years ago

I've got two machines running strong on dev20 farm-crash-test for 3+ days. I've just installed dev21 on a third machine that had only been plotting.

ghost commented 3 years ago

Nearly 24 hours on the latest farm-crash-test branch and everything is still working fine. I ran the Farmer initially from the command line and then via the GUI, both working fine.

etemiz commented 3 years ago

1.0b18.dev21 with blspy 0.2.7 has been running for 47 hours without a problem.

hoffmang9 commented 3 years ago

I merged this into the dev branch. If anyone comes new to this issue, that's the branch you now need, not the farm-crash-test branch. Feel free to keep running it if you have the most recent farm-crash-test, however.

I'm going to leave this open for another couple of days to make sure we don't see any more of this over longer time periods.

eFishCent commented 3 years ago

I have been running farm-crash-test for 5 days straight with no seg fault on the harvester. Peak memory usage is 983MB for me, and current usage is 928MB. What is everyone seeing as peak and current memory usage for their harvester?

hoffmang9 commented 3 years ago

Having heard no further reports of failure, I'm closing this as fixed for the next release.

sonpython commented 3 years ago

Hi,

My version is 1.1.7.

It's a fresh install on Ubuntu without the GUI; I run only the harvester, farming to a remote farmer.

It crashes every couple of hours. When I check dmesg I get these messages:

[Sat Jun 19 06:07:37 2021] chia_harvester[1315021]: segfault at 7f034413b1f8 ip 00007f03a7555922 sp 00007f03526ae7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f03a74e8000+b8000]
[Sat Jun 19 06:07:37 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[Sat Jun 19 15:33:38 2021] chia_harvester[1498469]: segfault at 7fa4741a9a17 ip 00007fa49ef0a922 sp 00007fa48e7ee7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7fa49ee9d000+b8000]
[Sat Jun 19 15:33:38 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[Sat Jun 19 20:21:22 2021] chia_harvester[1583997]: segfault at 7f76c0137d47 ip 00007f772c483922 sp 00007f76d9fed7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f772c416000+b8000]
[Sat Jun 19 20:21:22 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[Sun Jun 20 00:34:44 2021] chia_harvester[1631373]: segfault at 7f50a81a4e25 ip 00007f50d40c4922 sp 00007f50d0d7c7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f50d4057000+b8000]
[Sun Jun 20 00:34:44 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
hoffmang9 commented 3 years ago

Almost always a bad plot.

arvidn commented 3 years ago

Our harvester should tolerate bad plots, though.

hoffmang9 commented 3 years ago

Indeed it should. @sonpython could you figure out which plot is bad and upload it to us?

sonpython commented 3 years ago

@hoffmang9

Sorry for the late reply.

It took me a couple of hours to upload the 100 GB to Google Drive: https://drive.google.com/file/d/1N5ehFWhv-cRibJpN90y6r5sh4jmfd72r/view?usp=sharing

chia plots check reports only one bad plot. But after I rename it to .plot.bad, the harvester still crashes and reports a segfault to dmesg:

[Sun Jun 20 10:11:00 2021] chia_harvester[1747413]: segfault at 7f16a80d07ee ip 00007f16d578f922 sp 00007f168fff17f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f16d5722000+b8000]
[Sun Jun 20 10:11:00 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[Sun Jun 20 15:42:43 2021] chia_harvester[1768453]: segfault at 7f3de810e403 ip 00007f3e7db21922 sp 00007f3dfe7ee7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f3e7dab4000+b8000]
[Sun Jun 20 15:42:43 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d