Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] GRResult error occurring a couple of times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit #15404

Closed. chain-enterprises closed this issue 3 months ago

chain-enterprises commented 1 year ago

What happened?

When the system (ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots) hits high IO load on the same block device as the Chia full node DB, the Chia debug.log shortly afterward shows "GRResult not ok". The number of plots and the lookup times all seem fine, but the harvester stops finding proofs until it is restarted. This happens 1-2 times in a 24-hour period on Alpha 4 through Alpha 4.3.

Whenever the error occurs, block validation time and lookup time consistently increase leading up to the error being thrown.

Reproducible with Nvidia Unix GPU Driver versions 530.30.03, 530.41.03, and 535.43.02

Version

2.0.0b3.dev56

What platform are you using?

Ubuntu 22.04, Linux kernel 5.15.0-73-generic, ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots

What ui mode are you using?

CLI

Relevant log output

2023-05-29T20:45:32.552 full_node chia.full_node.mempool_manager: WARNING  pre_validate_spendbundle took 2.0414 seconds for xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2023-05-29T20:45:42.620 full_node chia.full_node.mempool_manager: WARNING  add_spendbundle xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx took 10.06 seconds. Cost: 2924758101 (26.589% of max block cost)
2023-05-29T20:45:56.840 full_node chia.full_node.full_node: WARNING  Block validation time: 2.82 seconds, pre_validation time: 2.81 seconds, cost: None header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732042
2023-05-29T20:46:57.239 full_node chia.full_node.full_node: WARNING  Block validation time: 3.34 seconds, pre_validation time: 0.42 seconds, cost: 3165259860, percent full: 28.775% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732044
2023-05-29T20:49:26.913 full_node chia.full_node.full_node: WARNING  Block validation time: 2.40 seconds, pre_validation time: 0.49 seconds, cost: 2041855544, percent full: 18.562% header_hash: 8d0ce076a3270a0c8c9c8d1f0e73c9b5b884618ee34020d2a4f3ffafa459cfd0 height: 3732055
2023-05-29T20:51:06.259 full_node full_node_server        : WARNING  Banning 89.58.33.71 for 10 seconds
2023-05-29T20:51:06.260 full_node full_node_server        : WARNING  Invalid handshake with peer. Maybe the peer is running old software.
2023-05-29T20:51:27.986 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/chia/hdd23/plot-k32-c02-2023-04-23-someplot.plot. GRResult is not GRResult_OK.
2023-05-29T20:51:28.025 harvester chia.harvester.harvester: ERROR    File: /media/chia/hdd23/someplot.plot Plot ID: someplotID, challenge: 7b5b6f11ec2a86a7298cb55b7db8a016a775efea221104b37905366b49f2e2bd, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7f3544998f30>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: contractHash>, plot_public_key=<G1Element PlotPubKey>, file_size=92374601728, time_modified=1682261996.8218756)
2023-05-29T20:51:57.482 full_node chia.full_node.full_node: WARNING  Block validation time: 10.23 seconds, pre_validation time: 0.29 seconds, cost: 959315244, percent full: 8.721% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732059
2023-05-29T20:55:24.640 full_node chia.full_node.full_node: WARNING  Block validation time: 3.18 seconds, pre_validation time: 0.26 seconds, cost: 2282149756, percent full: 20.747% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732067
2023-05-29T20:56:01.825 wallet wallet_server              : WARNING  Banning 95.54.100.118 for 10 seconds
2023-05-29T20:56:01.827 wallet wallet_server              : ERROR    Exception Invalid version: '1.6.2-sweet', exception Stack: Traceback (most recent call last):
  File "chia/server/server.py", line 483, in start_client
  File "chia/server/ws_connection.py", line 222, in perform_handshake
  File "packaging/version.py", line 198, in __init__
packaging.version.InvalidVersion: Invalid version: '1.6.2-sweet'
chain-enterprises commented 1 year ago

Still happening with the Nvidia Linux beta GPU driver v535.43.02.

As soon as the following GPU error occurred, the GRResult error was thrown in the Chia debug.log:

[Tue May 30 19:49:17 2023] NVRM: Xid (PCI:0000:08:00): 31, pid=459359, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f53_718af000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

which led to the debug.log error:

2023-05-30T19:49:18.260 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /media/chia/hdd142/plot-k32-c02-2023-04-26-06-20-xxxxxxxxxxxxx.plot. GRResult is not GRResult_OK

liyujcx commented 1 year ago

Me too... Windows 10, GUI. GPU device: (screenshot)

Error log like this (screenshot), and no proofs any more....

ab0tj commented 1 year ago

Another "me too"

Dell R630 server, dual E5-2620v4 CPUs, 64GB RAM, Debian 11.7, Tesla P4 with 530.30.02 drivers.

reythia commented 1 year ago

Ref issue https://github.com/Chia-Network/chia-blockchain/issues/15470

This isn't limited to a few times a day. I switched to a pool to test proofs and got flooded with these errors on each partial until I fell back to CPU harvesting.

jinglenode commented 1 year ago

Same issue here!

Ubuntu 22.04 / kernel 5.15.0-73, Driver Version: 530.30.02, CUDA Version: 12.1, dual E5-2680 v4 / 256 GB 2133 MHz RAM / Tesla P4

Plots: C7 / around 9000

GRResult error in the Chia log + Nvidia FAULT_PDE ACCESS_TYPE_READ in the kernel log.

Happens randomly; worst case 2 hours, best case 20 hours without an error.

prodchia commented 1 year ago

I am facing the same GRResult issue. My details are:

Win 10, GTX 1060 with driver 535.98 / CUDA 12.2, E5-2690 v4 / 64 GB RAM. Currently 2428 C7 plots, and increasing. Using the Chia GUI.

The issue has happened twice in the last two days. Restarting the GUI fixed it.

thesemaphoreslim commented 1 year ago

I am able to consistently reproduce this error on a Windows-based system by using the Disable Device option in the display driver properties menu, waiting a few seconds, and enabling the device with the same button. The GRResult issue will then appear in the logs.

javanaut-de commented 1 year ago

I am also affected by this.

Running a distinct harvester (separated from full_node and farmer) on a BTC-T37 board with a Tesla P4 GPU and a LSI 9102 (SAS2116) HBA. Both HBA and GPU are attached via PCIe 1x gen2. Ubuntu 22.04 is running on a Celeron 1037U CPU with 4GB DDR3 RAM.

My harvester node is version 2.0.0b3.dev116 (bladebit alpha 4.3), obtained via the Chia Discord. I tried bladebit alpha 4.4, but it does not work at all. Farming 4280 C7 plots (bladebit) and some 300 uncompressed NFT plots.

Edit: In my opinion this should produce an error message in the logs, maybe even a critical one, but it should not stop the harvester from working.

github-actions[bot] commented 1 year ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

robcirrus commented 1 year ago

I am still periodically getting a GRResult error: GRResult is not GRResult_OK, received GRResult_OutOfMemory. On alpha 4.5 it was just GRResult is not GRResult_OK. There are no errors in Event Viewer. The harvester stops sending partials; chia start harvester -r resets it and it starts working again. Occurs about every 1-3 days.

Harvester only, no other activity on the server. Alpha 4.6 (and had the errors on Alpha 4.5). Nvidia Tesla P4, issues with drivers 528.89 and 536.25. HP Apollo 4200, Windows Server 2019, E5-2678v3, 64GB, all locally attached SAS/SATA drives. 3,434 C7 plots.

Kinda leaving this box as-is for testing this issue. I have other similar systems (>20 harvesters) with A2000 6GB GPUs and 4k-15k mainly C5 plots plus CPU-compressed plots, and no issues on them.

wjblanke commented 1 year ago

Can you try this with the release candidate? Let us know if you still see issues. Thanks.

ericgr3gory commented 1 year ago

I am running rc1 and I am getting the GRResult error. Debian 12, Nvidia driver 535.86.05, with a Tesla P4 as the harvester.

harold-b commented 1 year ago

Which GRResult specifically is it showing?

Synergy1900 commented 1 year ago

Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.

wallentx commented 1 year ago

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.

Can you try rc3? We added several architectures explicitly to the harvester and plotter

Synergy1900 commented 1 year ago

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.
>
> Can you try rc3? We added several architectures explicitly to the harvester and plotter

Installed the rc3. I will evaluate for the next couple of days. Thx!

Synergy1900 commented 1 year ago

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.
>
> Can you try rc3? We added several architectures explicitly to the harvester and plotter
>
> Installed the rc3. I will evaluate for the next couple of days. Thx!

Same result on RC3. Harvester stopped sending partials after the same error occurred.

kinomexanik commented 1 year ago

After replacing the GPU (GTX 1070) with an RTX 2080 Ti, I stopped getting GRResult errors.

Synergy1900 commented 1 year ago

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.
>
> Can you try rc3? We added several architectures explicitly to the harvester and plotter
>
> Installed the rc3. I will evaluate for the next couple of days. Thx!
>
> Same result on RC3. Harvester stopped sending partials after the same error occurred.

Same with RC6

jmhands commented 1 year ago

in these cases where the harvester drops out, do you see a message in dmesg about NVIDIA driver, or a windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRR event was logged after GRRResult is not GRResult_OK ?

robcirrus commented 1 year ago

On my Windows Server 2019 Standard with Tesla P4 (driver 536.25), E5-2697v3, 64GB RAM: I just got the latest errors earlier today, and it logged 3 messages for the same plot together. I have seen it report multiple consecutive errors sometimes, but not usually. There are no log entries before or after indicating other issues.

Here are some log entries from before and after the 3 errors earlier today:

2023-08-21T15:25:58.561 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.51637 s. Total 3434 plots
2023-08-21T15:26:07.546 harvester chia.harvester.harvester: INFO 11 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.59371 s. Total 3434 plots
2023-08-21T15:26:15.999 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.34372 s. Total 3434 plots
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 1.09760 s. Total 3434 plots
2023-08-21T15:26:36.080 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.35933 s. Total 3434 plots
2023-08-21T15:26:44.877 harvester chia.harvester.harvester: INFO 9 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.53123 s. Total 3434 plots

Nothing in the Application or System event viewer: no errors, no warnings, nothing about the Nvidia drivers.

harold-b commented 1 year ago

Does the harvester log show any GRResult_Failed messages at any point?

Synergy1900 commented 1 year ago

> in these cases where the harvester drops out, do you see a message in dmesg about NVIDIA driver, or a windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRR event was logged after GRRResult is not GRResult_OK ?

Hi,

Found no messages in dmesg. Once it happens, I keep getting the same GRResult is not GRResult_OK message until I restart the harvester (chia start -r harvester). There are no other messages in the debug.log. After the upgrade to RC6 it worked for about a day before the first error occurred again. Mostly it occurs randomly, multiple times a day.

Regards S.

Ubuntu 22.04.2 LTS (256 GB memory), Nvidia Tesla P4, Driver Version: 535.86.10, CUDA Version: 12.2

kinomexanik commented 1 year ago

There is one theory. Check the debug.log file: is there an "Invalid proof of space" error? I used to have a GTX 1070 and got the GRResult error. Then I installed a 2080 Ti and got no more GRResult errors, but now I periodically get the "Invalid proof of space" error instead. After checking those plots, I found a bad plot. I mean, the GRResult error may be due to the fact that there are bad plots.

kinomexanik commented 1 year ago

"2023-08-15T13:00:52.627 farmer chia.farmer.farmer : ERROR Invalid proof of space: b4107dc0d19ecbc636828695d4b65b44038770ae2e575c66ecf8472dd07ed142 proof:........" plots check -g b4107dc0d19ecbc636828695d4b65b44038770ae2e575c66ecf8472dd07ed142 and it will check the plot, and tell you where its located.

Synergy1900 commented 1 year ago

I already checked all my plots (multiple times), they are ok.

robcirrus commented 1 year ago

> Does the harvester log show any GRResult_Failed messages at any point?

I'm not finding any of these in my logs.

Mine will typically go days before it gets the GRResult is not GRResult_OK, received GRResult_OutOfMemory error.

4ntibala commented 1 year ago

Same issue for me. Happens every 1-2 days :(

Linux Mint 21.2 (full node), i7 / 32 GB, NVIDIA GeForce GTX 1050 Ti 4 GB, Driver Version: 535.86.10, CUDA Version: 12.2, Plots: 226 TB, Compression: C7, Chia Version: 2.0.0

robcirrus commented 1 year ago

I'm still getting GRResult is not GRResult_OK, received GRResult_OutOfMemory on the 2.0 release, which I installed several days ago. I am getting it on the 2 almost identical systems with the Tesla P4, driver 536.25: Windows Server 2019, 2x E5-2650v4, 32GB, Tesla P4. I have 8 other similar systems with A2000 6GB GPUs and 23 harvesting with CPU compression, and none of them have ever received the errors. About 70% C5 and 30% C7.

imba-pericia commented 1 year ago

> Does the harvester log show any GRResult_Failed messages at any point?

I tested the GPU with compression and noticed that the GRResult is not GRResult_OK, received GRResult_OutOfMemory error appeared after the pool contract difficulty was switched. At the same time, the harvester and farmer continued to work with uncompressed plots. After restarting only the harvester, with the contract difficulty unchanged, everything has worked for more than a day; before the difficulty switch it had also farmed for about three days.

Ubuntu 22.10, P106-90 6GB, Driver Version: 535.86.10, CUDA Version: 12.2, Compression: C6, Chia Version: 2.0.0

imba-pericia commented 1 year ago

Could not reproduce it; it was probably a coincidence.

4ntibala commented 1 year ago

> Does the harvester log show any GRResult_Failed messages at any point?

I just checked my log from the last crash. Nope, that error message is definitely not in there.

janit commented 1 year ago

I also have this pop up all the time in the logs with an A4000 with 4132 C7 plots on Chia 2.0.1 on Ubuntu 22.04 with 535 drivers.

Zir0h commented 1 year ago

Same issue on driver 535.104.05, CUDA 12.2, GTX 1050 2GB. nvtop memory looks OK; restarting the harvester fixes it temporarily. 660 C7 plots, Chia 2.0.1, Ubuntu 22.

Zir0h commented 1 year ago

I read on the discord to disable nvidia-persistenced.service and this seems to have worked in my case. I used to have problems every couple of hours, and now I'm over a day without issues.
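For reference, on a systemd-based host the workaround described above looks roughly like this. This is only a sketch; the service name comes from the comment, the rest is standard systemctl usage.

sudo systemctl disable --now nvidia-persistenced.service   # stop the service now and keep it from starting on boot
systemctl status nvidia-persistenced.service               # confirm it is inactive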

//edit: just as I posted that, it happened again :/

janit commented 1 year ago

> I read on the discord to disable nvidia-persistenced.service and this seems to have worked in my case. I used to have problems every couple of hours, and now I'm over a day without issues.
>
> //edit: just as I posted that, it happened again :/

Thanks for the update. Too bad that didn't fix it.

liutas21 commented 1 year ago

Same issue. Happens every 1-2 days

Windows 10 Pro (harvester), i7-4770T / 32 GB RAM, NVIDIA GeForce GTX 1050 Ti 4 GB, Driver Version: 31.0.15.3713, Plots: 22 TB, Compression: C7, Chia Version: 2.0.0

Xeppo commented 1 year ago

I'm also having the same issue - generally hits every 4-8 hours of harvesting.

Debian 12, 2x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 512 GB RAM, Nvidia Tesla P4 8GB, Driver Version: 535.104.05, CUDA Version: 12.2, Plots: 2511 (C7 compression), Chia Version: 2.0.1

Also, my lookup times are really bad - in the 10-15 second range. Not sure if it's related. I have another ~1000 plots that I'm not harvesting due to lookup times.

Edit: This is the result from the bladebit_cuda simulate command, which seems to indicate that it's not a hardware limitation:

~$ /opt/chia/bladebit/bladebit_cuda simulate /mnt/disk03/plot-k32-c07-[plot data].plot
[Simulator for harvester farm capacity for K32 C7 plots]
Random seed: 0xdccf37faf2445399ee39c32023204e69c38e0d1bdc31ad87536e4071fd856267
Simulating...

Context count                 : 1
Thread per context instance   : 0
Memory used                   : 333.6MiB ( 0.3GiB )
Proofs / Challenges           : 90 / 100 ( 90.00% )
Fetches / Challenges          : 59 / 100
Filter bits                   : 512
Effective partials            : 2 ( 3.39% )
Total fetch time elapsed      : 27.752 seconds
Average plot lookup time      : 0.470 seconds
Worst plot lookup lookup time : 2.259 seconds
Average full proof lookup time: 2.212 seconds
Fastest full proof lookup time: 2.165 seconds

compression | plot count | size TB | size PB
C7 | 8707 | 729 | 0.73

Thiussen89 commented 1 year ago

Also getting the GRResult error.

Ubuntu 22.04.3 LTS, CPU: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 256 GB RAM, Nvidia Tesla P4 8GB, Driver: 535.54.03, Chia 2.0.0rc6, CUDA Version: 12.2, Plots: 5000, C7 compression

2023-09-10T11:23:03.569 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 6/64: CC: 9f0dd6345a8fbbfe52668a3e1b9bce8c9912e479b3f6b87d2772b8aebe3b1fcf RC: eeff3a9adb9dd97a8f22de117b58b991b889b76eb052a2e5d664012db6b2bb83
2023-09-10T11:23:04.249 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /home/administrator/Desktop/Mounts/QNAP_Disk06/plot-k32-c07-2023-07-20-13-17-86b66cdb52ef80976118ad6c8c727330755e75020d01c24fa2a290efbac7febb.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory

NVRM: Xid (PCI:0000:81:00): 31, pid=2577920, name=chia, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f2a_c5d08000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

I am also getting the error during chia plots check:

GRResult is not GRResult_OK, received GRResult_OutOfMemory error in getting challenge qualities for plot /home/administrator/Desktop/Mounts/QNAP_Disk05/plot-k32-c07-2023-07-19-05-51-52ab45c097b75b8134e3df4e60423c33a615485149756517028c8d7d04d677c8.plot

imba-pericia commented 1 year ago

About two weeks of uptime, and now after 6-10 hours of work an error appears:

2023-09-16T20:47:30.129 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/d004/plot-k32-c06-2023-09-02-20-57-234f701b60b02086c831a36f66dea32119ac4bd0f1d9a177f598d4517559dab1.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory

And here are all the error entries in dmesg:

[   71.306476] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.104.05  Sat Aug 19 01:15:15 UTC 2023
[96827.411070] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
[96827.411076] NVRM: Xid (PCI:0000:0d:00): 31, pid=5876, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f0d_03f6c000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[823379.150257] NVRM: Xid (PCI:0000:0d:00): 31, pid=2096, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f3c_f5c33000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[931750.200912] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
[931750.200919] NVRM: Xid (PCI:0000:0d:00): 31, pid=27673, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f31_f4cce000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[975852.365567] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
[975852.365574] NVRM: Xid (PCI:0000:0d:00): 31, pid=32135, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f82_8414a000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Ubuntu 22.10 (LXD container) P106-90 6gb Driver Version: 535.86.10 CUDA Version: 12.2 Compression: C6 Chia Version: 2.0.0

j4s0n commented 1 year ago

Also experiencing this.

Unraid 6.12.4, Machinaris 2.0.0 running via Docker with GPU passthrough, SuperMicro server with dual Xeon E5-2667 v2s and 256 GB ECC DDR3, Nvidia GTX 1050 Ti, driver v535.104.05

Memory usage on the GPU is typically low -- sub-20%. Overall system memory usage almost never breaks 50%. Restarting the Machinaris docker container seems to fix the problem.

As a temporary work-around, I run this in the background on my server -- it detects the problem and automatically restarts my Machinaris container, which seems to clear it. This results in a couple of minutes of downtime rather than an undefined amount:

tail -f /path/to/chia/mainnet/log/debug.log | awk '/GRResult_OutOfMemory/{ system("docker restart machinaris"); }'
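For a bare-metal harvester without Docker, a similar watchdog can restart the harvester service instead. This is only a sketch, assuming the default mainnet log path and reusing the chia start harvester -r command mentioned earlier in this thread:

tail -F ~/.chia/mainnet/log/debug.log | while read -r line; do
    case "$line" in
        *"GRResult is not GRResult_OK"*)
            chia start harvester -r   # restart only the harvester service
            sleep 60                  # give it time to come back before matching again
            ;;
    esac
done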

imba-pericia commented 1 year ago

> About two weeks of uptime, and now after 6-10 hours of work an error appears:
>
> 2023-09-16T20:47:30.129 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/d004/plot-k32-c06-2023-09-02-20-57-234f701b60b02086c831a36f66dea32119ac4bd0f1d9a177f598d4517559dab1.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
>
> And here are all the error entries in dmesg:
>
> [   71.306476] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.104.05  Sat Aug 19 01:15:15 UTC 2023
> [96827.411070] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
> [96827.411076] NVRM: Xid (PCI:0000:0d:00): 31, pid=5876, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f0d_03f6c000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
> [823379.150257] NVRM: Xid (PCI:0000:0d:00): 31, pid=2096, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f3c_f5c33000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
> [931750.200912] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
> [931750.200919] NVRM: Xid (PCI:0000:0d:00): 31, pid=27673, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f31_f4cce000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
> [975852.365567] NVRM: GPU at PCI:0000:0d:00: GPU-01faf8f4-70da-066e-c4f2-39adb67d671e
> [975852.365574] NVRM: Xid (PCI:0000:0d:00): 31, pid=32135, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f82_8414a000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
>
> Ubuntu 22.10 (LXD container) P106-90 6gb Driver Version: 535.86.10 CUDA Version: 12.2 Compression: C6 Chia Version: 2.0.0

Now I thought this might be important: my farmer runs in an LXD container, not a VM. The drivers are installed on the host; cudatoolkit and everything else are inside the container. So, perhaps an important nuance, the error goes away when the container itself is restarted, while the host and the system kernel keep running (with an LXC container it is the host kernel). The host runs openSUSE Leap.

jpabon77 commented 1 year ago

I'm getting the same error every 2 days or so. I noticed most people here have the Tesla P4, the same as me, and everyone also had the same 535-series CUDA driver. I thought this might be an issue with that driver, however I tried a couple of older ones and the newest one: either the same issue, or Chia wouldn't even see the GPU. Running Windows Server with a Tesla P4, drivers 537.13, 518.03, 529.11, and 474.44, and Chia client 2.0.1.

kofttlcc commented 1 year ago

I have encountered this issue repeatedly with both the Tesla P4 and P100. This problem has persisted since the release of version 2.0, and it has not been resolved in either the official 2.0.1 release or the latest 2.1.0-RC3. My current workaround is to write a monitoring script in Python. Whenever the debug logs show this error, I immediately restart the program, which restores functionality. On average, this issue occurs at least 2-3 times per day. I have tried multiple GPU driver versions without success. My operating system version is Windows 10.

kofttlcc commented 1 year ago

And with the 2.1.0 release, the problem is still not solved.

jpabon77 commented 1 year ago

Still seeing the issue every 2-3 days.

gml007 commented 1 year ago

Hi all, I'm seeing the issue 4-6 times a day on three harvesters running Ubuntu 22.04 with the Nvidia driver for the P4 card.

Oct 15 23:48:08 2023
NVIDIA-SMI 535.104.12    Driver Version: 535.104.12    CUDA Version: 12.2

dmesg:

glamb@RDC-Har-002:~$ sudo dmesg | grep NVRM
[   17.603972] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.104.12 Wed Sep 20 09:35:17 UTC 2023
[65424.373562] NVRM: GPU at PCI:0000:d8:00: GPU-0827de19-dbbc-d2b3-fed8-ad549222ef80
[65424.373567] NVRM: GPU Board Serial Number: 0324417089914
[65424.373568] NVRM: Xid (PCI:0000:d8:00): 31, pid=23683, name=chia_harvester, Ch 00000009, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f90_27dc6000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[71669.417612] NVRM: Xid (PCI:0000:d8:00): 31, pid=2782800, name=chia_harvester, Ch 00000009, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f5e_95adb000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[86311.278442] NVRM: Xid (PCI:0000:d8:00): 31, pid=2931537, name=chia_harvester, Ch 00000009, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f32_90e75000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[97291.651289] NVRM: Xid (PCI:0000:d8:00): 31, pid=3581040, name=chia_harvester, Ch 00000009, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f23_655fe000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Chia Debug Log:

jpabon77 commented 1 year ago

I'm not sure if this is related, but what I am seeing is start_full_node.exe increasing over time.

(screenshot) Hours later I am seeing this (screenshot). While it's not much of an increase, it is still going up.
This is what my Tesla reports after the error (screenshot). The current reading makes sense, as it's basically not using the GPU anymore.

Also, this almost always happens right after an "Invalid proof of space" error, or shortly after one. I've checked the plots that produce the invalid proof, and it's always a good plot.

Evi5 commented 1 year ago

Getting the same error on chia-blockchain 2.1.1. Temporarily switching back to CPU.

kinomexanik commented 1 year ago

Not long ago I received this error while checking plots. 1080 Ti.