Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] GRResult error occurring a couple of times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit #15404

Closed chain-enterprises closed 3 months ago

chain-enterprises commented 1 year ago

What happened?

When the system (ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots) hits a high I/O load on the same block device as the Chia full node DB, the debug.log shortly afterwards shows "GRResult is not GRResult_OK". The number of plots, lookup times, etc. all seem fine, but the harvester stops finding proofs until it is restarted. This happens 1-2 times in a 24-hour period on Alpha 4 through Alpha 4.3.

Whenever the error occurs, block validation time and lookup time consistently increase leading up to the point where the error is thrown.

Reproducible with Nvidia Unix GPU Driver versions 530.30.03, 530.41.03, and 535.43.02

Version

2.0.0b3.dev56

What platform are you using?

Ubuntu 22.04, Linux kernel 5.15.0-73-generic, ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots

What ui mode are you using?

CLI

Relevant log output

2023-05-29T20:45:32.552 full_node chia.full_node.mempool_manager: WARNING  pre_validate_spendbundle took 2.0414 seconds for xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2023-05-29T20:45:42.620 full_node chia.full_node.mempool_manager: WARNING  add_spendbundle xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx took 10.06 seconds. Cost: 2924758101 (26.589% of max block cost)
2023-05-29T20:45:56.840 full_node chia.full_node.full_node: WARNING  Block validation time: 2.82 seconds, pre_validation time: 2.81 seconds, cost: None header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732042
2023-05-29T20:46:57.239 full_node chia.full_node.full_node: WARNING  Block validation time: 3.34 seconds, pre_validation time: 0.42 seconds, cost: 3165259860, percent full: 28.775% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732044
2023-05-29T20:49:26.913 full_node chia.full_node.full_node: WARNING  Block validation time: 2.40 seconds, pre_validation time: 0.49 seconds, cost: 2041855544, percent full: 18.562% header_hash: 8d0ce076a3270a0c8c9c8d1f0e73c9b5b884618ee34020d2a4f3ffafa459cfd0 height: 3732055
2023-05-29T20:51:06.259 full_node full_node_server        : WARNING  Banning 89.58.33.71 for 10 seconds
2023-05-29T20:51:06.260 full_node full_node_server        : WARNING  Invalid handshake with peer. Maybe the peer is running old software.
2023-05-29T20:51:27.986 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/chia/hdd23/plot-k32-c02-2023-04-23-someplot.plot. GRResult is not GRResult_OK.
2023-05-29T20:51:28.025 harvester chia.harvester.harvester: ERROR    File: /media/chia/hdd23/someplot.plot Plot ID: someplotID, challenge: 7b5b6f11ec2a86a7298cb55b7db8a016a775efea221104b37905366b49f2e2bd, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7f3544998f30>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: contractHash>, plot_public_key=<G1Element PlotPubKey>, file_size=92374601728, time_modified=1682261996.8218756)
2023-05-29T20:51:57.482 full_node chia.full_node.full_node: WARNING  Block validation time: 10.23 seconds, pre_validation time: 0.29 seconds, cost: 959315244, percent full: 8.721% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732059
2023-05-29T20:55:24.640 full_node chia.full_node.full_node: WARNING  Block validation time: 3.18 seconds, pre_validation time: 0.26 seconds, cost: 2282149756, percent full: 20.747% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732067
2023-05-29T20:56:01.825 wallet wallet_server              : WARNING  Banning 95.54.100.118 for 10 seconds
2023-05-29T20:56:01.827 wallet wallet_server              : ERROR    Exception Invalid version: '1.6.2-sweet', exception Stack: Traceback (most recent call last):
  File "chia/server/server.py", line 483, in start_client
  File "chia/server/ws_connection.py", line 222, in perform_handshake
  File "packaging/version.py", line 198, in __init__
packaging.version.InvalidVersion: Invalid version: '1.6.2-sweet'
robajr commented 12 months ago

K2200 and M4000 -- both encountering this on 2.1.1, in Machinaris in Docker: Ubuntu, GPUs passed through, 128 GB RAM, dual Broadwell Xeon

2023-10-18T16:21:53.039 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /plots1/compressed/p4326/plot-k32-c05-2023-10-18-04-42-66d6ebdfc578624feb473459531f668337987c26acf71b62a0cfc99f94269710.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory

Played with multiple decompressor parameters in config.yaml; they don't seem to change it.

harold-b commented 12 months ago

Not long ago I received this error when I was checking plots on a 1080 Ti. Was your error GRResult_OutOfMemory, like the user above?

harold-b commented 12 months ago

K2200 and M4000 -- both encountering this on 2.1.1, in Machinaris in Docker: Ubuntu, GPUs passed through, 128 GB RAM, dual Broadwell Xeon

2023-10-18T16:21:53.039 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /plots1/compressed/p4326/plot-k32-c05-2023-10-18-04-42-66d6ebdfc578624feb473459531f668337987c26acf71b62a0cfc99f94269710.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory

Played with multiple decompressor parameters in config.yaml; they don't seem to change it.

It seems strange that you got GRResult_OutOfMemory, as the devices you quoted have plenty of memory. Less than 1 GB of memory ought to be used during farming, I believe, and it is allocated only once -- unless the device was lost, in which case it can't recover until the process is restarted.

I believe there's a configuration value to preallocate all memory at launch; is this enabled?
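
For reference, the decompressor settings being discussed live under the harvester: section of config.yaml. A minimal sketch with the key names as I recall them from recent 2.x configs (values are illustrative only; verify against your own config.yaml -- I don't see an explicit preallocation toggle in the stock file):

harvester:
  parallel_decompressor_count: 0      # CPU decompressor instances
  decompressor_thread_count: 0        # threads per decompressor; 0 lets the harvester choose
  disable_cpu_affinity: false
  max_compression_level_allowed: 7    # plots above this level are ignored
  use_gpu_harvesting: true            # route decompression through the GPU (Green Reaper)
  gpu_index: 0
  enforce_gpu_index: false
  decompressor_timeout: 20            # seconds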

robcirrus commented 12 months ago

I have been tracking this for a while on 2 harvesters running Windows Server with a Tesla P4. I have re-checked the plots with chia plots check and replaced all plots with proof percentages < 80%. Both were getting the GRResult_OutOfMemory errors about every 1-4 days. One thing that I noticed lately: on one harvester I finished replotting to all C7 plots (3,512) last week; it had a combination of C5, C6, and C7 before. That harvester has NOT had the GRR error since 10/17/2023, after all plots were C7 only. Could it be that those having the GRR error are those with mixed compression sizes? The other one has mixed compression sizes. I will start making all of them the same.

4ntibala commented 12 months ago

I have been tracking this for a while on 2 harvesters running Windows Server with a Tesla P4. I have re-checked the plots with chia plots check and replaced all plots with proof percentages < 80%. Both were getting the GRResult_OutOfMemory errors about every 1-4 days. One thing that I noticed lately: on one harvester I finished replotting to all C7 plots (3,512) last week; it had a combination of C5, C6, and C7 before. That harvester has NOT had the GRR error since 10/17/2023, after all plots were C7 only. Could it be that those having the GRR error are those with mixed compression sizes? The other one has mixed compression sizes. I will start making all of them the same.

Interesting observation; unfortunately I cannot confirm it, though. I have solely C7 compressed plots and I do get the error 2-3 times per day.

larod commented 12 months ago

I can confirm this is still an issue; I replaced the GPU and got the same error with a new GPU.

[ 708.309913] NVRM: GPU at PCI:0000:42:00: GPU-e379e91d-f194-1726-bd40-64192f4269d2
[ 708.309921] NVRM: Xid (PCI:0000:42:00): 31, pid=2770, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f54_f52ce000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

OS: Ubuntu 22.04.3 LTS x86_64 Kernel: 5.15.0-87-generic Memory: 5.60GiB / 314.81GiB Chia version: 2.1.1 NVIDIA Tesla P4 Driver Version: 535.104.12 CUDA Version: 12.2

For folks looking for a temporary fix: tail -n 0 -f /var/log/syslog | grep -E "Fault: ENGINE GRAPHICS GPCCLIENT" && chia start harvester -r

The command above will restart the Chia harvester when the fault occurs, so you can continue farming. This is a tricky bug because, without looking at the logs, it will look like your harvester is sending partials; in reality, once this fault occurs your harvester will not find any partials, even though it looks like it is still submitting them.
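
One note on that one-liner: grep normally keeps reading until its input ends, so with a following tail the part after && may never fire. A sketch that quits on the first match and then keeps watching could look like this (adjust the syslog path to your distro):

while true; do
  # block until the GPU fault appears in syslog (grep -q exits on the first match)
  if tail -n 0 -F /var/log/syslog | grep -q "Fault: ENGINE GRAPHICS GPCCLIENT"; then
    chia start harvester -r
    sleep 5
  fi
done

This is essentially what the fuller script shared later in this thread does, with logging added.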

jpabon77 commented 11 months ago

Moved to a 3060 Ti and have not had the issue once. It is for sure the Tesla P4 card. Also, I noticed that after moving to the 3060 Ti I am getting more proofs with the same number of plots.

Jahorse commented 11 months ago

I'm having the issue using a GeForce GTX 1070, it happened about 3 hours after switching from CPU to GPU harvesting for the first time. I just replotted over 600 TiB with this GPU without any problems.

dmesg error:

[195521.510258] NVRM: GPU at PCI:0000:01:00: GPU-267e684e-deec-6c64-7a9f-14a1280b91d2
[195521.510267] NVRM: Xid (PCI:0000:01:00): 31, pid=1062104, name=chia_harvester, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fcb_ef689000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

First error in Chia log:

2023-11-15T14:58:19.424 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /mnt/plotDrive39/compressed/plot-k32-c07-2023-10-06-00-58-xxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-11-15T14:58:19.425 harvester chia.harvester.harvester: ERROR    File: /.../plot-k32-c07-2023-10-06-00-58-xxx.plot Plot ID: xxx, challenge: xxx, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7fc73e35c3b0>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: xxx>, plot_public_key=<G1Element xxx>, file_size=83738042368, time_modified=1696572303.4059613)

OS: Ubuntu 22.04.3 LTS Kernel: 6.2.0-36-generic CPU: AMD Ryzen Threadripper 3960X RAM: 160 GB DDR4

GPU: GeForce GTX 1070 Nvidia Driver: 545.23.06 CUDA Version: 12.3

Chia version: 2.1.2.dev0 (built from source at 2.1.1 tag) Plots: 7966 C7, 12 C6, 12 C4

larod commented 11 months ago

I'm having the issue using a GeForce GTX 1070, it happened about 3 hours after switching from CPU to GPU harvesting for the first time. I just replotted over 600 TiB with this GPU without any problems.

dmesg error:

[195521.510258] NVRM: GPU at PCI:0000:01:00: GPU-267e684e-deec-6c64-7a9f-14a1280b91d2
[195521.510267] NVRM: Xid (PCI:0000:01:00): 31, pid=1062104, name=chia_harvester, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fcb_ef689000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

First error in Chia log:

2023-11-15T14:58:19.424 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /mnt/plotDrive39/compressed/plot-k32-c07-2023-10-06-00-58-xxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-11-15T14:58:19.425 harvester chia.harvester.harvester: ERROR    File: /.../plot-k32-c07-2023-10-06-00-58-xxx.plot Plot ID: xxx, challenge: xxx, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7fc73e35c3b0>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: xxx>, plot_public_key=<G1Element xxx>, file_size=83738042368, time_modified=1696572303.4059613)

OS: Ubuntu 22.04.3 LTS Kernel: 6.2.0-36-generic CPU: AMD Ryzen Threadripper 3960X RAM: 160 GB DDR4

GPU: GeForce GTX 1070 Nvidia Driver: 545.23.06 CUDA Version: 12.3

Chia version: 2.1.2.dev0 (built from source at 2.1.1 tag) Plots: 7966 C7, 12 C6, 12 C4

I have several farmers and have tried several GPUs, and every single one of them gets this error every couple of days. I wrote a script that monitors the logs and restarts the harvester every time it sees that error; I run it as a service in Ubuntu and it works like a charm. If you need it, I'll share it.

Jahorse commented 11 months ago

Thanks for the offer, but I wrote up a quick bash script right away to monitor the syslog file. I also wrote one to poll my GPU stats every minute to see if I could catch any odd resource usage when the issue happened again. It happened again overnight, and the error looked a little different in my syslog. I noticed that the first time it happened it was GPCCLIENT_T1_0 and this time GPCCLIENT_T1_1. I haven't looked up what that means, but the fault type is the same.

Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.006981] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504224=0x80000000 0x504228=0x0 0x50422c=0x0 0x504234=0x0
Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.007010] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, NVRM: Graphics TEX Exception on (GPC 0, TPC 0):     TEX NACK / Page Fault
Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.007031] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504224=0x80000041 0x504228=0x180008 0x50422c=0xfdf298e0 0x504234=0x1fc0
Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.007134] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.007146] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504648=0x103000e 0x504650=0x0 0x504644=0xd3eff2 0x50464c=0x17f
Nov 15 22:43:47 jahorse-tr-3960x kernel: [223476.008288] NVRM: Xid (PCI:0000:01:00): 31, pid=1163452, name=chia_harvester, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7ffd_f298e000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

My GPU polling script didn't detect anything; it's using the gpustat program, which is just a wrapper around nvidia-smi. These are the log entries around the fault:

jahorse-tr-3960x            Wed Nov 15 22:42:44 2023  545.23.06
[0] NVIDIA GeForce GTX 1070 | 48°C,   0 % |   574 /  8192 MB | chia_harvester/1163452(514M) Xorg/12678(46M) gnome-shell/13020(8M)
jahorse-tr-3960x            Wed Nov 15 22:43:44 2023  545.23.06
[0] NVIDIA GeForce GTX 1070 | 49°C,   0 % |   574 /  8192 MB | chia_harvester/1163452(514M) Xorg/12678(46M) gnome-shell/13020(8M)
jahorse-tr-3960x            Wed Nov 15 22:44:44 2023  545.23.06
[0] NVIDIA GeForce GTX 1070 | 46°C,   0 % |   574 /  8192 MB | chia_harvester/1163452(514M) Xorg/12678(46M) gnome-shell/13020(8M)

It just continued on like that, unchanged, while the problem with the Chia harvester persisted. My script to restart the Chia harvester didn't work because it was looking for GPCCLIENT_T1_0, which didn't appear this time, so it spent quite a bit of time in the bad state.
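
For anyone who wants to replicate that polling, a minimal sketch (assumes gpustat is installed, e.g. via pip install gpustat; the log path and interval are arbitrary):

# append one GPU snapshot per minute so resource usage around a fault can be reviewed later
while true; do
  gpustat --no-color >> "$HOME/gpu_poll.log"
  sleep 60
done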

Zealot88 commented 10 months ago

I'm having the issue using a GeForce GTX 1070, it happened about 3 hours after switching from CPU to GPU harvesting for the first time. I just replotted over 600 TiB with this GPU without any problems. dmesg error:

[195521.510258] NVRM: GPU at PCI:0000:01:00: GPU-267e684e-deec-6c64-7a9f-14a1280b91d2
[195521.510267] NVRM: Xid (PCI:0000:01:00): 31, pid=1062104, name=chia_harvester, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fcb_ef689000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

First error in Chia log:

2023-11-15T14:58:19.424 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /mnt/plotDrive39/compressed/plot-k32-c07-2023-10-06-00-58-xxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-11-15T14:58:19.425 harvester chia.harvester.harvester: ERROR    File: /.../plot-k32-c07-2023-10-06-00-58-xxx.plot Plot ID: xxx, challenge: xxx, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7fc73e35c3b0>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: xxx>, plot_public_key=<G1Element xxx>, file_size=83738042368, time_modified=1696572303.4059613)

OS: Ubuntu 22.04.3 LTS Kernel: 6.2.0-36-generic CPU: AMD Ryzen Threadripper 3960X RAM: 160 GB DDR4 GPU: GeForce GTX 1070 Nvidia Driver: 545.23.06 CUDA Version: 12.3 Chia version: 2.1.2.dev0 (built from source at 2.1.1 tag) Plots: 7966 C7, 12 C6, 12 C4

I have several farmers and have tried several GPUs, and every single one of them gets this error every couple of days. I wrote a script that monitors the logs and restarts the harvester every time it sees that error; I run it as a service in Ubuntu and it works like a charm. If you need it, I'll share it.

Hi, I'm getting this on three different GPUs, 980 Ti, 1030, and a P4 8GB. Please share the script, I was working on something like this, but got stuck. Appreciate it. Using Ubuntu 22.04.

Zealot88 commented 10 months ago

I'm having the issue using a GeForce GTX 1070, it happened about 3 hours after switching from CPU to GPU harvesting for the first time. I just replotted over 600 TiB with this GPU without any problems. dmesg error:

[195521.510258] NVRM: GPU at PCI:0000:01:00: GPU-267e684e-deec-6c64-7a9f-14a1280b91d2
[195521.510267] NVRM: Xid (PCI:0000:01:00): 31, pid=1062104, name=chia_harvester, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fcb_ef689000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

First error in Chia log:

2023-11-15T14:58:19.424 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /mnt/plotDrive39/compressed/plot-k32-c07-2023-10-06-00-58-xxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-11-15T14:58:19.425 harvester chia.harvester.harvester: ERROR    File: /.../plot-k32-c07-2023-10-06-00-58-xxx.plot Plot ID: xxx, challenge: xxx, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7fc73e35c3b0>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: xxx>, plot_public_key=<G1Element xxx>, file_size=83738042368, time_modified=1696572303.4059613)

OS: Ubuntu 22.04.3 LTS Kernel: 6.2.0-36-generic CPU: AMD Ryzen Threadripper 3960X RAM: 160 GB DDR4 GPU: GeForce GTX 1070 Nvidia Driver: 545.23.06 CUDA Version: 12.3 Chia version: 2.1.2.dev0 (built from source at 2.1.1 tag) Plots: 7966 C7, 12 C6, 12 C4

I have several farmers and have tried several GPUs, and every single one of them gets this error every couple of days. I wrote a script that monitors the logs and restarts the harvester every time it sees that error; I run it as a service in Ubuntu and it works like a charm. If you need it, I'll share it.

Hi @larod Would you mind sharing the script? I run Ubuntu 22.04.03 Thanks!

larod commented 10 months ago

Hi, you can run the script below in a screen session or create a unit file to run it in the background (a sketch of such a unit is included after the script). Hope this helps!

#!/bin/bash

# Define the log file path in your home directory
LOG_FILE="/root/start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="Fault: ENGINE GRAPHICS GPCCLIENT"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log the result in the log file, I'm also sending myself an email, but this is optional
    echo "$CURRENT_TIME Restarting Chia Harvester Service... [$RESTART_RESULT]" | mail -s "OUTLANDS: Chia Harvester Restarted" [youremail@gmail.com](mailto:youremail@gmail.com)
    echo "$CURRENT_TIME Restarting Chia Harvester Service... [$RESTART_RESULT]" >> "$LOG_FILE"

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
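
If you go the unit-file route instead of screen, here is a minimal systemd sketch (assuming the script above is saved as /root/restart_harvester.sh and made executable; adjust the path and user for your setup):

[Unit]
Description=Restart Chia harvester on GPU GRResult faults
After=network-online.target

[Service]
Type=simple
ExecStart=/root/restart_harvester.sh
Restart=always

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/chia-harvester-watchdog.service, then run systemctl daemon-reload and systemctl enable --now chia-harvester-watchdog.service.
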
Zealot88 commented 10 months ago

create a unit file to run it in the background

Thank you! Testing it now. On my P4 it doesn't happen that often, but on the others it happened too frequently, so I took them out and farm with the CPU. Thanks again Luis.

RATTL3R316 commented 10 months ago

Same thing here; it can be days or minutes. Z840, dual E5-2699 v3, 512 GB DDR4 with a 3060 Ti. I can plot all day with this setup, but finish plotting and attempt to farm: CRASH! And as one person said, this is a sneaky bug. It looks like things are running fine and plots are passing the filter, but no proofs. You go into the logs and the thing stopped working 12 hours ago, lol. Maddening. Is this even being looked at? I know they had a layoff. I guess Gigahorse is going to be the answer?

Jahorse commented 10 months ago

I did try to trace the error messages to see where this is coming from, and it's definitely coming from Green Reaper (GR), which is the part of Bladebit used for harvesting compressed plots. It has been a few weeks since I looked, so I can't remember exactly where it originates, but GR allocates a specific amount of GPU memory for its use, and I believe this failure happens when a command GR tries to run throws a page fault in that memory. It seems odd to me that it's throwing a page fault in reserved memory when the device likely has lots of free memory available.

I tried to force GR to reserve extra memory in the GPU, but I was unsuccessful. There are a lot of calculations going on in GR to determine how much memory to allocate, and GR is written as a C++ package, which is a bit removed from the Python package that makes up the Chia harvester. One of the inputs to the required-memory calculation is the maximum compression level allowed, so rather than trying to modify the logic in GR, I removed the hard-coded check limiting compression to level 7 and set it to level 8. My hope was that this would force GR to reserve more memory, but GR threw errors during startup.

I wonder if GR's inability to harvest plots with a compression level greater than 7 is what keeps the max supported compression at 7, and whether lifting that limitation might also solve this issue.

harold-b commented 10 months ago

Is this even being looked at?

It certainly has not been forgotten, but I am currently a bit preoccupied with the CHIP-22 work. I'm hoping to be able to come back and look at it soon. I appreciate all the feedback.

timolow commented 9 months ago

Same issue: Chia 2.1.4 Debian Bookworm Quadro P2000 Driver Version: 525.147.05 CUDA Version: 12.0 6.1.0-17-amd64

bryankr commented 9 months ago

Same issue: Chia 2.1.4 Ubuntu 22.04 NVIDIA P4 Driver Version: 535.146.02 CUDA Version: 12.2 Xeon E5-2620 v3 @ 2.40GHz HP Proliant ML150

And I'm also getting another silent GPU failure now and again that only shows up in dmesg as:

[273333.524849] NVRM: Xid (PCI:0000:03:00): 31, pid=166747, name=chia_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f6e_1ed8b000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

It happens approximately once per day on average; for both issues I auto-detect the error in the logs/dmesg and restart the harvester with a bash cron script.

spegelius commented 8 months ago

Happening randomly, mostly when running chia plots check, but also during farming. Chia 2.1.4; Ubuntu 22.02 and Windows 10; drivers 546.17 for Windows, 525.x and 535.x for Linux. Multiple 1050 4 GB cards and one 1030 2 GB affected. Also two 3060 cards, but those don't have any issues. Surprisingly, there's no difference between the 1030 and the 1050 with regard to this crash...

spegelius commented 8 months ago

Well it seems that c07 plots are causing this so I replotted to level c05 and all is good.

Jahorse commented 8 months ago

Well it seems that c07 plots are causing this so I replotted to level c05 and all is good.

This kind of lines up with my suspicion that something is a bit off with the calculation for the required memory at C7. It would be nice if somebody who knew what they were doing could try to increase the allocations a bit.

timolow commented 8 months ago

I am using c05 and am experiencing this issue; might it be the quantity of plots?

spegelius commented 8 months ago

I am using c05 and am experiencing this issue; might it be the quantity of plots?

Hard to say. c07 doesn't seem to immediately cause this error: one of my harvesters with a GTX 1050 and around 50 TB of c07 plots (with 50 TB of c05 mixed in) has seen multiple days without errors but can freak out in less than a day. Also, running a plot check on those plots might hit this error in one run and pass in the next, so something random seems to be happening. Removing all c07 plots in my case seems to have fixed the situation, but it could be that the odds of it happening are just much smaller... Also, I was wondering if the number of proofs found could affect this? And does the decompressor_thread_count setting affect the GPU decompressor?

4ntibala commented 8 months ago

Don't know if this info is helpful, but I had my C7 farm running on an old i7 for a few months now. The GPU error occurred usually 1-2 times per day, not much more, sometimes even less. Recently I changed to a workstation, an old D30, running the same OS and the same GPU.

What changed is that the GPU error now shows up multiple times per day, sometimes even 5 or 6 times.

Interestingly, even though the error rate is now higher, I see fewer stale partials.

This might be related to a Chia client update, or not; I don't know. I just thought to share this observation: same GPU, different fail rates.

Linux Mint 21.2 - full node Lenovo D30 Workstation 256 GB RAM NVIDIA GeForce GTX 1050 Ti - 4 GB Driver Version: 535.86.10 CUDA Version: 12.2 Compression: C7 Plots: 4534 Chia Version: 2.1.4

jeancur commented 8 months ago

Found this as well: GRResult is not GRResult_OK.

It repeats three times for the same plot, then occurs on another plot some time later. None for days from 04-Feb to 13-Feb, then a whack of them on 14-Feb, then a few the next day, then nothing for the next two days. The system, a 1910 Threadripper, does nothing but farm.

Would this affect farming block wins?

OS: Ubuntu 20.04.3 LTS x86_64 Kernel: 5.15.0-94-generic Memory: 5.90GiB / 16.0 GiB Chia version: 2.1.4 farming only 10,000 C7 Plots NVIDIA M2000 4Gb, CUDA Version: 5.2 Driver Version: 535.154.05

djerfy commented 7 months ago

Same problem here (one time):

Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362629] NVRM: GPU at PCI:0000:01:00: GPU-863a3809-614b-4d4d-5f2c-3e071e56b7bb
Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362633] NVRM: GPU Board Serial Number: 0421619095918
Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362634] NVRM: Xid (PCI:0000:01:00): 31, pid=2905, name=chia_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7fe2_25685000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Thanks @larod for your script

OS: Ubuntu 22.04.4 LTS x86_64 Kernel: 5.15.0-97-generic Memory: 9.06GiB / 16.0 GiB Chia version: 2.1.4 farming only 2078 plots (C0 23%, C5 77%) NVIDIA TESLA P4 8Gb, CUDA Version: 12.2 Driver Version: 535.161.07

Nuke79 commented 7 months ago

Chia 2.2.0 is live. Can someone confirm whether the bug is fixed? I didn't see anything about it in the patch notes.

Proace1 commented 7 months ago

Version 2.2.0 does not do GPU farming at all, no matter which graphics card is used. I have tried an RTX 2080 Super, an RTX 3060, and a GT 1030. In version 2.1.4 it works without any problems. The bug is, of course, still included.

Nuke79 commented 7 months ago

Version 2.2.0 does not do GPU farming at all, no matter which graphics card is used. I have tried an RTX 2080 Super, an RTX 3060, and a GT 1030. In version 2.1.4 it works without any problems. The bug is, of course, still included.

2.2.0 GPU farming works fine for me. No bug has occurred yet (about 3 hours).

P.S. The bug is still present. Same error on 2.2.0: GRResult is not GRResult_OK, received GRResult_OutOfMemory. Nvidia GTX 1070 / 8 GB VRAM.

GolDenis72 commented 6 months ago

The same problem:

2024-03-22T01:42:57.016 harvester chia.harvester.harvester: ERROR Exception fetching full proof for xxxxxxxxxxxxxxxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory

Every 3-6 hours. ~6500 plots, Win 10, Nvidia GTX 1060/8Gb vRAM

mehditlili commented 6 months ago

How hard is it for the devs to get a 1060 and reproduce the problem locally? Just blindly upgrading the version with unrelated fixes and asking people to test it for you is not very professional. Chia is open source and that is nice and we are grateful, but you are paid and this is your job, so please do it.

spegelius commented 6 months ago

Converting c07 -> c05 didn't seem to help; one Linux machine with a 1050 4 GB still occasionally has this problem. Interestingly, the Win10 machines with a 1050 and a 1030 are much more stable. I wonder if the env settings from ETH mining are affecting this (GPU_MAX_ALLOC_PERCENT etc.); I have those only on the Windows machines.

GolDenis72 commented 6 months ago

"How hard is it for devs to get a 1060 and reproduce" Hmmmmm.... about doezn of (OLD!) VCU brands, a few windows adition. Who will care about that? Just instal restart script & forget. Regards.

Proace1 commented 6 months ago

@GolDenis72 what kind of script?

GolDenis72 commented 6 months ago

https://github.com/Chia-Network/chia-blockchain/issues/15404#issuecomment-1826150694 works fine for me, even on Win 10 via Git Bash. A few corrections were made (e-mail sending, for example), but it's mainly as in the original. P.S. Change LOG_MESSAGE="Fault: ENGINE GRAPHICS GPCCLIENT" to LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory", and give it the real path to debug.log; if that message is found in debug.log, it restarts ONLY the harvester. Superbly useful! Error found -> harvester restarted (no sync problems, no lost connections, etc.). Very nice!

Daivis88 commented 6 months ago

@larod @GolDenis72

#15404 (comment) works fine for me, even on Win 10 via Git Bash. A few corrections were made (e-mail sending, for example), but it's mainly as in the original. P.S. Change LOG_MESSAGE="Fault: ENGINE GRAPHICS GPCCLIENT" to LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory", and give it the real path to debug.log; if that message is found in debug.log, it restarts ONLY the harvester. Superbly useful! Error found -> harvester restarted (no sync problems, no lost connections, etc.). Very nice!

Hi. I'm very new to this. Maybe you could tell me where I made a mistake?

$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:\Users\2010m\.chia\mainnet\log\debug.log"

# The specific log message to look for
LOG_MESSAGE="Fault: GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done

2024-04-15 20:41:21 Starting to monitor syslog...
tail: cannot open '/var/log/syslog' for reading: No such file or directory

I have a sense that it's not working because I'm on Windows 10, but then you said that you are using Win 10 too. I really need this to work; my system needs restarting every 6 hours or so...

Please help.

GolDenis72 commented 6 months ago

Hi! "Define the log file path in your home directory" means the log file for the script's own output (in my case LOG_FILE="c:/Users/denis/.chia/mainnet/log/start_harvester.log").

Under "Use tail to monitor the log file and grep for the log message", the line if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then does not work on Windows, so we need to put the real path to the Chia log file there:

if tail -n 0 -F c:/Users/denis/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

Good luck!

Daivis88 commented 6 months ago

Hi! "Define the log file path in your home directory" means the log file for the script's own output (in my case LOG_FILE="c:/Users/denis/.chia/mainnet/log/start_harvester.log").

Under "Use tail to monitor the log file and grep for the log message", the line if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then does not work on Windows, so we need to put the real path to the Chia log file there:

if tail -n 0 -F c:/Users/denis/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

Good luck!

@GolDenis72 Hi again. I have tried it and it didn't work; it looks like this now. Any ideas?


2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:\Users\2010m\.chia\mainnet\log\Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="Fault: GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:\Users\2010m\.chia\mainnet\log\debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-19 23:40:15 Starting to monitor syslog...
tail: cannot open 'C:Users2010m.chiamainnetlogdebug.log' for reading: No such file or directory

GolDenis72 commented 6 months ago

Your computer is right! Slash "/", NOT backslash "\"! Check your paths again. Mine: c:/Users/denis/.chia/mainnet/log/debug.log; yours: C:\Users\2010m\.chia\mainnet\log\debug.log. See the difference? => "/" NOT "\". Good luck!

Daivis88 commented 5 months ago

Your computer is right! Slash "/", NOT backslash "\"! Check your paths again. Mine: c:/Users/denis/.chia/mainnet/log/debug.log; yours: C:\Users\2010m\.chia\mainnet\log\debug.log. See the difference? => "/" NOT "\". Good luck!

Oh my god... how did I miss that... I'm so grateful you pointed it out to me. It looks like it's running now; we'll see if it works. Thanks again.

Daivis88 commented 5 months ago

@GolDenis72 Hi again. :)

It looks like the script monitors the log, but when it found the error the harvester didn't restart; it says command not found... Any ideas?

2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:/Users/2010m/.chia/mainnet/log/Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:/Users/2010m/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-20 20:42:54 Starting to monitor syslog...
bash: chia: command not found
bash: logger: command not found
2024-04-21 18:01:36 Starting to monitor syslog...

GolDenis72 commented 5 months ago

And again, your computer is right: command not found! :-) You are trying to transfer the script directly from Linux to Windows, and you need to adapt it. For example, "chia start harvester -r" assumes the system knows WHERE chia.exe is (the path to the Chia directory was added to the system PATH beforehand), OR that you started the script from the chia.exe directory. If not, put the FULL PATH to chia.exe in the script, like (in my case): c:/Users/denis/AppData/Local/Programs/ChiaFox/resources/app.asar.unpacked/daemon/chia.exe start harvester -r. Keep trying! Good luck!

Daivis88 commented 5 months ago

And again, your computer is right: command not found! :-) You are trying to transfer the script directly from Linux to Windows, and you need to adapt it. For example, "chia start harvester -r" assumes the system knows WHERE chia.exe is (the path to the Chia directory was added to the system PATH beforehand), OR that you started the script from the chia.exe directory. If not, put the FULL PATH to chia.exe in the script, like (in my case): c:/Users/denis/AppData/Local/Programs/ChiaFox/resources/app.asar.unpacked/daemon/chia.exe start harvester -r. Keep trying! Good luck!

@GolDenis72 I did it and it came up with an error again... Please don't judge too harshly...


2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:/Users/2010m/.chia/mainnet/log/Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:/Users/2010m/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-24 12:10:12 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:33:07 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:42:07 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:45:39 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:48:50 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found

GolDenis72 commented 5 months ago

Check that path twice: "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe". I think something is wrong with it.

GolDenis72 commented 5 months ago

There are 2 chia executables in the system (don't ask me why); you need to find the right one. Just check it with a simple chia command first (like chia -h) to be sure you have found the right one. After that, put the FULL path to that exe file into the script. Hmm... not sure about logger; it looks like I commented that line out.

Daivis88 commented 5 months ago

Check that path twice: "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe". I think something is wrong with it.

The path is right... I'm thinking it must be something about the two words, "Program Files"...

Daivis88 commented 5 months ago

@GolDenis72 It finally works... so the issue was the space in the two words in C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe. As soon as I put it in quotes, "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe", it found the path. So happy that it works; I really appreciate your help, thank you for the patience. I can sleep without a worry now :D
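
For reference, here is a consolidated sketch of the Windows (Git Bash) variant this thread converged on. The debug.log and chia.exe paths are the ones quoted above and will differ per install; the mail and logger calls are dropped because they are not available in Git Bash by default:

#!/bin/bash
# Windows (Git Bash) adaptation of larod's restart script, per the fixes discussed above.
LOG_FILE="C:/Users/2010m/.chia/mainnet/log/start_harvester.log"                # script's own log
CHIA_LOG="C:/Users/2010m/.chia/mainnet/log/debug.log"                          # Chia debug log to watch
CHIA_EXE="C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe"   # quoted: the path contains a space
LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"

while true; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') Monitoring $CHIA_LOG..."
  # Block until the error shows up in the Chia log, then restart the harvester.
  if tail -n 0 -F "$CHIA_LOG" | grep -q "$LOG_MESSAGE"; then
    sleep 5
    if "$CHIA_EXE" start harvester -r; then
      RESULT="Done"
    else
      RESULT="Failed"
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') Restarting Chia Harvester... [$RESULT]" >> "$LOG_FILE"
  fi
done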

emlowe commented 3 months ago

Closing issue - at this time resources are being directed to a new plot format

larod commented 3 months ago

This is quite unfortunate: after waiting a year, the problem still persists. Wow.