Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[BUG] one or several plotting processes freeze #5552

Closed · anufriu closed this issue 3 years ago

anufriu commented 3 years ago

Describe the bug
1st: a plot stalls while the plotting process is in phase 3 (probably only on an NVMe drive). The plot log just stops at "Computing ..." or some other step and nothing happens for 6+ hours, yet the process still shows normal memory/CPU consumption and the disk usage of the tmp dir stays flat, so you have to kill the plotting process and delete the tmp files by hand.
2nd: the plot process fails with an exception:

Caught plotting error: Matches do not match with number of write entries 4293904004 4293903971
Traceback (most recent call last):
  File "/home/losb/chia-blockchain/venv/bin/chia", line 33, in <module>
    sys.exit(load_entry_point('chia-blockchain', 'console_scripts', 'chia')())
  File "/home/losb/chia-blockchain/chia/cmds/chia.py", line 77, in main
    cli()  # pylint: disable=no-value-for-parameter
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/losb/chia-blockchain/venv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/losb/chia-blockchain/chia/cmds/plots.py", line 135, in create_cmd
    create_plots(Params(), ctx.obj["root_path"])
  File "/home/losb/chia-blockchain/chia/plotting/create_plots.py", line 176, in create_plots
    args.nobitfield,
RuntimeError: std::exception

You also have to search for the plot names in the logs and delete the tmp files yourself.

To Reproduce
Steps to reproduce the behavior:

  1. I have no idea.
  2. It happens the same way on Windows and Linux machines.

Desktop (please complete the following information):

  • OS: Ubuntu Server 20
  • OS Version/Flavor: kernel 5.4.0-73-generic
  • CPU: AMD Ryzen 7 3700X 8-Core Processor

Additional context
It is hard to detect when a plotting process has stalled for good, but I am lucky to have Prometheus/Grafana metrics that show it. Otherwise the SSD temp dir would fill up with these garbage plot fragments and stop working entirely. This is very frustrating, because I always have to watch the metrics and clean up the bad, partial plot files by hand, and my plots per day are affected by this issue.

anufriu commented 3 years ago

The proof for the 2nd case is probably at https://github.com/Chia-Network/chia-blockchain/issues/4989#issuecomment-841563610

anufriu commented 3 years ago

https://github.com/Chia-Network/chia-blockchain/issues/4118 looks like my 1st issue too; I'll try to run a memtest when I have an opportunity.

room101-dev commented 3 years ago

Caught plotting error: Matches do not match with number of write entries 4293904004 4293903971

As soon as you see that message, it means the NVMe used for plotting is dead. This is not about RAM; the problem is that the random-access part of the NVMe controller has died. At the very end of the plot, chiapos compares these two tallies, and if they don't match, the plotter aborts and prints this message. It's too late for this NVMe; you're hosed, so get a new one and begin again. In the future, run fstrim on your NVMe frequently, monitor the temperature, and don't let the controller temperature go above 90 C.

The problem is that there is NO monitoring code in chiapos (chia plots create .. -t .. -2 ..), so it just keeps burning up the NVMe until it dies. There needs to be a check in the C++ to watch for NVMe failure; the data is already there in sudo nvme smart-log /dev/nvmeNNN. Look at the controller busy time: you will see it start skyrocketing right at the time you get the infamous "mismatch" error.
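A rough way to keep an eye on those counters while plotting is a watch loop over the SMART log. This is only a sketch: it assumes nvme-cli is installed, sudo credentials are cached, and /dev/nvme0 is the plotting drive (adjust the device name for your system).

    # Print temperature and controller busy time every 30 seconds.
    # /dev/nvme0 is a placeholder for your plotting NVMe.
    watch -n 30 "sudo nvme smart-log /dev/nvme0 | grep -E 'temperature|controller_busy_time|percentage_used'"

A sudden jump in controller_busy_time while write throughput drops is the pattern described above.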

I reported this problem to the devs months ago, and they just wrote it off as a RAM problem. It is not. How did I find out? I have three NVMes running on my machine, and only one of them had the problem; that told me it wasn't a RAM problem.

room101-dev commented 3 years ago

What can you monitor on your next new NVMe, given that you have destroyed this one?

Tools on Linux:

Use "Watch sudo iostat", you should see 500MB/sec all the time, once it goes to zero, it means the NVME has quit writing, time to investigate.

USE "Sudo NVME smart-log /dev/nvmeXXX" to get the logs and monitor temp and critical stuff, I find a strong correlation with 'controller busy', also on this log there is TBW most drives have a warranty of 600, IMHO most of my NVMES die at 150TBW, or 25% of warranty, once your NVME gets near that amount, its time to re-deploy, and use it for a less critical task

Keep htop running and watch the CPU. Another telltale sign of NVMe failure is that CPU usage skyrockets while iostat drops to zero. That happens because the chiapos plotting software has no diagnostics and no error handling to end the process: once the NVMe says "I'm done, I can't do this anymore", the plotting software just keeps running; it doesn't care.

Only you can care: monitor the process, automate it with bash, and when the pattern above appears, terminate that process yourself. Don't kill your NVMe drives just because the dev team doesn't give a damn about you.
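Here is a minimal sketch of what such a bash watchdog could look like. It is not from the Chia codebase: the device name, the check interval, the threshold, and the pkill pattern are all assumptions you would adapt to your own setup.

    #!/usr/bin/env bash
    # Watchdog sketch: if the temp NVMe stops being written to, kill the plotter.
    DEV=nvme0n1            # assumed temp-drive device name
    CHECK_INTERVAL=300     # seconds between checks
    MIN_SECTORS=1000       # fewer sectors written than this per interval counts as stalled

    while true; do
        before=$(awk '{print $7}' /sys/block/$DEV/stat)   # field 7 = sectors written
        sleep "$CHECK_INTERVAL"
        after=$(awk '{print $7}' /sys/block/$DEV/stat)
        if [ $((after - before)) -lt "$MIN_SECTORS" ]; then
            echo "$(date): $DEV looks stalled, terminating plot process" >&2
            pkill -f "plots create"   # or kill the specific PID you recorded at launch
        fi
    done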

I'm so bored with the 'it's RAM' response we keep getting from the Chia team; it's software-management incompetence.

room101-dev commented 3 years ago

Here's the best advice for anybody with this "mismatch" problem (it's NOT a RAM problem, Mr. Hoffman, I told you months ago):

Use Linux and use the command line to deploy. Don't run more than 4 plots per NVMe; whether it's 1 TB or 2 TB, it's the same controller. This is not a memory problem in the NAND, it's a controller problem: the controller was never designed to handle the kind of abuse chiapos dishes out.

Linux is #1; the tools are there.

The command line is #1; the Chia GUI is not worth using.

Don't use 'plotman' or any of these so-called management tools: they'll clobber your YAML files and they'll kill your NVMes. The people who wrote these tools are just script kiddies automating things they didn't even test. One time I ran 'plotman interactive' and it brought an entire machine down and even started plotting across the network (LAN), even though they claimed it was "passive".

It's as if all these third-party tools and the Chia team are working with Samsung to kill NVMes so they can all raise prices 10x, just like GPUs; there is no other explanation for this gross incompetence.

What I mean by command-line Linux plotting: you run chia plots create -t <temp> -2 <temp> -d <final> and stagger, say, four of them by at least 30 minutes (the time to get through phase 1, or to write out the plot in phase 4).

That's it, don't do anything else. If you must plot 8 in parallel, use two NVMes. IMHO 4 in parallel gives 12-18 plots per day. I have three NVMes on my plotter, which seems to be the max: two in the two M.2 slots and one on a PCIe 4x adapter next to the CPU (the closest slot is 4x).
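As a concrete sketch of that staggered launch (the paths, -k/-r values, and log locations below are examples only, and it assumes the chia virtual environment is already activated):

    # Start four plots on one NVMe temp dir, 30 minutes apart.
    for i in 1 2 3 4; do
        nohup chia plots create -k 32 -r 2 \
            -t /mnt/nvme0/tmp -2 /mnt/nvme0/tmp -d /mnt/hdd/plots \
            > ~/plot_$i.log 2>&1 &
        sleep 1800   # 30-minute stagger between starts
    done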

Keep a little fan on all your NVMe drives; I like to keep them below 50 C, and when an NVMe fails I usually see the controller above 90 C. I have asked Hoffman (dev management) to add a -maxtemp switch to chiapos, just like we have on GPU miners so we don't kill our cards, and they blew me off. I could do this myself, but since chia-dev doesn't care, write bash scripts yourself and monitor everything. You can get the PID, so you can kill a process that is destroying the NVMe before the NVMe is destroyed. You'll know when it's dead: from then on, every plot you run on that NVMe will terminate with this 'mismatch' error. You're screwed at that point, but you can reformat that NVMe and use it as a -2 temp dir; it appears that only the random-access part of the controller is killed, while sequential access still works.

Every time you run a batch of 4 plots on an NVMe, manually run sudo fstrim -v on the drive's mount point (fstrim operates on the mounted filesystem, not on the block device); this discards the freed blocks and keeps the NVMe plotting at max speed.
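For example (the partition name below is a placeholder, and the filesystem must be mounted for fstrim to work):

    # Find where the temp partition is mounted, then trim it.
    MOUNTPOINT=$(findmnt -n -o TARGET /dev/nvme0n1p1)
    sudo fstrim -v "$MOUNTPOINT"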

Once you get the mismatch error, forget about it: your drive is dead. If it's new, send it off for a warranty replacement and don't mention that you used it for plotting, though they can probably tell from the logs. In a short time, IMHO, Samsung and the rest will declare that plotting voids the warranty; they have to, because what chiapos does to a drive is insane.

anufriu commented 3 years ago

@room101-dev OK, thanks mate for clarifying; I'll check my NVMes.

github-actions[bot] commented 3 years ago

This issue has been flagged as stale as there has been no activity on it in 14 days. If this issue is still affecting you and in need of review, please update it to keep it open.

github-actions[bot] commented 3 years ago

This issue was automatically closed because it has been flagged as stale and subsequently passed 7 days with no further activity.