Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[BUG] Last Attempted Proof lookup times got worse after installing 1.2.1 / 1.2.2 #7513

Closed Jacek-ghub closed 3 years ago

Jacek-ghub commented 3 years ago

Describe the bug

I am using PSChiaPlotter to show heat maps of Last Attempted Proof timeouts (highly recommended for checking the health of your harvesters). Before upgrading to v1.2.1, out of about 10k attempts, I had maybe one or two timeouts over 2 seconds and none over 5 seconds (everything was basically green). After upgrading to v1.2.1, and today to v1.2.2, about one out of every 20-30 attempts is over 2 seconds, and about half of those are over 5 seconds (plenty of red). v1.2.2 is maybe 10% or so better than v1.2.1 (slightly fewer timeouts).

I am running it on Windows. I didn't reboot or touch the HDs when upgrading to those new Chia versions. That box runs a full node, but it is not plotting, just harvesting, and it also lets one extra harvester connect to it.

To Reproduce

Steps to reproduce the behavior:

You would need to install PSChiaPlotter and use the harvester heat map feature (Start-ChiaHarvesterWatcher -MaxLookUpSeconds 2 -DebugLogFilePath \\host\log_folder\debug.log). However, you would also need pre-v1.2.1 logs written at INFO level, so the output covers both the before and after periods. It can also be done manually, but that takes a bit of work to process the logs.
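For anyone who wants to do the manual check without PSChiaPlotter, here is a minimal Python sketch. It assumes the usual harvester INFO line ("N plots were eligible for farming ... Found X proofs. Time: Y s. Total Z plots") and the default log location; the exact wording can differ between chia versions, so adjust the regex if needed:

```python
import re
import sys
from pathlib import Path

# Assumed harvester INFO line, e.g.:
#   "2 plots were eligible for farming abc123... Found 0 proofs. Time: 0.12345 s. Total 104 plots"
LOOKUP_RE = re.compile(r"eligible for farming.*Time: (?P<secs>[\d.]+) s")

def lookup_times(log_path):
    """Yield every plot-lookup duration (seconds) found in a chia debug.log."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LOOKUP_RE.search(line)
            if match:
                yield float(match.group("secs"))

if __name__ == "__main__":
    default_log = Path.home() / ".chia" / "mainnet" / "log" / "debug.log"
    times = list(lookup_times(sys.argv[1] if len(sys.argv) > 1 else default_log))
    if not times:
        print("No lookup lines found - make sure the log level is set to INFO.")
        sys.exit(1)
    print(f"lookups: {len(times)}")
    print(f"over 2s: {sum(t > 2 for t in times)}")
    print(f"over 5s: {sum(t > 5 for t in times)}")
    print(f"worst:   {max(times):.2f}s")
```

Run it against the pre- and post-upgrade debug.log files and compare the counts.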

Expected behavior

Pre- and post-v1.2.1 averages for "Last Attempted Proof" lookup times should look about the same.

Screenshots

Desktop

Cumulo56 commented 3 years ago

Same problem here.

felipewove commented 3 years ago

I got the same issue.

Does rolling back to 1.2.0 solve it?

Jacek-ghub commented 3 years ago

Sorry, I have not tried rolling back to v1.2.0. I assumed v1.2.3 would come shortly and fix it.

By the way, did you use that PSChiaPlotter heat map module to check it, or did you check manually?

AndyRPH commented 3 years ago

https://github.com/Chia-Network/chia-blockchain/issues/7359#issuecomment-881730771

Jacek-ghub commented 3 years ago

@AndyRPH

I don't think this has anything to do with portable plots (but maybe I'm wrong here).

I have two Win10 boxes that are harvesting, and I run the PSChiaPlotter harvester heat map tool (Windows only) against both of them. The box that started having issues after upgrading to v1.2.0 (and v1.2.1/2) has a mix of plots (most are OG, just a handful are portable). On the other hand, on the box that has only portable plots, everything is green/good (though it has fewer plots).

Also, the box with the mixed plots (the one with timeouts) only harvests (but it is also a full node). The other box both plots (MadMax) and harvests. You can clearly see that when new plots are finished and being moved to their destinations, that 'green' box's lookup times increase, but not by much - it basically never turns orange/red (it just goes to yellow).

I would really suggest using those harvester heat maps: you don't need to hand-fish for the timeouts, a quick glance gives you the status of the boxes, and they provide plenty of history to compare against (that is how I could clearly see the change since upgrading - it started to turn red right away). The person behind that plot manager / harvester health check (@MrPig91) posted a nice video on how to use the heat maps. I hope it can help simplify nailing down this issue.

AndyRPH commented 3 years ago

I don't either, but what is the harvester data going to show if you have 0 proofs found? It was only when NFT plots came around that I noticed the issue, since they submit partial proofs to the pools, and my partials were taking 160 seconds.

Also, I'm not sure any of those tools work for me, since I edited my config to break each module of chia out into its own log file rather than one big debug.log.

Jacek-ghub commented 3 years ago

That tool checks the debug log you point it at and searches for lines with lookup times. I don't think it bothers to check anything else in those logs. So, potentially, if you just keep extracting/appending those lines to a new file, the tool may work. At least, I think it works like that. Again, the creator/owner of that utility is really helpful, so if it doesn't work this way, you are making a good case for modifying it (removing any other checks on the file).

Also, in my case, I share those logs over the network and run that tool against the harvesters (one instance per harvester), so I can quickly compare how the boxes behave - no need to go to each box and check it individually.

Maybe check that video I linked to and see what it does. If it seems reasonable but you have some extra requirements, just post on his GitHub Issues page and you will get a response rather quickly.

Update: What I meant to say is that the tool checks all lookups, not just those that found proofs.
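If the tool really only looks at those lookup lines, stitching split logs back into one file it can read might be as simple as something like this (the source paths are hypothetical, and "eligible for farming" is the assumed wording of the lookup line):

```python
import glob

# Hypothetical locations - point these at your split per-module logs
# and at the combined file you will feed to the heat-map tool.
SOURCE_GLOB = r"C:\chia_logs\harvester*.log"
COMBINED = r"C:\chia_logs\combined_debug.log"

with open(COMBINED, "a", encoding="utf-8") as out:
    for path in sorted(glob.glob(SOURCE_GLOB)):
        with open(path, encoding="utf-8", errors="replace") as src:
            for line in src:
                # Keep only the plot-lookup lines the tool cares about.
                if "eligible for farming" in line:
                    out.write(line)
```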

AndyRPH commented 3 years ago

Yeah, I think that's my concern: with just solo plots, unless you're terribly lucky you aren't finding proofs, so your times will be super fast anyway. It's only the found proofs that seem to be dramatically slower.

Jacek-ghub commented 3 years ago

I have not found a proof yet, so I cannot comment on that. However, after the v1.2.0+ upgrades, even lookups that found nothing degraded. So, potentially the root problem is not just with the proofs that are found, but with how the lookups are being done right now (and that is exacerbated when a proof is found).

So, based on what you wrote, maybe the issue in this thread is a bit closer to the root cause (bad code around lookups in general). Again, running a tool like those heat maps may give an early warning that something is going south.

Also, to me that means the full node UI should provide some stats about the history, not just the constantly changing recent results. Everything looks rather unstable, so trying to narrow issues down is rather challenging.

kalufinnle commented 3 years ago

Seems to arise from a Python import error: cannot import name 'deserialize_and_run_program2' from 'clvm_rs'.

Handri-Kosada commented 3 years ago

I have a Pi 4 8GB with Ubuntu 20.04, Chia 1.2.2, and an external USB HDD formatted exFAT, NOT using FUSE to mount.

Same issue: looking up partials is very slow.

I never had this issue before.

Jacek-ghub commented 3 years ago

I have installed v1.2.3. That basically killed my harvester that was receiving plots from the plotter: a lot of lookups over 30 seconds (before, 99% were below 1 second). Actually, I think that before v1.2.0 most of the lookups took less than a tenth of a second, whereas now they never go below a tenth of a second.

I moved disks around, so the plotter now just plots locally (no final move to the harvester). That made my harvester really happy. However, the plotter got screwed (a bunch of timeouts over 30 seconds). Initially it had five 8-10TB drives connected: three with NFT plots (3x8TB) and two OG drives (2x10TB) that were being replotted. Trying to fix it, I removed the three NFT drives from the plotter and left just the two 10TB drives with OG plots to be replotted. No improvement.

The plotter is a Win10 box, i9-10900 with 128GB RAM, using NVMe for temp1 and RAM for temp2 (MadMax plotter). The MM plotter leaves the plot on the NVMe, and a batch file moves it to its final destination (a harvester, or a local drive). The two HDs being replotted are WDC 10TB Red drives (over 200MB/s sequential plot transfers). Nothing else is running on that box. Just to be clear, the harvester / full node is a Win10 box on an i5-8255U (mobile version, really weak).

I have tried two different drives (both WDC 10TB Red) - same thing. However, when I was plotting OG plots to the same drives before (also using MM) and transferring them to the harvester, there were basically no yellow or orange blocks at all.

So, it is

  1. not the network (same thing whether over the network or locally),
  2. not the drives (OG plot transfers were not causing any issues),
  3. not the transfer script - basically the same batch file (no change in how those plots are moved).

The only change is the chia harvester. Again, before, I was doing transfers over the network to that low-power box and everything was running smoothly.

Attached are heat maps that show the trend on that plotter (v1.2.3 - plotting to HD).

Jacek-ghub commented 3 years ago

By the way, assuming that one or more harvesters are affected like this (due to new plots coming in), is it possible that the full node is also degraded (e.g., waiting for that one outlier even though it already has good results in hand)? I belong to a pool, and it looks to me like the daily percentage dropped. Of course, I understand that part of the system is luck and tomorrow could be a lucky day, but ...

So, the question is whether a plotter should be isolated from the working farm, and once a disk is full, that disk manually moved to a working harvester.

keliew commented 3 years ago

Assuming you have plenty of plots, could you try removing all plot directories and then adding them back one by one, checking the results after each directory?
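For what it's worth, here is a rough sketch of scripting that on the harvester. It assumes the default mainnet config location, that the plot directories live under harvester: plot_directories: in config.yaml, and that the chia CLI is on the PATH - a sketch of the idea, not an official procedure:

```python
import subprocess
from pathlib import Path

import yaml  # pip install pyyaml

# Default mainnet config location; adjust if CHIA_ROOT points elsewhere.
CONFIG = Path.home() / ".chia" / "mainnet" / "config" / "config.yaml"

with open(CONFIG) as fh:
    directories = yaml.safe_load(fh)["harvester"]["plot_directories"]

# Drop every plot directory first...
for d in directories:
    subprocess.run(["chia", "plots", "remove", "-d", d], check=True)

# ...then add them back one at a time, pausing so the heat map / debug.log
# can be checked before the next directory is added.
for d in directories:
    subprocess.run(["chia", "plots", "add", "-d", d], check=True)
    input(f"Added {d} - check lookup times, then press Enter for the next one.")
```

Depending on the version, the harvester may need a refresh interval (or a restart) before it picks up each change.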

github-actions[bot] commented 3 years ago

This issue has been flagged as stale as there has been no activity on it in 14 days. If this issue is still affecting you and in need of review, please update it to keep it open.

github-actions[bot] commented 3 years ago

This issue was automatically closed because it has been flagged as stale and subsequently passed 7 days with no further activity.