Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

My farming/harvesting seems broken #15485

Closed BertBril closed 1 year ago

BertBril commented 1 year ago

Discussed in https://github.com/Chia-Network/chia-blockchain/discussions/15478

Originally posted by **BertBril** June 9, 2023

I've been farming since August 2021. During that time I had some hardware problems and I had to relocate. When the system is up, I get a nice harvest every month or so, nicely in line with (probably a bit better than) the 'estimated time to win'. But since around March of this year I haven't harvested anything. I know we are talking probabilities here, but it's getting strange. This graph says it all:

![no_harvest_for_months](https://github.com/Chia-Network/chia-blockchain/assets/83375976/e5af7193-2c58-43be-8333-e78f4998c7fc)

I'm wondering whether I've missed some change in setup; it looks very much like I'm no longer harvesting. All the outputs like 'chia farm summary' look just like before, but for some reason nothing comes in. My question is therefore: how can I figure out whether something is wrong, and if so, what?

Nobody responded to that in the discussion, but I think the probability that this is not some kind of software issue is very low. Maybe by coincidence the lack of harvesting started with using version 1.7. I'm on version 1.8 now (Ubuntu LTS - 20.04 64 bits), here are some outputs:

    % chia farm summary
    Farming status: Farming
    Total chia farmed: 24.xxx
    User transaction fees: 0.xxx
    Block rewards: 24.000
    Last height farmed: 3254628
    Local Harvester
       1037 plots of size: 105.xxx TiB
    Plot count for all harvesters: 1037
    Total size of plots: 105.xxx TiB
    Estimated network space: 24.675 EiB
    Expected time to win: 1 month and 3 weeks

    % chia plots check -g /mnt/SG0 &> chia_dir_check.txt

Attachment: chia_dir_check.zip

iTZUNAMI commented 1 year ago

I have a similar problem: my expected time to win is 1 month, I got my last XCH on April 17, and now, after 2 months with no problems, connection timeouts, etc., I'm still waiting.

BertBril commented 1 year ago

> I have a similar problem: my expected time to win is 1 month, I got my last XCH on April 17, and now, after 2 months with no problems, connection timeouts, etc., I'm still waiting.

We have to realize that our chances are very, very low, and the only reason we should be getting a 'hit' every now and then is the sheer number of challenges. So if you want to make your point, please share some more details, like average time to win or, better, a good plot of XCH vs time. I think my plot very definitely shows that there is something wrong. (Of course, you may not believe my data, but that's another issue altogether. If that were the case, I would still be very interested in the hypothetical approach people in the community would take to investigate the matter ...)

hbroer commented 1 year ago

You could go years without a win, or get a bunch in a row in a single day. That's how this random thing works. If you want more predictable income, go for pooling.

Expected time to win: 2 months and 3 weeks

1/2/23, 9:45 AM
12/21/22, 3:28 PM
11/9/22, 12:13 PM
2/25/22, 11:47 AM
2/3/22, 10:09 AM
1/30/22, 11:09 PM

Last year 5, and 3 of them together in January and February. Two together in November and December. This year only one so far in January. Only offline for updates. ^^

BertBril commented 1 year ago

> You could go years without a win, or get a bunch in a row in a single day. That's how this random thing works. If you want more predictable income, go for pooling.

I think the graph I included shows that it does not work like you say: 'not win for years and then win a bunch in a row'. It shows that the process is reasonably predictable if you look at it at the appropriate time scale. Having over 100 TB requires this scale to be in months. People with PBs can expect regularity at a much smaller scale, and if they don't get a hit for several weeks they should really 'lift the hood' to figure out what's wrong. There's a probability for everything, including the distribution of time between hits. I am not saying that what is happening now is impossible; it's just becoming more and more unlikely that this is just a 'bad streak'.

iTZUNAMI commented 1 year ago

> > I have a similar problem: my expected time to win is 1 month, I got my last XCH on April 17, and now, after 2 months with no problems, connection timeouts, etc., I'm still waiting.

> We have to realize that our chances are very, very low, and the only reason we should be getting a 'hit' every now and then is the sheer number of challenges. So if you want to make your point, please share some more details, like average time to win or, better, a good plot of XCH vs time. I think my plot very definitely shows that there is something wrong. (Of course, you may not believe my data, but that's another issue altogether. If that were the case, I would still be very interested in the hypothetical approach people in the community would take to investigate the matter ...)

Yes, I know; I've been farming since May 2021. I have 166 TB, and on my own plots I've been steadily getting 2 XCH every 30-40 days for the last 6-8 months. Sometimes I got lucky, but in the last two months I got nothing... The only strange things I see are the upgrade from version 1.7 to 1.8 and the security hole that the developers can't tell us about: they just mention it in the changelog of version 1.8.0 and say they may or may not tell us more in the future. Until then we can assume that the network had some trouble with a lot of nodes, or worse, with some corrupted nodes. You can read about it here: https://github.com/Chia-Network/chia-blockchain/releases/tag/1.8.0

hbroer commented 1 year ago

> I think the graph I included shows that it does not work like you say: 'not win for years and then win a bunch in a row'. It shows that the process is reasonably predictable if you look at it at the appropriate time scale.

Just because something is unlikely does not mean that it can't happen. 2 years is not a big timescale. It only has 13 datapoints. I'm also not saying that the farm is working perfectly; there can be something wrong. But the chart also shows that the latest reward is not far off ;-) I just prefer to go for pooling because then I see much more quickly if something is wrong. Also watch the logs.

BertBril commented 1 year ago

I think we actually do agree, it's just that we put the emphasis on different aspects. This is the classical 'I haven't had two aces for so many hands now' in poker. If 'so many' crosses the 500 line, then by all means it's very possible that this is just bad luck. But at some point it's a good idea to take a look at the dealer.

So yes, I may be in 'bad luck' land, but compared to the 2 aces in poker (1/221 vs a hit every 1.5 months) I'd be on something like my 600th hand without two aces. So I'm starting to consider that there's something wrong with the dealing process, looking in the obvious places. As a software engineer, I know that bugs can be easy to introduce and hard to find, especially in situations like these. Thus, I'm angling for 'known' issues, or 'suspicions'. No attacks, no bad sport, just on the lookout.

hbroer commented 1 year ago

Let's do a quote from the internetz:

> The gambler's fallacy, also known as the Monte Carlo fallacy, occurs when an individual erroneously believes that a certain random event is less likely or more likely to happen based on the outcome of a previous event or series of events.

Randomness does not care whether you were lucky or unlucky before. It is random at every point. But I am sure you know that.

In general it would be nice if the log files were clearer and not spammed all over the place with "you can ignore that" messages. I often have trouble finding the real error because there is a lot of useless data, and that can't be configured. For example, I have one bad USB drive which always goes into standby after a while, no matter how I tried to configure the energy saving modes. That alone continuously triggers the 5 s warning about possibly missing a reward. It normally takes about 5 to 6 s when it spins up again. But I can't set that threshold to 7 s or just ignore that device altogether. On the other hand, I had a bad drive which caused the farmer to hang for many seconds, which triggered just a few other warnings while the farm was nearly dead. It took a while to find the real problem in all those spammed log files. IMO this is a problem on Chia's side. The logs are kind of useless for small farmers who don't have the time or knowledge to add a tool just for log analysis. And even with such a tool, I am sure some problems go unnoticed. With pooling I mostly see that something is wrong when I get stales, or when the partials stop or are significantly lower than normal.

iTZUNAMI commented 1 year ago

chia docs: You can find your win time estimation from the Farming tab in the Chia software. It's important to note that this is just an estimation. The real time could realistically be 2-3x this amount if you're unlucky or .5 if you're lucky.

So if you should win with 99% probability after 1 month, then wait 2-3 months at most. After that, without any new reward, I'd say there is a problem, local or global.

From what I can see, total netspace grew about 20% this last month (so our expected time to win should increase accordingly), and there is still no news about the security hole in all client versions below 1.8.0.

BertBril commented 1 year ago

"Randomness does not care whether you were lucky or unlucky before. It is random at every point."

If I flip a coin 10 times and it comes up heads 10 times, and I don't know whether the coin is 'true', then my bet on the 11th flip will be way, way different than if I know for a fact that the coin is true. I'm sure you know that these are two different branches of statistics. What we are talking about here is the hypothesis P0: I am drawing from the same probability distribution as everyone else. Right now, I'm pretty close to 95% confident that P0 needs to be rejected. All I'm saying is that the chance of something being wrong in the 'delivery chain' rises when all other factors are unknown. We just assume that the coin that is being flipped is 'true', but staying on that path indefinitely is religion rather than reason. I'll keep monitoring this; if I'm dealt no aces for the 1000th time, I'll be pretty convinced that I am not drawing from the same distribution as others.

github-actions[bot] commented 1 year ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

BertBril commented 1 year ago

It's getting highly probable my harvesting is broken, just look at this zoom-in of the stats.

[Image: xch_vs_date_jul_4]

How do I investigate this? Literally everything I look at seems fine...

wjblanke commented 1 year ago

You could put some plots on a pool to make sure your partials are being processed correctly. That will take the reward variance out of your concerns. If overall net space has not dropped then farmers are still winning blocks at the same rate.

BertBril commented 1 year ago

I'm very disappointed by the community not even being able to give me ONE useful tip on how to figure out why my harvesting has stopped. The chances that this is just a bad streak have dropped way below what counts as 'proof' in social sciences. Also big coincidence that upgrading from (I think) 1.4 to 1.8 coincides, but apparently, that's not interesting either. I'll not join pools so this spells my exit from this project. Pity, it seemed so promising at first.

iTZUNAMI commented 1 year ago

I got 2 XCH last week after a lot of months with nothing, so I can confirm that it's not a problem on our side (in my case). Just so you know, the netspace increased from 20 EiB to 26.5 EiB in these 4 months.
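For reference, the effect of that netspace growth on the expected time to win can be sketched roughly as follows. This is a back-of-the-envelope estimate, not the client's own calculation: it assumes the mainnet target of 4608 blocks per day and that the chance of winning a block is simply farm size divided by netspace. The helper name `expected_days_to_win` is made up for this example, and the farm size of 105 TiB is the rounded figure from the `chia farm summary` output earlier in the thread.

```python
# Rough estimate only: assumes every block is won with probability
# farm_size / netspace and that the network hits its block target.
BLOCKS_PER_DAY = 4608  # Chia mainnet target (32 blocks per 10 minutes)
TIB_PER_EIB = 1024 * 1024


def expected_days_to_win(farm_tib: float, netspace_eib: float) -> float:
    """Average number of days between block wins for a solo farm."""
    share_inverse = (netspace_eib * TIB_PER_EIB) / farm_tib
    return share_inverse / BLOCKS_PER_DAY


# Figures taken from the thread (farm size rounded, decimals were elided).
for netspace in (20.0, 24.675, 26.5):
    days = expected_days_to_win(105, netspace)
    print(f"{netspace:>7.3f} EiB -> about {days:.1f} days per win")
```

With the summary's 24.675 EiB this lands between 53 and 54 days, in line with the reported "1 month and 3 weeks". Since the estimate is linear in netspace, growth from 20 EiB to 26.5 EiB stretches the expected wait by a factor of 1.325 for a farm of unchanged size.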

hbroer commented 1 year ago

> not even being able to give me ONE useful tip on how to figure out why my harvesting has stopped

  1. Two comments suggested using a pool, because then you have the partials, which are more constant because they appear more often.

  2. You think your harvesting stopped. But in reality it is more likely that it hasn't and you are just unlucky.

It is way easier to find a problem if there are a few thousands of partials per day, than just a few rewards per year.

BertBril commented 1 year ago

Oh come on. It is very very unlikely that 'I'm just unlucky'. With 1.5 months 'Estimated time to win' (which is nicely supported by the regression line in the plots), it is downright ridiculous to NOT doubt your setup after almost half a year without rewards.

And now, well, apparently it's so hard to diagnose problems that the only way to go further is joining a pool?
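How unlikely a dry spell of a given length is can be estimated with a small sketch. It assumes that solo wins arrive as a Poisson process, so the waiting time between wins is exponentially distributed, and that the expected time to win stays constant over the period (it does not quite, since netspace grew). The function name `prob_no_win` is invented for this example.

```python
import math


def prob_no_win(elapsed_months: float, expected_months: float) -> float:
    """Probability of zero wins in `elapsed_months`.

    Assumes wins follow a Poisson process with a constant mean
    waiting time of `expected_months` (an assumption, see above).
    """
    return math.exp(-elapsed_months / expected_months)


# Expected time to win of 1.5 months, as quoted in the thread.
for elapsed in (1.5, 3.0, 4.5, 6.0):
    p = prob_no_win(elapsed, 1.5)
    print(f"{elapsed:.1f} months without a win: probability {p:.3f}")
```

Under these assumptions, six months without a win at a 1.5-month expected time has probability exp(-4), a little under 2%. That is unlikely for one farm, but across thousands of solo farms some will see gaps like this by chance alone, so the number by itself cannot separate bad luck from a broken setup.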

hbroer commented 1 year ago

If something happens randomly on average (over many years!) every 1.5 months it is very hard to debug and it is absolutely not "very very unlikely" that you are just unlucky. On top of the very possible "just unlucky", you could even have had bad luck with an ISP problem right around the moment when you could have got a reward. Or a too-slow HDD just at that moment. Or whatever. Something you can find in your logs.

It's funny, because you went solo with your small farm, so you chose the risk of being unlucky and getting no rewards at all. And now that the risk of being unlucky has materialized, it is everyone's fault but yours. ;-)

BertBril commented 1 year ago

> If something happens randomly on average (over many years!) every 1.5 months it is very hard to debug and it is absolutely not "very very unlikely" that you are just unlucky.

Assuming a normally distributed process you can easily look up that there is something wrong in my setup way beyond alpha=95%.

> Something you can find in your logs.

I've been using the same disks for 2 years now. So I kind of think I can ignore: "2023-04-04T19:40:40.190 harvester chia.harvester.harvester: WARNING Looking up qualities on /mnt/TO3/chia_dir/plot-k32-2021-05-24-09-51-c714673569887f6f8db4ea033f723c703dff1113b3aae9aee726920ef6a88049.plot took: 6.422817230224609. This should be below 5 seconds to minimize risk of losing rewards." It has always been like that, and the rewards came in all the time. I don't know what else I should be looking for, because as far as I can see there is nothing in particular, nothing that looks like 'Reward lost because ...'. I do get other ERRORs, but they vary wildly.

So, the question becomes: what on earth am I looking for? Is there a specific message emitted when a reward 'bounces'? (hopefully with the actual reason)

PS Your last remark is awful. You clearly ignore the theory of the statistics involved but sure as hell want to punish people for what you find idiotic behaviour. I'm also very open for a discussion why joining a pool is a bad idea because of fundamental privacy issues, but you probably are not interested in that either.

hbroer commented 1 year ago

> joining a pool is a bad idea because of fundamental privacy issues

I am curious. ^^

hbroer commented 1 year ago

For the Error thing: https://docs.chia.net/checking-farm-health/

Maybe set logging to info and write a script to filter for some data, like the different error/warning types and also the proofs. Then you can build statistics from it. But if you then find proofs, and I really dislike repeating myself, you are just unlucky, no matter what you think about statistics and the impossibility of being that unlucky.

And as for the "WARNING Looking up qualities" warning, I get them too, but only for one disk. Don't let the disks spin down into standby. With many of them you can disable that.
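A filter script along those lines could look like the sketch below. It is only an illustration: it assumes the harvester's INFO lines have the shape "N plots were eligible for farming ... Found M proofs. Time: T s. Total P plots" and that the log level appears right after the logger name, as in the WARNING line quoted earlier in the thread. The function name `summarize` and the 5-second threshold constant are choices made for this example; check a few real lines from your own log before trusting the counts.

```python
import re
from collections import Counter

# Assumed shape of the harvester's INFO line (verify against your log).
ELIGIBLE_RE = re.compile(
    r"(?P<eligible>\d+) plots were eligible for farming .*?"
    r"Found (?P<proofs>\d+) proofs\. Time: (?P<secs>[\d.]+) s\. "
    r"Total (?P<total>\d+) plots"
)
LEVEL_RE = re.compile(r": (WARNING|ERROR)\s")
SLOW_SECONDS = 5.0  # threshold mentioned in the harvester warning


def summarize(lines):
    """Count challenges, eligible plots, proofs, slow lookups and log levels."""
    stats = {"challenges": 0, "eligible": 0, "proofs": 0, "slow": 0, "max_secs": 0.0}
    levels = Counter()
    for line in lines:
        match = ELIGIBLE_RE.search(line)
        if match:
            secs = float(match.group("secs"))
            stats["challenges"] += 1
            stats["eligible"] += int(match.group("eligible"))
            stats["proofs"] += int(match.group("proofs"))
            stats["slow"] += secs > SLOW_SECONDS
            stats["max_secs"] = max(stats["max_secs"], secs)
        level = LEVEL_RE.search(line)
        if level:
            levels[level.group(1)] += 1
    return stats, levels


# Usage (the path is the usual default location, adjust if yours differs):
#   with open("/home/you/.chia/mainnet/log/debug.log", errors="replace") as f:
#       stats, levels = summarize(f)
#   print(stats, dict(levels))
```

If the challenge count keeps growing at a steady rate and lookups mostly stay under the threshold, the harvester is at least answering challenges; a count that stalls or many slow lookups would point at a local problem rather than luck.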

BertBril commented 1 year ago

> > joining a pool is a bad idea because of fundamental privacy issues

> I am curious. ^^

It's only a matter of time, if it's not already in progress, before the pooling people get a visit from certain government agencies.

BertBril commented 1 year ago

> For the Error thing: https://docs.chia.net/checking-farm-health/

My config.yaml looks pretty different from what this document describes. Maybe because I started when mainnet had just gone up, I have loads of entries for testnets. I'll have to go through the setup and see what's going on. Man, things used to be plug and play.

The spin-down issue is also a bit weird, I was told at the time that this was taken care of by the protocol which allowed 30 seconds (more than enough for all the disks I bought). The error message suggests this has been brought down. I'll use hdparm to disable the spindowns.

hbroer commented 1 year ago

> > > joining a pool is a bad idea because of fundamental privacy issues

> > I am curious. ^^

> It's only a matter of time, if it's not already in progress, before the pooling people get a visit from certain government agencies.

I smell conspiracy theories. lol

nuff said here