JayDDee / cpuminer-opt

Optimized multi algo CPU miner

HODL error #237

Closed platinum4 closed 4 years ago

platinum4 commented 4 years ago

Hi, please see the bug below:

i7-4790K 32GB RAM

image

JayDDee commented 4 years ago

I didn't know anyone was still mining hodl. Thanks for reporting. I'll fix it in the next release.

Please report any other issues with hodl. If you're willing to do some verification testing maybe I can clean it up a bit.

platinum4 commented 4 years ago

Sure, let me know please. Thanks again for your hard work.

JayDDee commented 4 years ago

The "BUG" is in the way share results are reported. This may account for the missing share stats so you might want to wait for the next release to report other issues.

The invalid share is a concern, do they happen often?

platinum4 commented 4 years ago

Not sure; the miner closes after that first share. I had to add a pause line to the .bat file to catch it before it closed the window.

It happens every 1st share though so it doesn't mine effectively at all.

JayDDee commented 4 years ago

You can open a new issue for that or we can track it here. Is there any error message on exit? A segfault? Or just a silent exit?

Use the command line directly to get better error info.

platinum4 commented 4 years ago

Silent exit. Hang on, how can I enable logging for you on this? I can't see it in the readme or --help.

JayDDee commented 4 years ago

I suggest that when you identify a problem, you go back to older releases to compare. Finding the release that broke it is 90% of solving the problem. There hasn't been much activity in the hodl code, so you should go back by major releases, then focus in on the minor release.

JayDDee commented 4 years ago

Copy/paste is a problem with the command line, but if you use the command line directly you can capture a screenshot of the output.

Just to confirm the problem, do you always get one good share then an invalid share then exit?

You can also add -D to the command line to enable debug output.

platinum4 commented 4 years ago

Confirm yes v3.11.9 does what you are asking image

platinum4 commented 4 years ago

I'm trying v3.9.5.3 right now

JayDDee commented 4 years ago

I hadn't noticed the multiple unconfirmed submits before. That could be a server or networking issue.

You should note that older releases will have different output; the focus should be on the share results and the silent exit, not the messages. Silent exits are the worst kind because there are no clues.

platinum4 commented 4 years ago

Yeah I know, v3.9.5.3 does not exit, look below image image

platinum4 commented 4 years ago

cpuminer-avx2 -a hodl -o stratum+tcp://hodl.optiminer.pl:5555 -u HFX3N3DZKMjsJsRTwfAgq4iqoCSNh12Qfb -p x -D

JayDDee commented 4 years ago

Looks like 2 separate problems (3 with the "BUG").

  1. Invalid shares since before v3.9.5.3. (suggest testing 3.8.8.1)

  2. Silent exit after v3.9.5.3.

Both are show stoppers. The replies are slow but appear to arrive eventually, you just don't see them when the program crashes before they arrive.

It's an interesting pattern: one burst of 6 shares submitted with the first one being valid and the others invalid, repeat. This pattern doesn't look like anything the miner would do.

The miner just does the same thing over and over again, when it gets a result it thinks is valid it submits it to the server, the server verifies it and sends a reply.

I think the server is doing something funny. It could explain the burst of shares, the delayed replies and the result pattern. I can't think of anything the miner could do to produce these symptoms.

You could also test with hodlminer and wolf-hodleminer to confirm it's a server issue.

platinum4 commented 4 years ago

v3.8.8.1 works image

platinum4 commented 4 years ago

v3.9.4 works image

JayDDee commented 4 years ago

You can ignore much of my previous post. You don't need to test with another miner if you find a working version of cpuminer-opt.

I still don't understand how the miner could produce the pattern but when the exact release is identified I'll have somewhere to look.

platinum4 commented 4 years ago

v3.9.5.2 appears to not work; look at how it is receiving jobs image

platinum4 commented 4 years ago

job -1 seems to be the issue I would be guessing.

platinum4 commented 4 years ago

Same job -1 on v3.9.5.1 image

platinum4 commented 4 years ago

job -1 v3.9.5 image

platinum4 commented 4 years ago

v3.9.4 is the last version that worked with -a hodl if you get a chance to check this out that'd be great dude.

platinum4 commented 4 years ago

Even the miner loader has mentioned this has been a problem for a bit image

JayDDee commented 4 years ago

I don't understand your last post, what's a miner loader?

Thanks for the good work, I now have a lead I can follow up.

platinum4 commented 4 years ago

Powershell script for the miner in a multiminer, anyhow, I think it happened during restructuring from v3.9.4 to v3.9.5

JayDDee commented 4 years ago

I've reproduced the problem exactly on Linux except for the silent exit; Linux keeps hashing. V3.9.5 was a big release with big changes to hodl and core code.

platinum4 commented 4 years ago

Thanks buddy, I'm sure once you get it fixed there when it's compiled to Windows it should be fine.

JayDDee commented 4 years ago

I think I understand the crash, the jobid is too long and causes a buffer overflow when submitting a share.
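As a rough illustration of the failure mode described here, the sketch below shows how a fixed-size buffer can be guarded against an over-long job id in C. The buffer size, the message layout, and the function name `format_submit` are all hypothetical and are not taken from the cpuminer-opt source. The only point is that `snprintf` reports the length it needed, so an oversized id can be rejected instead of overflowing the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size buffer for a share submission; the size and the
   message layout are illustrative, not taken from cpuminer-opt. */
#define SUBMIT_BUF_SIZE 64

/* Formats a submit message into buf. Returns 0 on success, or -1 if the
   message would not fit, instead of writing past the end of buf the way an
   unchecked sprintf() would. */
int format_submit(char *buf, size_t size, const char *job_id, unsigned nonce)
{
    int needed = snprintf(buf, size, "{\"job\":\"%s\",\"nonce\":\"%08x\"}",
                          job_id, nonce);
    if (needed < 0 || (size_t)needed >= size)
        return -1;   /* encoding error, or output was truncated */
    return 0;
}
```

With a 64-byte buffer, a short id such as "abc" fits and the call returns 0, while a job id of a couple of hundred characters makes it return -1 without writing outside the buffer.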

I also have a fix for the "BUG". It now shows some share stats but not all.

But I still get the burst of submits followed by one accepted and the rest rejected. That is a very strange pattern that will require some time to think about what could cause it.

The goal is to get the shares working and live with whatever stats issues remain, as they are specific to hodl and likely due to the hodl stratum not providing the data.

If I don't have a solution for the rejects soon I will release what I have (I was getting ready for a release when you reported the issue) and then look deeper into hodl. You would be able to confirm the crash was fixed on Windows, if nothing else.

The excessive messages are a problem but a low priority.

platinum4 commented 4 years ago

Sure, I am willing to test whatever if you need it, not sure if you have a version history between 3.9.5 and 3.9.4

JayDDee commented 4 years ago

Oh yes, I have the version history, 4 snapshots between 3.9.4 and 3.9.5, but there's nothing obvious that changed. Hodl code didn't really change, just a small administrative change. It appears hodl was a victim of changes to code that's used by all algos but only Hodl broke.

JayDDee commented 4 years ago

With some stats now working I can see what appears to be submitting the same share over and over again, the share diff is exactly the same for all shares in the burst.

JayDDee commented 4 years ago

This is going to be difficult.

I have seen the same share submitted by multiple threads and the same share submitted multiple times by the same thread. Both should be impossible.

The only change to hodl code was an interface change that required the same code change for every other algo as well with no ill effects.

The major change in v3.9.5 was the introduction of statistics. The stats are gathered in mining functions but should not interfere with mining in any way. Again, it affects all algos and only Hodl broke.

The crash is believed to have been caused by a job id that was longer than the buffer. This was the result of tracking job ids as part of the stats feature. So it is possible for the stats code to indirectly affect mining.

At this point my only lead is to follow up with the job id to see if there are other places it could overflow a buffer. This would corrupt data and result in unexplainable behaviour, and we certainly have unexplainable behaviour. It would also explain why hodl broke with no significant changes to its code. The excessively long job id may be the nexus.

I'm going to go ahead with the next release and pick up this issue afterwards when I can focus a little better.

In the upcoming release you should (hopefully) expect the following:

The fix for the rejects will hopefully be in the following release.

platinum4 commented 4 years ago

You're good dude, as you mentioned, I'm probably the only person on earth trying to mine this at the moment.

JayDDee commented 4 years ago

LOL. I've considered just saying use the old version, and it may eventually come to that but I'm not ready to give up yet.

platinum4 commented 4 years ago

No problem maybe a fresh set of eyes after the sun has moved around might help

JayDDee commented 4 years ago

Have you tried v3.12.0 yet?

platinum4 commented 4 years ago

image

JayDDee commented 4 years ago

Did it crash?

platinum4 commented 4 years ago

Yes you can see it exit to DOS prompt due to a crash after that first attempt at a share.

JayDDee commented 4 years ago

It's better if you don't use a bat file for testing, especially when debugging a crash or silent exit.

So, it still crashes. That means I fixed a different crash. This is a step backward.

I'm going to have to take a different approach. I'll ignore the crash for now and focus on the changes in 3.9.5 that broke it. I'll try removing some of that new code to see if I can fix the rejects without breaking anything else. If I can identify what code broke it I can figure out why.

JayDDee commented 4 years ago

Fixed it!

It was a stupid error in hodl code, I used the type instead of the variable name in a function call. I have no idea why it compiled but that bug explains why every thread was trying to submit the same hash.

The stats also work. The only remaining issue I see is the repeated job logs. That is for another day. I'll be releasing the reject fix soon.

There is also the delayed replies but that's not a miner issue.

JayDDee commented 4 years ago

The repeated job logs may not be fixed. The problem is unique to hodl but the solution would affect performance of all algos. The fix is to compare the job ids before displaying the log but that involves an expensive string comparison in code used by all algos.
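For illustration only, the comparison being weighed here might look roughly like the C sketch below. The function name `should_log_job`, the buffer size, and the single static "last logged" slot are assumptions made for the example and are not taken from the cpuminer-opt source.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define JOB_ID_MAX 128   /* hypothetical upper bound on stored job id length */

/* Job id of the most recently logged job (hypothetical bookkeeping). */
static char last_logged_job[JOB_ID_MAX] = "";

/* Returns true if job_id differs from the last one logged, and records it.
   The strcmp() is the string comparison that would run on every new job.
   Ids longer than the buffer are stored truncated, so they always compare
   as new. */
bool should_log_job(const char *job_id)
{
    if (strcmp(last_logged_job, job_id) == 0)
        return false;   /* same job as last time: skip the log line */
    snprintf(last_logged_job, sizeof last_logged_job, "%s", job_id);
    return true;
}
```

A caller would print the "new job" line only when `should_log_job` returns true, so a repeated id such as the "-1" seen in this thread would be logged once rather than on every notification.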

platinum4 commented 4 years ago

Isn't it fun to hunt down a genuine problem every now and again, thanks so much for your effort dude!

JayDDee commented 4 years ago

But I'm pissed the compiler didn't catch it. The arg is supposed to be a variable of the type, not the type itself. It should have been a compile error.
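The offending line is not shown in this thread, so the following is only a guess at how a slip like this can get past a C compiler. In C, struct tags and ordinary identifiers live in separate name spaces, so a global variable may legally share its name with a struct type. If that is the situation, writing the "type name" as an argument actually names the global, and the call compiles cleanly. All names below (`thr_info`, `mythr`, the `submit_*` functions) are hypothetical and are not claimed to match the cpuminer-opt source.

```c
/* Hypothetical per-thread record. */
struct thr_info {
    int id;
};

/* A global array that shares its name with the struct tag. This is legal
   because struct tags and ordinary identifiers have separate name spaces. */
struct thr_info thr_info[4] = { {0}, {1}, {2}, {3} };

/* Returns the id of the thread record it is given. */
int submitting_thread(const struct thr_info *t)
{
    return t->id;
}

/* Intended call: pass this thread's own record. */
int submit_correct(const struct thr_info *mythr)
{
    return submitting_thread(mythr);
}

/* Slip: the bare name thr_info is the global array, which decays to a
   pointer to element 0, so every caller reports thread 0. The argument has
   the expected pointer type, so no diagnostic is required. */
int submit_wrong(const struct thr_info *mythr)
{
    (void)mythr;   /* the thread's own record is never used */
    return submitting_thread(thr_info);
}
```

Under these assumptions, `submit_correct(&thr_info[2])` returns 2 while `submit_wrong(&thr_info[2])` returns 0. Every thread ends up acting on the first thread's data, which would be consistent with all threads submitting the same share.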

platinum4 commented 4 years ago

I agree. Based on the coding I did in seventh grade, what you are saying makes sense; if it should not have been passable, I don't see how it compiled.

JayDDee commented 4 years ago

cpuminer-opt-3.12.0.1 is released. Please test and report any problems.

platinum4 commented 4 years ago

Looking good my dude, seems to be accepting shares as normal. Thanks again!