Major issue causing lots of Orphaned Blocks due to compiling with boost v1.55

jo2jo commented 10 years ago

Here is the bug: the result is transactions taking very long to verify (even with a very well connected client) and a major increase in orphaned blocks since version 1.5 of the Dogecoin client. Below is mostly a copy and paste, but it seemed critical enough that I wanted to be sure it was brought to the attention of the Dogecoin developers ASAP:

(paste) There's a bug with the previous version of Litecoin where the client will randomly stop syncing. This results in your node basically being down until you restart the client. The way to know this is occurring is to check the resource monitor (go to Task Manager -> Performance tab). If Dogecoin isn't sending or receiving any data, it's likely down. When you restart the client, you will be several hours behind (from whenever it stopped syncing). This was fixed in the newest version of Litecoin, but showed up in the second most recent version (Dogecoin is based on Litecoin, and updating to the more recent Litecoin codebase likely picked up the issue). This bug weakens the Dogecoin network a bit and can be very annoying. Edit: If you aren't seeing transactions show up after you know they've been sent to you (e.g. a mining pool withdrawal), then you are likely affected and simply need to restart the dogecoin-QT client to check.

This is due to the "boost" library used when building. We were previously on Boost 1.54. I've recompiled the binaries and you can download the latest (fixed) (/end paste)
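The manual check described in the paste (watching for a client that has gone quiet) could also be automated by polling the block height. The following is a minimal sketch, not part of any client: it assumes the samples are collected by some external poller calling the daemon's `getblockcount` RPC, and the function name `is_stalled` and the 600-second threshold are illustrative choices.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, int]],
               max_idle_seconds: float = 600.0) -> bool:
    """Return True if the block height has not advanced for max_idle_seconds.

    samples: (unix_timestamp, block_height) pairs in chronological order,
    e.g. gathered by polling `getblockcount` once a minute (assumption).
    """
    if len(samples) < 2:
        return False  # not enough data to judge

    last_time, last_height = samples[-1]

    # Walk backwards to find the earliest sample that already showed
    # the current height; that is when the height last changed.
    since = last_time
    for timestamp, height in reversed(samples):
        if height != last_height:
            break
        since = timestamp

    return (last_time - since) >= max_idle_seconds
```

With a one-minute block time, a threshold of several minutes without a new block is a reasonable first warning, but the right value depends on how noisy the network is.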

The issue is described in detail, along with some user-created patches / fixes:

http://www.reddit.com/r/dogecoin/comments/1wfj2t/dogecoin_15_suffers_from_an_old_litecoin_bug/

As you can see in these graphs, orphaned blocks have increased since the version 1.5 release: http://bitinfocharts.com/comparison/orphaned-btc-ltc-ppc-doge-nmc.html

hashfaster commented 10 years ago

Any workaround you know of?

educatedwarrior commented 10 years ago

@ummjackson, I know you guys are working on a lot of cool stuff. Could you tell us where this falls in the list of your team's priorities? This is really affecting my pool, to the point where I may have to shut it down. Does this have something to do with the testnet you guys are working on?

billym2k commented 10 years ago

What is the specific bug that is affecting your pool? What are your dependencies? What is the data for the difference between orphans before 1.5 and after 1.5? This issue title is misleading: it is about a problem while running the QT client, not about the dogecoind daemon running on Linux machines.

The graph is also misleading, as it shows 0 orphans before and many after, which is simply not true. Probably the change to 1.5 allowed this data to be gathered.

hashfaster commented 10 years ago

When I upgraded a 1.2GB wallet.dat from 1.4.0 to 1.5.0, I instantly observed HUGE delays in JSON RPC (over 30 seconds per query). During those 30 seconds, the wallet was constantly flushing and writing 250MB/sec of IO to disk (SSD based). This essentially made it completely useless to run as a pool wallet, as I was only getting 2-3 payouts per minute (thanks, RPC), and since MPOS (frontend) uses RPC for dashboard data, it was causing 30 second page load times (and in most cases, page load failures).

ghost commented 10 years ago

Dave, why don't you create a process to schedule a wallet.dat "roll" at certain intervals during a maintenance cycle? The wallet.dat file seems way too big, and the client does not seem designed to handle one that large. Let me know what you think of the approach.

(Sent by email in reply to Dave's comment of Tue, Feb 4, 2014 at 2:30 PM, quoted above.)

hashfaster commented 10 years ago

Unless you have a specific procedure in mind, replacing a wallet (new address) is a pretty large undertaking on pools like mine. If I am thinking what you're thinking, it means I have to bounce all stratum servers and insert a new wallet, then move A LOT of coins around and risk paying out orphans: if I have confirming blocks (50 confirms) when I replace the wallet, I lose the ability to track those immature blocks, which COULD orphan after I replace the wallet while they are not yet confirmed. (I did this once already, actually 3 weeks ago.)

ghost commented 10 years ago

Do you have a wallet.dat growth curve projection? It has grown this large in just 3 weeks? There is a related bug where this may find a solution; have you seen it? The ticket is #187.

(Sent by email in reply to Dave's comment of Tue, Feb 4, 2014 at 3:49 PM, quoted above.)

educatedwarrior commented 10 years ago

@billym2k - since the first day Dogecoin started till before the 1.5 upgrade, our pool had 4 orphan blocks in total. Now we have 11 - our total more than doubled after the 1.5 upgrade and it's only been 4 days. 7 of those orphan blocks occurred after the 1.5 upgrade.

educatedwarrior commented 10 years ago

@billym2k - right after the 1.5 upgrade, our pool experienced 3 orphan blocks in a row.

add1ct3dd commented 10 years ago

This is a very real issue, we experienced about 5 orphans in 2 days, and have now reverted back to 1.4.1, and no orphans since.

educatedwarrior commented 10 years ago

Last time there was a major release, over half the mining pools were mining on the wrong fork. There definitely seems to be a lack of communication between dogecoin devs and the mining pool community.

Also, when we upgraded to 1.5, we had to resync the entire blockchain. Painful upgrade. Now a painful downgrade back to 1.4.1.

add1ct3dd commented 10 years ago

@educatedwarrior always back up your working dir, easy to swap back then :) But yeah, it should not be an issue in the first place imo!

educatedwarrior commented 10 years ago

@add1ct3dd , thanks for the tip... that's a good idea. I got a db error when I tried to switch back though.

toxicwind commented 10 years ago

@casalej Our pool is using the static change=dogecoinaddy addresses and this issue still happens.

educatedwarrior commented 10 years ago

@add1ct3dd Did you get a database error when you switched back? I got a db error and had to delete the blockchain db and resync. If I hadn't gotten the error, I would have been able to switch back without resyncing the db.

add1ct3dd commented 10 years ago

Nope, we backed up the 1.4.1 working directory, then updated dogecoind, and then started the new version.

When reverting, we just moved the 1.4.1 db back, and only had to update the blocks from the time we stopped updating that directory.
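The back-up-then-swap procedure described above can be sketched as two small helpers. This is an illustration only: the function names, the backup layout, and the example paths are assumptions, and the daemon must be fully stopped before either step so the database files are not copied mid-write.

```python
import shutil
from pathlib import Path


def backup_datadir(datadir: Path, backup_root: Path, label: str) -> Path:
    """Copy the whole working directory to backup_root/label; return the copy."""
    target = backup_root / label
    if target.exists():
        # Refuse to overwrite an earlier backup with the same label.
        raise FileExistsError(f"backup already exists: {target}")
    shutil.copytree(datadir, target)
    return target


def restore_datadir(backup: Path, datadir: Path) -> None:
    """Replace datadir with the contents of a previous backup."""
    if datadir.exists():
        shutil.rmtree(datadir)
    shutil.copytree(backup, datadir)


# Example usage (hypothetical paths; stop the daemon first):
#   saved = backup_datadir(Path.home() / ".dogecoin",
#                          Path("/srv/backups"), "1.4.1")
#   ... upgrade, run the new version, decide to revert ...
#   restore_datadir(saved, Path.home() / ".dogecoin")
```

After a restore, the node only needs to catch up on the blocks produced since the backup was taken, which matches the experience reported above.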

billym2k commented 10 years ago

I'm not seeing a valid reason behind the orphaned blocks, just a cause-and-effect hypothesis.

educatedwarrior commented 10 years ago

@billym2k , you serious? Well I don't see any of the major pools wanting to upgrade at the moment, until you can prove it is not true. Based on popular consensus right now, 1.5 is buggy.

ghost commented 10 years ago

Could it be sync issues @billym2k ? :D

billym2k commented 10 years ago

I'm saying that clearly there is an issue, but I'm just getting the "there is an issue" part, without much idea of the cause. Possibly sync issues could be the case. It's difficult for me to test, I'm not running a pool.

ummjackson commented 10 years ago

OP (@jo2jo) mentioned that this was fixed in the latest Litecoin release - can someone please reference said commit to the Litecoin code base?

big-big-big-yoshi commented 10 years ago

We have run 1.5 since it came out without any issue on our pool. Can you post your dogecoin conf (with rpc user/password info removed, of course)?

add1ct3dd commented 10 years ago

Are you 100% sure it's 1.5? I'm yet to see someone not have the issue; I've been checking around on IRC.

big-big-big-yoshi commented 10 years ago

Yes, I am 100% certain we are on 1.5. We have been on it for a while now without any uptick in orphans or other issues. Maybe it's something different in our conf files?

hashfaster commented 10 years ago

The number of transactions is also key here. A <1Gh pool might have no issues for a while but eventually will. My pool is 7Gh, and I INSTANTLY had an issue and had to revert.

toxicwind commented 10 years ago

This is happening even with a 300MH pool. Are you sure you are on 1.5, netcodepool? This issue is happening to 4 other dogepool owners I know, and even normal people seem to be desyncing, just not as often as the pools. ZC, is this same issue happening as well by chance? Or do you have no clue, as you have already reverted? https://github.com/dogecoin/dogecoin/issues/217

add1ct3dd commented 10 years ago

We also had 5 or so orphans on 250Mh/s, and our overall block luck is 80%..

Laggy transactions are probably down to the daemon de-syncing, so it sounds like the same issue imo.

@netcodepool post your config?

educatedwarrior commented 10 years ago

Went back to 1.4.1 and not one orphan since. I know that is not helpful.

azamatms commented 10 years ago

We had this issue on RapidHash as well and got a total of 12 new orphans. This is happening because of wallet flushes with large wallets. The wallet locks up and stops syncing the chain, and orphaned blocks result from the chain lag.

We separated the payouts and mining wallets. 1.4.1 for payouts and 1.5 for mining.
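The split described above (one daemon for mining, another for payouts) amounts to routing each RPC call by method name. Below is a minimal sketch of that routing, assuming two local daemons; the URLs, ports, and helper names are hypothetical placeholders, and the method set only lists the block-related RPCs a pool typically uses. Nothing is sent over the network here; the helper just builds the request.

```python
import json
from typing import Any, List, Optional, Tuple

# Hypothetical endpoints: hosts and ports are placeholders, not defaults.
SUBMIT_DAEMON_URL = "http://127.0.0.1:9001"  # small wallet, block work only
PAYOUT_DAEMON_URL = "http://127.0.0.1:9002"  # transaction-heavy wallet

# RPC methods that should never wait behind a busy payout wallet.
BLOCK_METHODS = {"getblocktemplate", "getwork", "submitblock"}


def endpoint_for(method: str) -> str:
    """Pick the daemon that should serve a given RPC method."""
    return SUBMIT_DAEMON_URL if method in BLOCK_METHODS else PAYOUT_DAEMON_URL


def build_request(method: str,
                  params: Optional[List[Any]] = None,
                  request_id: int = 1) -> Tuple[str, bytes]:
    """Return (url, JSON body) for a JSON-RPC call routed to the right daemon."""
    body = json.dumps({
        "jsonrpc": "1.0",
        "id": request_id,
        "method": method,
        "params": params or [],
    })
    return endpoint_for(method), body.encode("utf-8")
```

The point of the design is isolation: a wallet flush that stalls the payout daemon for 30 seconds then cannot delay a block submission.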

ummjackson commented 10 years ago

@rog1121 Thanks for the explanation, this makes sense to me. Any ideas on how we could address this? (Assuming it's due to Dogecoin's # of transactions, because we're 99.99% based on the Litecoin codebase if you do a diff)

azamatms commented 10 years ago

@ummjackson see this issue https://github.com/dogecoin/dogecoin/issues/217

It's due to the tons of transactions that mining pools like mine send. On 1.5, TX IDs wouldn't even get broadcast for 4-5 hours due to the wallet flush issue. It seems the best solution is to get rid of BDB and move to LevelDB as fast as we can.

EDIT: I might be mistaken, not sure if 1.5 is still using BDB. Either way, it's large amounts of unbroadcast transactions that are causing this issue.

ummjackson commented 10 years ago

@rog1121 Bitcoin/Litecoin never moved to LevelDB for wallet.dat, they moved for the blockchain database. Both still use BDB for wallet.dat, and we're running on the latest Litecoin code base - so I guess it's down to the sheer number of transactions running through Dogecoin? (yay we're popular, but not yay?)

This is going to require code changes separate to Litecoin to address our ridiculous number of transactions... any ideas what we need to change?

azamatms commented 10 years ago

I'd rather not go too far off from the Litecoin base; if Litecoin doesn't do that, then we shouldn't either. Maybe look into why transactions are getting backlogged so much? That's the root of the issue here.

ummjackson commented 10 years ago

@rog1121 So we've looked into this, and apparently some Bitcoin pools had a similar issue due to having reindexed their blockchain downloads from a 1.4 DB (i.e. Bitcoin 0.6.*) rather than doing a full resync on 1.5 (Bitcoin 0.8.*). Are you able to test this somehow and do a complete blockchain resync from 1.5? Or are you running on a fresh blockchain download from 1.5?

azamatms commented 10 years ago

I ran the payouts on a fresh wallet with a new chain. I did, however, use the bootstrap file from dogechain.info, so I'm not sure if that might be an issue?

azamatms commented 10 years ago

@add1ct3dd @toxicwind @zccopwrx @netcodepool @educatedwarrior

Did you just start with a fresh blockchain on 1.5, or did you reindex/use a bootstrap file?

add1ct3dd commented 10 years ago

We reindexed

ummjackson commented 10 years ago

I really hate to do this to pool owners, but are you able to remove all pre-1.5 .dat files and try a fresh sync on 1.5? I have confirmed with @netcodepool that they did a clean resync, and that's the only difference I can see.

toxicwind commented 10 years ago

We were actually on a completely fresh db and wallet when 1.5 happened (I completely moved servers and decided to just start the wallet over):

http://bitinfocharts.com/dogecoin/address/DTzjPeJEirk3i5eQt3jtTBUk4dSjosMvTq
http://bitinfocharts.com/dogecoin/address/D9X6eX1L5KFWA2PJ4jAm68JoLNdmMSJaXw
http://bitinfocharts.com/dogecoin/address/DEFVgt9FPxRPhzgHGzzrGZPZfnY9bDfybQ
http://bitinfocharts.com/dogecoin/address/DDqgC76vsrG3yC14Vez94UY2hBxSLsQpyj
http://bitinfocharts.com/dogecoin/address/DLJNapEx6hhKtKRmNnCWF1zRU7sgBcvTPs

All transactions have been done only on a 1.5 wallet and the chain was freshly downloaded. That isn't the issue.

azamatms commented 10 years ago

@ummjackson I'll try moving payouts to a fresh synced 1.5 wallet today, with and without the single change address functionality that you guys recently added. I'll post back with results later tonight.

ghost commented 10 years ago

From my pool operator (ypool.net)

"I am still using the 1.4 Windows-Qt client. We actually have two wallets running, one only for submitting blocks and one for everything else (payouts, block confirmation etc.) This way the slow down of the transaction-filled wallet will not affect the submission wallet and blocks are always submitted as fast as possible." - jh00

educatedwarrior commented 10 years ago

@ummjackson - I did a fresh resync when we upgraded to 1.5 and still had issues.

ummjackson commented 10 years ago

Guys, I'm at an absolute loss here - we're doing nothing differently to Litecoin. Also, looking at http://bitinfocharts.com/comparison/orphaned-btc-ltc-ppc-doge-nmc.html (zoom in to 3 months) there has not been an increase in orphan blocks since 1.5 was released. Version 1.5 was released on January 27th - if anything it's been lower/more stable since 1.5 came out.

Any ideas or thoughts are appreciated, but we can't identify the root cause or reproduce the issue ourselves (Netcodepool are still not having issues).

OP mentioned they suspect this has to do with building using Boost 1.55, are you all building Dogecoin 1.5 with Boost 1.55? Can you try building with Boost 1.54 or earlier?

azamatms commented 10 years ago

@ummjackson Just upgraded payouts to 1.5.1 built with Boost 1.55

It still seems to lag on payouts, here is a debug log https://doge.rapidhash.net/debug.log

ummjackson commented 10 years ago

@rog1121 - has the upgrade to a specified change address helped at all? Not exactly sure what I'm fishing for in this log file... this initial issue report was about something different I figure?

azamatms commented 10 years ago

@ummjackson I reuploaded the debug log, the first one was incomplete. The last 1,000 lines or so are a round of transactions to users while the wallet is sort of locked up.

@toxicwind had a different theory. A pool has very large inputs, like 900K doge. When it sends one small transaction, it uses that 900K input, and almost all of the pool balance goes into unconfirmed, leading all of the other transactions to lag because the first transaction needs 3 confirms.
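The theory above can be illustrated with a toy wallet model. This is a deliberately simplified sketch, not the client's real coin selection: it assumes the wallet always spends its largest confirmed output, ignores fees, and treats change as unspendable until it reaches 3 confirmations (the figure quoted in the comment). All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

REQUIRED_CONFIRMATIONS = 3  # from the "3 confirms" mentioned in the thread


@dataclass
class Output:
    amount: int          # whole DOGE, for simplicity
    confirmations: int


def spendable(utxos: List[Output]) -> int:
    """Total balance held in sufficiently confirmed outputs."""
    return sum(o.amount for o in utxos
               if o.confirmations >= REQUIRED_CONFIRMATIONS)


def send(utxos: List[Output], amount: int) -> List[Output]:
    """Spend the largest confirmed output; the change comes back unconfirmed."""
    confirmed = [o for o in utxos
                 if o.confirmations >= REQUIRED_CONFIRMATIONS]
    if not confirmed:
        raise ValueError("no confirmed outputs available")
    source = max(confirmed, key=lambda o: o.amount)
    if source.amount < amount:
        raise ValueError("largest confirmed output is too small")

    remaining = [o for o in utxos if o is not source]
    change = source.amount - amount
    if change > 0:
        remaining.append(Output(change, 0))  # 0 confirmations: not spendable yet
    return remaining


# One huge input: a 100 DOGE payout locks up the entire balance.
single = send([Output(900_000, 50)], 100)

# The same balance split into nine inputs: only one of them is locked up.
split = send([Output(100_000, 50) for _ in range(9)], 100)
```

In this model, `spendable(single)` drops to zero after one small payout, while `spendable(split)` stays at 800,000. If the theory holds, keeping the pool balance in several medium-sized outputs would let payouts continue while change confirms.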

ummjackson commented 10 years ago

@rog1121 I can see the orphan reporting here; is that where it stalls until you restart the daemon? As far as orphan reports go, the level of logging to debug.log changed between 1.4 and 1.5, so I actually believe that's why people are saying they're seeing more orphans. The # of orphans is due entirely to our block time, and there's not much we can do about that short of bumping it to 5+ minutes and hard forking the network (which is kind of against the whole point of Dogecoin, so won't be happening).
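To make the block-time point concrete, one common simplification treats block discovery as a Poisson process, so the chance that a competing block appears while yours is still propagating is 1 - exp(-delay / block_time). The sketch below uses that model; it is an approximation rather than a measurement, and the 5-second propagation delay is an assumed figure chosen only for illustration.

```python
import math


def orphan_probability(propagation_delay_s: float, block_time_s: float) -> float:
    """P(at least one competing block is found during the propagation delay).

    Assumes block arrivals are Poisson with mean interval block_time_s.
    """
    return 1.0 - math.exp(-propagation_delay_s / block_time_s)


ASSUMED_DELAY_S = 5.0  # illustrative guess, not a measured value

# Target block intervals in seconds.
for name, block_time in [("Dogecoin (1 min)", 60.0),
                         ("Litecoin (2.5 min)", 150.0),
                         ("5 min", 300.0),
                         ("Bitcoin (10 min)", 600.0)]:
    p = orphan_probability(ASSUMED_DELAY_S, block_time)
    print(f"{name:>20}: {p:.2%}")
```

Under this model, the same delay costs proportionally more at a 1-minute block time than at 2.5 or 10 minutes. Anything that delays a daemon, such as a wallet flush that blocks syncing, would also hurt a fast chain the most.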

@toxicwind 's theory is interesting, and as I said prior I'm completely stumped by this issue - so until we can identify a root cause I'm not sure what to do. Any ideas?

ummjackson commented 10 years ago

Guys, there's a bunch of different issues being listed here. I'm going to close this - please start a new issue with an exact explanation of the behavior you're seeing. For example, are your large wallets flushing, then locking up causing transactions to lag? Or is your issue the number of orphans being found? Please clarify so we can pinpoint this issue.

azamatms commented 10 years ago

https://github.com/dogecoin/dogecoin/issues/232#issuecomment-34590521
