
Bitcoin Core integration/staging tree
https://bitcoincore.org/en/download
MIT License

Apparent memory leak #6876

Closed rustyrussell closed 7 years ago

rustyrussell commented 9 years ago

See http://lists.linuxfoundation.org/pipermail/bitcoin-dev/2015-October/011511.html and thread.

pstratem commented 9 years ago

https://github.com/bitcoin/bitcoin/pull/6872

TheBlueMatt commented 9 years ago

6872 is likely unrelated. The issue seems to be something stemming from getblocktemplate-related memory usage.

older-c-coder commented 9 years ago

Has anyone tried the Qt version to see if it has the same problem?

And BTW, are the release versions (of the daemon & Qt) compiled in debug mode?

Last question: can a unit test be cobbled up to test for this memory issue?

Ron

jtoomim commented 9 years ago

#6872 may be related. CreateNewBlock() makes fairly extensive use of CCoinsViewCache. I ran a short (4-hour) test with valgrind --tool=massif, and CCoinsViewCache came up as using a lot of memory. A much longer run is needed for this apparent memory leak to show up clearly. I'm about 24 hours into what I plan to be a 40- or 50-hour massif test, and I'll post those results when I have them. It might take even longer than 50 hours for massif to get good results: under valgrind, bitcoind takes about 30 seconds to get through a getblocktemplate call, rather than the usual <= 1 sec. I don't know whether the leak is related simply to the number of times GBT is called, or whether it depends on the uniqueness of the mempool each time GBT is called.

@older-c-coder I have not tested with Qt. A unit test would likely be difficult to write, since the issue is only apparent in terms of RSS usage after calling GBT frequently for around 24 hours. It's possible that this process might be accelerated (e.g. create a fake mempool from block X, run GBT/CNB, X++;, etc). Even then, such a test is likely to take a while, and we don't yet know if that approach will work at triggering the leak.

morcos commented 9 years ago

There has been IRC discussion of this in bitcoin-core-dev on the 20th and 21st.
I do not think there is an actual "leak". The increased memory usage is mostly a result of the effectively unlimited size of CCoinsViewCache discussed in #6872, combined with poor memory-allocator behavior.
The problem is triggered by calling getblocktemplate, which calls CreateNewBlock. In CreateNewBlock, every single tx in your memory pool is pulled into the in-memory cache of both pcoinsTip and a cache view on top of it. The memory allocator also places these allocations in a different location for every thread that calls getblocktemplate. Although this memory is freed at the end of CreateNewBlock, there is enough fragmentation that some memory is still in use towards the end of the heap, so the bulk of the memory cannot be returned to the OS. This can leave up to the number of RPC threads (default: 4) times the memory needed for one full memory pool's worth of chainstate resident. In addition, depending on the size of your dbcache, if one of the calls to getblocktemplate causes a rehash of the unordered map in your pcoinsTip cache, that rehash can also take up space in another memory arena, because it has now been allocated from a different thread.

I'm working on a short term smaller rewrite of CreateNewBlock which will not require pulling the entire memory pool worth of txs into the cache, which should make the problem much better.

jtoomim commented 9 years ago

http://toom.im/files/massif.out.27357.log Or the raw output, if you want to run ms_print on it with your own options: http://toom.im/files/massif.out.27357

(For best viewing, disable word wrap, like with less -S massif.out.27357.log.)

The highest memory usage appears to be at timestamp 47. Search for "47 68" to jump there. It looks like at that time, a total of 46.64% of the heap (411 MB) was in use by CCoinsViewCache via HTTPReq_JSONRPC and CreateNewBlock. The two offending lines in miner.cpp appear to be line 175, if (!view.HaveCoins(txin.prevout.hash)), and line 348, if (!TestBlockValidity(state, *pblock, pindexPrev, false, false)), the latter of which results in a HaveCoins call later.

laanwj commented 9 years ago

TL;DR from IRC: this seems specific to the glibc multi-threaded arena memory allocator. The application deallocates the memory properly, but pages are not given back to the system after deallocation (probably due to fragmentation from allocating lots of small objects). The workaround we determined on IRC is export MALLOC_ARENA_MAX=1, see https://gist.github.com/laanwj/efe29c7661ce9b6620a7#linux-specific

pstratem commented 9 years ago

Can confirm the workaround works. (export MALLOC_ARENA_MAX=1)

jtoomim commented 9 years ago

Seems to be working for me too. 724M RSS after 10 hours. Ten hours isn't long enough to be certain. I'll chirp in tomorrow if it's still a problem.

Edit: it's a day later, and RSS is at 1020 MB. Normally, I would expect memory usage near 1500 MB by now. Seems that MALLOC_ARENA_MAX is working.

flound1129 commented 9 years ago

2.9G RSS after restarting with export MALLOC_ARENA_MAX=1 this morning (about 14 hours ago). Still undesirably high imo.

TheBlueMatt commented 9 years ago

@flound1129 what platform are you on, and what is your -maxmempool setting and -dbcache setting?

flound1129 commented 9 years ago

uname -a

Linux p1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a

No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04.3 LTS Release: 14.04 Codename: trusty

haven't changed maxmempool or dbcache setting from defaults.

TheBlueMatt commented 9 years ago

Wait, are you running master with the mempool limiting stuff in, or a released version?


flound1129 commented 9 years ago

This is on 0.11.1

sipa commented 9 years ago

0.11.1 doesn't have any memory limits at all for the memory pool.

flound1129 commented 9 years ago

So you're saying this issue is fixed in master? Or you would like me to test with master? My mempool is only around 162 megs.

jtoomim commented 9 years ago

No, sipa is saying there are two possible reasons why your memory consumption could be high. One is that you are running into the multi-arena allocation/fragmentation bug described here in this issue, #6876. The other is that your mempool is getting bloated from the flood of 14780-byte transactions we're seeing right now.

Try sticking minrelaytxfee=0.0002 into your ~/.bitcoin/bitcoin.conf file, or running with the -minrelaytxfee=0.0001 command line option. That should reduce your mempool size by about 161 megs.
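As a bitcoin.conf sketch of the same suggestion (the value is taken from the comment above; pick the fee floor that matches your needs):

```
# ~/.bitcoin/bitcoin.conf
# Raise the relay fee floor so low-fee spam transactions are not
# accepted into the mempool in the first place.
minrelaytxfee=0.0002
```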

flound1129 commented 9 years ago

I'm not worried about having a large mempool, or really even worried about 3GB of total usage. I'm only concerned if there is an actual leak.

TheBlueMatt commented 9 years ago

I don't think we've seen any evidence of an actual leak, just bloated mempools and poor memory allocation algorithms.

jameshilliard commented 8 years ago

I was seeing this as well on my pool's node; it would appear setting limitfreerelay=0 resolves it. This bug appears to cause severe GBT latency that gets worse the longer your node is up. Using my pool block update benchmarking tool, my pool went from near the bottom of the list to the top simply by setting limitfreerelay=0. My guess is there is a bug or leak related to mempool handling of free transactions.

sipa commented 8 years ago

Large mempools result in slow GBT calls. This is not a memory leak as far as we know, just not sufficiently optimized code, and no actual limit on the mempool's size.

There is a lot of active work around improving that. One is the mempool limiting code (which has been merged into master already, but not yet in a release). There are also pull requests in flight to rewrite or otherwise optimize its performance using precomputed indexes.

jameshilliard commented 8 years ago

I noticed there was a massive discrepancy between "currentblocksize" : 473630 from getmininginfo and "bytes" : 6934459 from getmempoolinfo, even though the transaction count was roughly the same between those two RPC calls. Setting limitfreerelay=0 resolved that and they now match. BTW, I also run minrelaytxfee=0.0001.

jtoomim commented 8 years ago

I believe jameshilliard's issue to be different from the memory fragmentation issue reported in this thread. As far as I understand it from IRC, he was getting a lot of the 14790-byte txns into his mempool via the default limitfreerelay, like these two (from his getrawmempool RPC):

https://www.blocktrail.com/BTC/tx/951761082b38167ca92a94cce9f0783157434ffc66ff39715b1ac30a72cd6611

https://www.blocktrail.com/BTC/tx/5935e25d931734e42e0fff881df79ea9e0d3f4c9d552e9ac20f8825b29049f0a

These two transactions were never getting evicted from his mempool, and were not getting incorporated into blocks by other miners. These were 2 of the 4 transactions that I checked, indicating that very roughly 50% of his mempool was filled with spam. Far more of them were in his mempool than could fit in his 5% priority block space, so his GBT block sizes were kept small even though his mempool was bloated. He was complaining of mempool size, not of RSS usage. I do not think he was encountering any bugs, just features that he did not understand and that may not have been designed optimally in terms of UX.

jtoomim commented 8 years ago

"Using my pool block update bench-marking tool my pool went from near the bottom of the list to the top by simply setting limitfreerelay=0." -- Actually, it's probably just that you restarted your server, which clears out the mempool. Setting limitfreerelay=0 would change the long-term slope of your GBT latency graph and mempool size to be closer or equal to zero, but the y intercept will be low in any case.

jameshilliard commented 8 years ago

Actually, it's probably just that you restarted your server, which clears out the mempool.

I've let it run for long enough that I don't think that's the issue, before setting that flag GBT performance would degrade in a matter of hours. After changing limitfreerelay to 0, "currentblocksize" from getmininginfo and "bytes" from getmempoolinfo are now the same. The transaction counts were always roughly the same between getmininginfo and getmempoolinfo before and after this change, I guess it's possible that there were some extremely large free transactions taking up all the space.

Setting limitfreerelay=0 would change the long-term slope of your GBT latency graph and mempool size to be closer or equal to zero, but the y intercept will be low in any case.

The issue wasn't with the y intercept it was that it was sloping up over time.

maflcko commented 7 years ago

This is likely fixed since 0.12.