decentraland / bronzeage-node

DEPRECATED: This codebase will not be maintained anymore, and formats are going to change when we move to the Decentraland Iron Age
https://decentraland.org

Node on Linux Crashing with Same Error #70

Closed thoragio closed 7 years ago

thoragio commented 7 years ago

My node running on Linux (Ubuntu 16.04) has been consistently crashing with the error below for the last couple of days. I'm running the latest master from 12 July.

Let me know if there is anything else I can provide or try in order to look into this issue further. I get about an hour or two before the node crashes, and I am running on a 2 GB RAM DigitalOcean droplet (natively, not with Docker).

<--- Last few GCs --->

[16950:0x3ffa780]  1128012 ms: Scavenge 825.4 (894.8) -> 818.9 (900.3) MB, 158.7 / 0.2 ms  allocation failure
[16950:0x3ffa780]  1134627 ms: Scavenge 831.1 (900.8) -> 820.3 (904.8) MB, 427.3 / 1.9 ms  allocation failure
[16950:0x3ffa780]  1142751 ms: Scavenge 835.2 (904.8) -> 821.6 (905.8) MB, 333.5 / 4.5 ms allocation failure
[16950:0x3ffa780]  1149258 ms: Scavenge 836.8 (906.3) -> 826.9 (2788.3) MB, 412.4 / 1.1 ms  allocation failure

<--- JS stacktrace --->
Cannot get stack trace in GC.
FATAL ERROR: NewSpace::Rebalance Allocation failed - process out of memory
 1: node::Abort() [decentraland-node]
 2: 0x12b82ac [decentraland-node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [decentraland-node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [decentraland-node]
 5: 0xa98e0b [decentraland-node]
 6: v8::internal::MarkCompactCollector::EvacuateNewSpaceAndCandidates() [decentraland-node]
 7: v8::internal::MarkCompactCollector::CollectGarbage() [decentraland-node]
 8: v8::internal::Heap::MarkCompact() [decentraland-node]
 9: v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags)
10: v8::internal::Heap::CollectGarbage(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, char const*, v8::GCCallbackFlags) [decentraland-node]
11: v8::internal::Factory::NewStruct(v8::internal::InstanceType) [decentraland-node]                 
12: v8::internal::Factory::NewContextExtension(v8::internal::Handle<v8::internal::ScopeInfo>, v8::internal::Handle<v8::internal::Object>) [decentraland-node]
13: v8::internal::Factory::NewCatchContext(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Handle<v8::internal::Context>, v8::internal::Handle<v8::internal::ScopeInfo>, v8::internal::Handle<v8::internal::String>, v8::internal::Handle<v8::internal::Object>) [decentraland-node]
14: v8::internal::Runtime_PushCatchContext(int, v8::internal::Object**, v8::internal::Isolate*) [decentraland-node]
15: 0x19552d7843a7
./bin/start: line 3: 16950 Aborted                 (core dumped) ./bin/decentraland-node --fast --port=2301 --prefix="data" --httpport=8301 --n=testnet --apikey=$RPC_API_KEY --contentport=9301 --startminer
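
A possible stopgap while the root cause is unknown: since decentraland-node is a Node script (the V8 stack trace above shows it aborting inside V8), the old-space heap limit can be raised through the NODE_OPTIONS environment variable on Node 8+. This is an assumption about the setup, not a confirmed fix, and it only buys time if there is a real leak; the 1536 MB figure is arbitrary.

# Hypothetical tweak to bin/start: give V8 more heap headroom before it aborts.
NODE_OPTIONS="--max-old-space-size=1536" ./bin/decentraland-node --fast --port=2301 --prefix="data" --httpport=8301 --n=testnet --apikey=$RPC_API_KEY --contentport=9301 --startminer
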
thoragio commented 7 years ago

Further to @berrycoiled's observation here, I can confirm that the web server was not running when the crashes above occurred.

thoragio commented 7 years ago

Still crashing even after a reboot. Seems to be a memory leak given how RAM is getting used up.

[image: RAM usage graph]

CJHackerz commented 7 years ago

Yup, the server crashes after some time even in a Docker environment.

CJHackerz commented 7 years ago

[image: decentrelworld_server_crash_issue_term_out.PNG (terminal output screenshot; upload did not complete)]

ghost commented 7 years ago

Have just observed this error on a Linux node with the latest code as of today (1f74a01b037c18f24ba4fc42e9e9d323ef1ccdc7). It happened after about an hour of runtime.

I see there was a Number exceeds 2^53-1 error immediately before; unclear whether that was coincidental:

[warning] WARNING: Reorganizing chain.
[error] Number exceeds 2^53-1
{ AssertionError [ERR_ASSERTION]: Number exceeds 2^53-1
    at Object._readU64 (/home/berry/sft/land-bronze-new/lib/utils/encoding.js:71:3)
    at Object.readU64 (/home/berry/sft/land-bronze-new/lib/utils/encoding.js:85:19)
    at Object.readVarint (/home/berry/sft/land-bronze-new/lib/utils/encoding.js:470:24)
    at BufferReader.readVarint (/home/berry/sft/land-bronze-new/lib/utils/reader.js:489:25)
    at UndoCoin.fromReader (/home/berry/sft/land-bronze-new/lib/coins/undocoins.js:284:17)
    at Function.fromReader (/home/berry/sft/land-bronze-new/lib/coins/undocoins.js:321:25)
    at UndoCoins.fromRaw (/home/berry/sft/land-bronze-new/lib/coins/undocoins.js:100:30)
    at Function.fromRaw (/home/berry/sft/land-bronze-new/lib/coins/undocoins.js:112:26)
    at ChainDB.getUndoCoins (/home/berry/sft/land-bronze-new/lib/blockchain/chaindb.js:973:20)
    at getUndoCoins.next (<anonymous>)
  generatedMessage: false,
  name: 'AssertionError [ERR_ASSERTION]',
  code: 'ERR_ASSERTION',
  actual: false,
  expected: true,
  operator: '==' }
[info] Received 109 addrs (hosts=110, peers=7) (128.199.165.151:2301).
[info] Peer initialized compact blocks (128.199.165.151:2301).

<--- Last few GCs --->

[1811:0x55fa570c26c0]  4572679 ms: Mark-sweep 1414.3 (1561.6) -> 1414.2 (1531.1) MB, 1681.5 / 0.3 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1682 ms) last resort
[1811:0x55fa570c26c0]  4575162 ms: Mark-sweep 1414.2 (1531.1) -> 1414.0 (1531.1) MB, 2481.9 / 0.0 ms  last resort

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x10774939ba79 <JS Object>
    0: builtin exit frame: stringify(this=0x1077493acee9 <a JSON with map 0x81967f8bec1>,0x107749382241 <undefined>,0x262e27e76a19 <JS Function (SharedFunctionInfo 0x21a777492561)>,0x262e27e76f59 <an Object with map 0x52cfa8dda61>)

    1: _transform [/home/berry/sft/land-bronze-new/node_modules/ndjson/index.js:~28] [pc=0x34796a79e84d](this=0x3288155bcde9 <a DestroyableTransform with map 0x10d2...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [decentraland-node]
 2: 0x55fa560156de [decentraland-node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [decentraland-node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [decentraland-node]
 5: v8::internal::Factory::NewTransitionArray(int) [decentraland-node]
 6: v8::internal::TransitionArray::Insert(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::Map>, v8::internal::SimpleTransitionFlag) [decentraland-node]
 7: v8::internal::Map::CopyReplaceDescriptors(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::DescriptorArray>, v8::internal::Handle<v8::internal::LayoutDescriptor>, v8::internal::TransitionFlag, v8::internal::MaybeHandle<v8::internal::Name>, char const*, v8::internal::SimpleTransitionFlag) [decentraland-node]
 8: v8::internal::Map::CopyAddDescriptor(v8::internal::Handle<v8::internal::Map>, v8::internal::Descriptor*, v8::internal::TransitionFlag) [decentraland-node]
 9: v8::internal::Map::CopyWithField(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::FieldType>, v8::internal::PropertyAttributes, v8::internal::PropertyConstness, v8::internal::Representation, v8::internal::TransitionFlag) [decentraland-node]
10: v8::internal::Map::TransitionToDataProperty(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::PropertyConstness, v8::internal::Object::StoreFromKeyed) [decentraland-node]
11: v8::internal::LookupIterator::PrepareTransitionToDataProperty(v8::internal::Handle<v8::internal::JSObject>, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::StoreFromKeyed) [decentraland-node]
12: v8::internal::Object::AddDataProperty(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow, v8::internal::Object::StoreFromKeyed) [decentraland-node]
13: v8::internal::JSObject::AddProperty(v8::internal::Handle<v8::internal::JSObject>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes) [decentraland-node]
14: v8::internal::JsonStringifier::ApplyReplacerFunction(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [decentraland-node]
15: v8::internal::JsonStringifier::Result v8::internal::JsonStringifier::Serialize_<false>(v8::internal::Handle<v8::internal::Object>, bool, v8::internal::Handle<v8::internal::Object>) [decentraland-node]
16: v8::internal::JsonStringifier::Stringify(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [decentraland-node]
17: v8::internal::Builtin_JsonStringify(int, v8::internal::Object**, v8::internal::Isolate*) [decentraland-node]
18: 0x347968f2b17d
./bin/start: line 31:  1811 Aborted                 (core dumped) ./bin/decentraland-node --fast --loglevel=info --port=2301 --httpport=8301 --contentport=9301 --prefix="data" --n=testnet --apikey=$RPC_API_KEY --startminer=$START_MINER
berry@diamond:~/sft/land-bronze-new$
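
For context on that assertion: JavaScript numbers are IEEE-754 doubles, so integers above 2^53-1 cannot be represented exactly. The readU64 in the trace (lib/utils/encoding.js) therefore rejects 64-bit values whose high bits would lose precision. A minimal sketch of that style of check, assuming a bcoin-like encoding layer; the repo's actual implementation may differ:

const assert = require('assert');

// Read an unsigned 64-bit little-endian integer as a JS number.
// Throws when the value cannot be represented exactly in a double.
function readU64(data, off) {
  const lo = data.readUInt32LE(off);
  const hi = data.readUInt32LE(off + 4);
  // hi * 2^32 + lo stays under 2^53 only if hi < 2^21, i.e. the top
  // 11 bits of the high word must be zero.
  assert((hi & 0xffe00000) === 0, 'Number exceeds 2^53-1');
  return hi * 0x100000000 + lo;
}

So hitting this during a chain reorg likely means the undo-coin data being deserialized was corrupt or misaligned, producing a garbage varint.
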
thoragio commented 7 years ago

I grepped my debug.log and there is not a single occurrence of the Number exceeds 2^53-1 error in it.

ghost commented 7 years ago

My system also shows a steady increase in memory usage until it hits the physical memory limit, then a sudden drop, presumably when it crashes. I think this supports the memory-leak theory.

[chart: node RAM usage over time]
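
One low-effort way to confirm the trend from inside the process rather than from system graphs (a hypothetical diagnostic, not part of the repo): log V8 heap and RSS periodically. If heapUsed grows without bound it is a JS-level leak; if only RSS grows, the memory is being held by native code or child processes.

// Hypothetical diagnostic: print memory stats once a minute.
setInterval(() => {
  const m = process.memoryUsage();
  const mb = (n) => (n / 1024 / 1024).toFixed(1);
  console.log(`[mem] rss=${mb(m.rss)}MB heapUsed=${mb(m.heapUsed)}MB heapTotal=${mb(m.heapTotal)}MB`);
}, 60 * 1000);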

hollowbit commented 7 years ago

I am having the same error on Windows. However, when I run it from the Git Bash environment it works, but then crashes for another reason.

ghost commented 7 years ago

Interesting observation, @vedi0boy. I can confirm that running from Git Bash on Windows works. What is the error you see there?

hollowbit commented 7 years ago

I get this error: C:\Program Files\nodejs\node.exe: src\async-wrap.cc:619: Assertion `(env()->current_async_id()) == (0)' failed.

and when I access the web control panel, I get these errors that pop up in the git terminal:

Executing RPC call getblockchaininfo with params []
Executing RPC call getminerinfo with params []
Executing RPC call getnewaddress with params []
Getting mined tiles

[ERROR] Error trying to execute the rpc call getminerinfo: connect ECONNREFUSED 127.0.0.1:8301
[ERROR] Error trying to execute the rpc call getnewaddress: connect ECONNREFUSED 127.0.0.1:8301
[ERROR] Error trying to execute the rpc call getblockchaininfo: connect ECONNREFUSED 127.0.0.1:8301
[ERROR] Error trying to get the tiles. connect ECONNREFUSED 127.0.0.1:8301

It seems like no server is starting on port 8301 to serve the stats and tile data. Other than that, I can access the web control panel and I think everything else is working; I just can't see my stats. But that is not the issue for this thread.
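
Those ECONNREFUSED errors only mean that nothing is accepting connections on 127.0.0.1:8301, so the RPC listener probably never started or has died. A quick hypothetical check to separate "listener missing" from "panel misconfigured":

// Hypothetical check: is anything listening on the RPC port?
const net = require('net');

const sock = net.connect(8301, '127.0.0.1');
sock.on('connect', () => {
  console.log('port 8301 is open; suspect the panel configuration');
  sock.end();
});
sock.on('error', (err) => {
  console.log('port 8301 unreachable:', err.code); // e.g. ECONNREFUSED
});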

ghost commented 7 years ago

Have also seen the first one, mentioned in #45.

My Windows node running on Git Bash also shows a steady increase in memory usage like the others.

ghost commented 7 years ago

@thoragio Do you know the last commit where this wasn't happening? (The testnet v2 commit?)

ghost commented 7 years ago

I believe thoragio had the testnet v2 commit running well for a while. Assuming that's true, this compare shows the changes from that commit to the latest one I tested, which exhibits the memory leak.

https://github.com/decentraland/bronzeage-node/compare/3c987dab84245c38cbfc8118c76609e58ccda3e9...1f74a01b037c18f24ba4fc42e9e9d323ef1ccdc7

Something in these changes has caused a memory leak.

ghost commented 7 years ago

There are some changes around the miner. Has anyone tested for this without the miner running? @thoragio @CJHackerz @vedi0boy

ghost commented 7 years ago

Running with the miner off still shows the memory leak.

[chart: memory usage with the miner off]

ghost commented 7 years ago

There's a new --loglevel=info option on the node, so I tried running without it. Same memory usage pattern.

ghost commented 7 years ago

Notably, the node reports nothing but peer refills for a long period while memory usage rises. It's seemingly not processing blocks at all while this happens.

..
[info] Removed loader peer (104.131.97.112:2301).
[debug] Connecting to 217.63.132.166:2301.
[info] Setting loader peer (217.63.132.166:2301).
[info] Removed loader peer (217.63.132.166:2301).
[debug] Connecting to 174.138.56.173:2301.
[info] Setting loader peer (174.138.56.173:2301).
[debug] Refilling peers (7/8).
[debug] Connecting to 84.61.77.78:2301.
[debug] Refilling peers (5/8).
[debug] Connecting to 108.172.14.147:2301.
[debug] Connecting to 73.61.20.159:2301.
[debug] Connecting to 212.178.48.7:2301.
[debug] Refilling peers (5/8).
[debug] Connecting to 104.236.66.136:2301.
[debug] Connecting to 72.193.116.4:2301.
[debug] Connecting to 45.55.180.162:2301.
[debug] Refilling peers (7/8).
[debug] Connecting to 67.11.251.66:2301.
[info] Removed loader peer (174.138.56.173:2301).
[debug] Connecting to 186.138.98.37:2301.
[info] Setting loader peer (186.138.98.37:2301).
[debug] Refilling peers (6/8).
[debug] Connecting to 186.136.130.242:2301.
[debug] Connecting to 185.35.36.100:2301.
[debug] Refilling peers (7/8).
[debug] Connecting to 216.131.19.58:2301.
[debug] Refilling peers (5/8).
[debug] Connecting to 212.178.50.0:2301.
[debug] Connecting to 78.69.188.118:2301.
[debug] Connecting to 174.138.70.142:2301.
[info] Removed loader peer (186.138.98.37:2301).
[debug] Connecting to 104.236.82.120:2301.
[info] Setting loader peer (104.236.82.120:2301).
[debug] Refilling peers (7/8).
[debug] Connecting to 138.197.88.133:2301.
[info] Removed loader peer (104.236.82.120:2301).
[debug] Connecting to 67.79.169.251:2301.
[info] Setting loader peer (67.79.169.251:2301).
[debug] Refilling peers (6/8).
[debug] Connecting to 73.61.20.159:2301.
[debug] Connecting to 217.63.132.176:2301.
[debug] Refilling peers (6/8).
[debug] Connecting to 46.101.155.74:2301.
[debug] Connecting to 91.61.62.108:2301.
[debug] Refilling peers (7/8).
..

@thoragio Does yours look like this or do you see something different?

ghost commented 7 years ago

Running with the web interface off, the miner off, and without the --loglevel=info option still shows the memory leak.

thoragio commented 7 years ago

@berrycoiled I can't remember the last commit where this was not happening. I pulled the latest changes shortly after the v2 testnet started and I don't remember having this problem then. I then pulled the changes from 12 July, and that's when I noticed the problem.

I do see log entries similar to yours above, but I don't see my node sitting stuck on them for very long; maybe a second or two, and then more log entries follow. As they happen, though, my memory usage goes up by 1 or 2 MB after every couple of Refilling peers events.

Note the OOM errors below that I have grepped out of the debug.log.

[30946:0715/201046.185209:FATAL:memory_linux.cc(35)] Out of memory.
[30946:0715/201046.185209:FATAL:memory_linux.cc(35)] Out of memory.
[30946:0715/201046.185209:FATAL:memory_linux.cc(35)] Out of memory.
[30330:0715/200929.946828:FATAL:memory_linux.cc(35)] Out of memory.
[30330:0715/200929.946828:FATAL:memory_linux.cc(35)] Out of memory.
[30330:0715/200929.946828:FATAL:memory_linux.cc(35)] Out of memory.
/proc/self/exe: error while loading shared libraries: cannot allocate symbol search list: Cannot allocate memory
/proc/self/exe: error while loading shared libraries: cannot allocate symbol search list: Cannot allocate memory
/proc/self/exe: error while loading shared libraries: cannot allocate symbol search list: Cannot allocate memory
[31005:0715/200952.338976:FATAL:memory_linux.cc(35)] Out of memory.
[31005:0715/201000.771296:FATAL:memory_linux.cc(35)] Out of memory.
[31005:0715/200952.338976:FATAL:memory_linux.cc(35)] Out of memory.
[31005:0715/201000.771296:FATAL:memory_linux.cc(35)] Out of memory.
[31005:0715/200952.338976:FATAL:memory_linux.cc(35)] Out of memory.
[31005:0715/201000.771296:FATAL:memory_linux.cc(35)] Out of memory.
../../third_party/tcmalloc/chromium/src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48
../../third_party/tcmalloc/chromium/src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48
../../third_party/tcmalloc/chromium/src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48

ghost commented 7 years ago

Some modules had their versions changed between those commits, so I tried running with the old versions. Same memory leak.

Very strange, this bug.

ghost commented 7 years ago

I found a memory_linux.cc file in Chromium, for some reason only in the Android tree:

https://android.googlesource.com/platform/external/chromium_org/+/a140131/base/process/memory_linux.cc

But possibly it has made it into Electron. That would make this a WebTorrent memory leak.

In their issues they have dealt with several memory leaks: https://github.com/webtorrent/webtorrent/issues?utf8=%E2%9C%93&q=is%3Aissue%20memory%20leak

ghost commented 7 years ago

Just tried the testnet v2 commit that was previously working. It ran out of memory and crashed after less than an hour.

So possibly this is not a code problem at all; something may have changed on the network that none of the code handles well.

Some of the WebTorrent memory-leak issues were about large files. If larger-than-normal content has been published to the network recently, maybe it's simply crashing WebTorrent.

A possible direction for a solution: reconfigure WebTorrent to limit its memory use, or otherwise prevent it from exhausting memory.
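
WebTorrent does not expose a direct memory cap, but its client constructor accepts a maxConns option (maximum connections per torrent), which is one of the few knobs available for reducing per-torrent overhead. A sketch of that direction; whether it helps here is untested, and the value chosen is arbitrary:

// Sketch: constrain per-torrent connections when creating the client.
const WebTorrent = require('webtorrent');

const client = new WebTorrent({
  maxConns: 10 // documented option; the default is 55 connections per torrent
});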

thoragio commented 7 years ago

Well, it's been holding steady now for more than 2 hours. The memory leak is not manifesting any longer, at least for now.

[image: memory usage graph, holding steady]

And I have changed absolutely nothing on my machine since it was last crashing consistently a couple of days ago. I literally just restarted the node (without even a reboot).

ghost commented 7 years ago

Confirmed that disabling torrent acquisition solves this. It synced the whole blockchain to ~20,000 blocks in about a minute, with <100 MB memory usage.

The node is just getting overloaded trying to download too many torrents at once. I think we can solve this by wrapping torrent acquisition in a queue; a sketch follows.
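
A minimal sketch of that idea, as hypothetical code rather than the repo's implementation: hold incoming torrent requests in a FIFO queue and keep only a few downloads active at once. MAX_ACTIVE and the surrounding WebTorrent usage are illustrative assumptions.

// Hypothetical queue around torrent acquisition: at most MAX_ACTIVE
// downloads run concurrently; the rest wait their turn.
const WebTorrent = require('webtorrent');

const client = new WebTorrent();
const MAX_ACTIVE = 3; // arbitrary cap; tune against available RAM
const pending = [];
let active = 0;

function fetchContent(magnetURI) {
  pending.push(magnetURI);
  drain();
}

function drain() {
  while (active < MAX_ACTIVE && pending.length > 0) {
    active += 1;
    client.add(pending.shift(), (torrent) => {
      torrent.on('done', () => {
        active -= 1;
        drain(); // a slot freed up; start the next queued download
      });
    });
  }
}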

ghost commented 7 years ago

Looked into tweaking something to fix this. WebTorrent is awful and can't be fixed: it spawns multiple Chromium processes per torrent, and there's apparently no way around this. It's not suitable for use as a persistent seeder.

The solution is to drop WebTorrent and move to IPFS.

thoragio commented 7 years ago

Long-term, WebTorrent is still eating away at memory, even though it is relatively stable compared to a few days ago:

[image: memory usage graph]

hollowbit commented 7 years ago

At this point I think I will just wait until Iron Age to start mining.

thoragio commented 7 years ago

There will be no mining in Iron Age. Enjoy it in the Bronze Age while it lasts!! 🎉

hollowbit commented 7 years ago

If there is no mining, then what is the point? It makes it seem like the devs are just trying to create some fake value in a virtual world in order to make money. I guess there is no point in me trying to mine anymore, since my progress will be lost anyway.

eordano commented 7 years ago

Mining served two purposes:

1. Providing decentralized consensus on who owns each parcel of land.
2. Deciding how new land gets distributed.

We can hire Ethereum for the first point; it's way better than maintaining our own p2p network and miners. But we still need some way to decide on number 2. Our idea is to make land expensive, raising money for the initial setup of the project. Open to hearing feedback on this, thanks!

hollowbit commented 7 years ago

I saw N.O.D.E.'s video on how you guys will start using the Ethereum network, and I think it makes sense. Good luck with the project. I will keep tabs on it.

thoragio commented 7 years ago

I'm closing this issue because @berrycoiled has pinpointed this as an issue with WebTorrent, not the node itself.