HelixNetwork / pendulum

Pendulum is a distributed messaging protocol that enables globally available tamper-proof timestamps :hourglass_flowing_sand:
https://dev.hlx.ai

Error appending to the db, and accessing the db while node is under load #97

Closed · dnck closed this issue 5 years ago

dnck commented 5 years ago

Expected Behavior

I expect the fullnode to be able to handle 10 transaction requests per second issued by JavaScript clients.

Current Behavior

After about 3000 transactions, the fullnode stops responding to the client requests.

Failure Information (for bugs)

There are two error messages in the log: one from the Node class and one from the LatestSolidMilestoneTrackerImpl class. Both appear to be related to a "too many open files" error. Below, I provide the output of ulimit from my OS, which has a default open-files limit of 1024.

Steps to Reproduce

  1. Build two fresh fullnodes from the dev branch on the same computer.
  2. Configure the two fullnodes to be neighbors of one another.
  3. Make one fullnode a master issuing milestones.
  4. Make the other fullnode accept API transaction requests on an IP address in your LAN.
  5. Spin up the two fullnodes.
  6. Go to a different computer in your LAN.
  7. Launch 10 client processes, each firing 1 transaction every 1000 ms (script provided below). Personally, I just opened 10 terminal windows and typed node valueless_spammer.js in each window.
  8. Wait for a bit and witness the errors.

Context

Please provide any relevant information about your setup. This is important in case the issue is only reproducible under certain conditions.

Failure Logs

The two errors from the log are given below.

07/17 13:51:28.999 [Latest Solid Milestone Tracker] ERROR n.h.h.s.m.i.LatestSolidMilestoneTrackerImpl - error while updating the solid milestone
net.helix.hlx.service.milestone.MilestoneException: unexpected error while checking for new latest solid milestones
    at net.helix.hlx.service.milestone.impl.LatestSolidMilestoneTrackerImpl.trackLatestSolidMilestone(LatestSolidMilestoneTrackerImpl.java:163) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.LatestSolidMilestoneTrackerImpl.latestSolidMilestoneTrackerThread(LatestSolidMilestoneTrackerImpl.java:182) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.utils.thread.BoundedScheduledExecutorService.lambda$wrapTask$6(BoundedScheduledExecutorService.java:749) ~[helix-0.5.7.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_201]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_201]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_201]
Caused by: net.helix.hlx.service.ledger.LedgerException: unexpected error while generating the StateDiff for milestone #142 (net.helix.hlx.model.TransactionHash@1b27314a)
    at net.helix.hlx.service.ledger.impl.LedgerServiceImpl.generateStateDiff(LedgerServiceImpl.java:307) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.ledger.impl.LedgerServiceImpl.applyMilestoneToLedger(LedgerServiceImpl.java:105) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.LatestSolidMilestoneTrackerImpl.applySolidMilestoneToLedger(LatestSolidMilestoneTrackerImpl.java:199) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.LatestSolidMilestoneTrackerImpl.trackLatestSolidMilestone(LatestSolidMilestoneTrackerImpl.java:153) ~[helix-0.5.7.jar:na]
    ... 9 common frames omitted
Caused by: net.helix.hlx.service.milestone.MilestoneException: error while updating the snapshotIndex of net.helix.hlx.controllers.TransactionViewModel@ca7cd8bc
    at net.helix.hlx.service.milestone.impl.MilestoneServiceImpl.updateMilestoneIndexOfSingleTransaction(MilestoneServiceImpl.java:411) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.MilestoneServiceImpl.updateMilestoneIndexOfMilestoneTransactions(MilestoneServiceImpl.java:391) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.MilestoneServiceImpl.updateMilestoneIndexOfMilestoneTransactions(MilestoneServiceImpl.java:136) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.ledger.impl.LedgerServiceImpl.generateStateDiff(LedgerServiceImpl.java:292) ~[helix-0.5.7.jar:na]
    ... 12 common frames omitted
Caused by: org.rocksdb.RocksDBException: While open a file for appending: db/000031.log: Too many open files
    at org.rocksdb.RocksDB.put(Native Method) ~[helix-0.5.7.jar:na]
    at org.rocksdb.RocksDB.put(RocksDB.java:472) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.storage.rocksDB.RocksDBPersistenceProvider.update(RocksDBPersistenceProvider.java:364) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.storage.Tangle.updatePersistenceProvider(Tangle.java:127) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.storage.Tangle.update(Tangle.java:120) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.controllers.TransactionViewModel.update(TransactionViewModel.java:180) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.controllers.TransactionViewModel.setSnapshot(TransactionViewModel.java:624) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.service.milestone.impl.MilestoneServiceImpl.updateMilestoneIndexOfSingleTransaction(MilestoneServiceImpl.java:409) ~[helix-0.5.7.jar:na]
    ... 15 common frames omitted
07/17 13:51:33.794 [pool-11-thread-4] ERROR net.helix.hlx.network.Node - Error accessing persistence store.
org.rocksdb.RocksDBException: While open a file for appending: db/000031.log: Too many open files
    at org.rocksdb.RocksDB.write0(Native Method) ~[helix-0.5.7.jar:na]
    at org.rocksdb.RocksDB.write(RocksDB.java:602) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.storage.rocksDB.RocksDBPersistenceProvider.saveBatch(RocksDBPersistenceProvider.java:285) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.storage.Tangle.saveBatch(Tangle.java:80) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.controllers.TransactionViewModel.store(TransactionViewModel.java:310) ~[helix-0.5.7.jar:na]
    at net.helix.hlx.network.Node.processReceivedData(Node.java:462) [helix-0.5.7.jar:na]
    at net.helix.hlx.network.Node.processReceivedDataFromQueue(Node.java:435) [helix-0.5.7.jar:na]
    at net.helix.hlx.network.Node.lambda$spawnProcessReceivedThread$8(Node.java:716) [helix-0.5.7.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_201]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_201]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_201]
dzhelezov commented 5 years ago

This seems to be related to https://github.com/facebook/rocksdb/issues/164 and https://github.com/facebook/rocksdb/issues/4089

dnck commented 5 years ago

This seems to be related to facebook/rocksdb#164 and facebook/rocksdb#4089

Stopping the node and rebooting the computer resolves the error. Still not sure about the cause, though.

oracle58 commented 5 years ago

Hm, may I see the script for the "value-less" spammer? Based on some of the exceptions, it seems that value transactions are being spammed. It could be the case that they spend from the same seed, resulting in a double spend and inconsistent tips (the fix would then be to use a different seed for each spammer).

In the case of spamming value transfers from the same seed, the consistent tips get referenced by inconsistent value transfers. The old tips are then removed from the tips list and there are no new consistent tips for newly arriving txs to reference, which probably results in the mentioned errors. In a production environment, nodes will constantly be receiving consistent transactions, so there are always tips that may be referenced. If we are spamming value transactions from the same seed on 10 machines, that is not a realistic "spam" test case, as no user would double spend on purpose unless issuing an attack.

What could be done to fix it now (if it is related to double spends) is to not remove a consistent tip from the tips list until it is referenced by a consistent transaction (this might create a lot of side issues, though). But, as already mentioned, in production the node will be receiving "normal" txs, so the node would not halt, as there are always new consistent tips.

The odd thing is that when I spammed value-less transactions, even under a way larger load (50 spam instances at a 100 ms rate), I never ran into this issue.

dzhelezov commented 5 years ago

Check this one and the references therein: https://stackoverflow.com/questions/45510290/rocksdb-too-many-sst-files-of-very-small-size. I believe the problem is that RocksDB opens too many files; it can be fixed by fine-tuning the config.
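
For example, something along these lines when opening the database. This is just a rough sketch of the RocksJava Options API, not the node's actual RocksDBPersistenceProvider wiring, and the class name and values here are illustrative:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbTuningSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()) {
            options.setCreateIfMissing(true);
            // Cap the number of file handles RocksDB keeps open at once,
            // so it stays well below the OS open-files limit (1024 here).
            options.setMaxOpenFiles(256);
            // Larger write buffers and SST target sizes mean fewer, bigger
            // files on disk, which also keeps the handle count down.
            options.setWriteBufferSize(64 * 1024 * 1024);
            options.setTargetFileSizeBase(64 * 1024 * 1024);
            try (RocksDB db = RocksDB.open(options, "db")) {
                db.put("key".getBytes(), "value".getBytes());
            }
        }
    }
}

Capping max_open_files trades a bit of read performance for a bounded number of handles, which should at least keep the node under the ulimit.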

oracle58 commented 5 years ago

Dan, could you share the configuration of the two nodes?

oracle58 commented 5 years ago

In the case of spamming value transfers from the same seed, the consistent tips get referenced by inconsistent value transfers. The old tips are then removed from the tips list and there are no new consistent tips for newly arriving txs to reference, which probably results in the mentioned errors. In a production environment, nodes will constantly be receiving consistent transactions, so there are always tips that may be referenced. If we are spamming value transactions from the same seed on 10 machines, that is not a realistic "spam" test case, as no user would double spend on purpose unless issuing an attack.

What could be done to fix it now (if it is related to double spends) is to not remove a consistent tip from the tips list until it is referenced by a consistent transaction (this might create a lot of side issues, though). But, as already mentioned, in production the node will be receiving "normal" txs, so the node would not halt, as there are always new consistent tips.

After all, this might rather be related to the issue you/we previously had. Are the nodes you are running in the same directory, with different db path configs?

dnck commented 5 years ago

Hm, may I see the script for the "value-less" spammer?

Sure, here is valueless_spammer.js:

const HELIX = require("@helixnetwork/core");
const CONVERTER = require("@helixnetwork/converter");

const HOST = "http://192.168.2.98:16000";
const SND = "0000000000000000000000000000000000000000000000000000000000000000";
const RCV = "0000000000000000000000000000000000000000000000000000000000000000";
const TOKENS = 0;
const PERIODICITY = 1000; // send a message every PERIODICITY ms
const DEPTH = 3;
const MINWEIGHTMAGNITUDE = 2;

const COMPOSER = HELIX.composeAPI({
  provider: HOST
});

let stored_tx_bytes;
const transfer_object = {
  address: RCV,
  value: TOKENS,
  message: CONVERTER.asciiToHBytes("Hello"),
  tag: CONVERTER.asciiToHBytes("world!")
};

// Prepare a zero-value transfer and broadcast it to the node.
const send_tx = function () {
  COMPOSER
    .prepareTransfers(SND, [transfer_object])
    .then(function (tx_bytes) {
      stored_tx_bytes = tx_bytes;
      return COMPOSER.sendHBytes(stored_tx_bytes, DEPTH, MINWEIGHTMAGNITUDE);
    })
    .then(results => console.log(JSON.stringify(results)))
    .catch(err => console.log(err));
};

setInterval(send_tx, PERIODICITY);

Follower config.ini

[SBX]
API_HOST = 192.168.2.98
PORT = 16000
UDP_RECEIVER_PORT = 16100
TCP_RECEIVER_PORT = 16200
DEBUG = true
GRAPH_ENABLED = false
NEIGHBORS = udp://127.0.0.1:15100
HXI_DIR =  hxi
HEADLESS = true
DB_PATH = db
ZMQ_ENABLED = false
ZMQ_ENABLE_IPC = false
ZMQ_ENABLE_TCP = false
LOCAL_SNAPSHOTS_DEPTH = 5
MS_DELAY = 0
SPAM_DELAY = 0

Leader config.ini

[SBX]
API_HOST = localhost
PORT = 15000
UDP_RECEIVER_PORT = 15100
TCP_RECEIVER_PORT = 15200
DEBUG = true
GRAPH_ENABLED = false
NEIGHBORS = udp://127.0.0.1:16100
HXI_DIR =  hxi
HEADLESS = true
DB_PATH = db
ZMQ_ENABLED = true
ZMQ_ENABLE_IPC = false
ZMQ_ENABLE_TCP = true
ZMQ_PORT = 6550
LOCAL_SNAPSHOTS_DEPTH = 5
MS_DELAY = 5
SPAM_DELAY = 0

dnck commented 5 years ago

Are the nodes you are running in the same directory, with different db path configs?

Yes

dnck commented 5 years ago

I believe the problem is that RocksDB opens too many files; it can be fixed by fine-tuning the config.

I was going to follow these suggestions (https://ro-che.info/articles/2017-03-26-increase-open-files-limit) and see if they solve the problem. I am on a systemd-based Ubuntu desktop, not headless, so that might be the issue.

dnck commented 5 years ago

The odd thing is that when I spammed value-less transactions, even under a way larger load (50 spam instances at a 100 ms rate), I never ran into this issue.

What kind of OS was the fullnode running on? AWS? Or PC?

oracle58 commented 5 years ago

Are the nodes you are running in the same directory, with different db path configs?

Yes

I think that's the issue, tbh. I am not sure what the expected behavior is here, but I usually run different instances from different directories for testing.

dnck commented 5 years ago

I think that's the issue, tbh. I am not sure what the expected behavior is here, but I usually run different instances from different directories for testing.

Oh, I am sorry. I misread what you wrote. I am running the nodes in different directories, with different db paths.

oracle58 commented 5 years ago

The odd thing is that when I spammed value-less transactions, even under a way larger load (50 spam instances at a 100 ms rate), I never ran into this issue.

What kind of OS was the fullnode running on? AWS? Or PC?

We did it locally on Ubuntu and on AWS/Docker. I don't think that's the issue.

oracle58 commented 5 years ago

I think that's the issue, tbh. I am not sure what the expected behavior is here, but I usually run different instances from different directories for testing.

Oh, I am sorry. I misread what you wrote. I am running the nodes in different directories, with different db paths.

Okay

oracle58 commented 5 years ago

Could you please share the db paths that you specified for each instance?

dnck commented 5 years ago

Although, I am doing something that might be worth mentioning.

I use a Python script that compiles (using Maven, with all the tests and whatnot) one fullnode in the first directory, A.

Then, the script copies the compiled target directory into a separate location on the path outside of directory A.

dnck commented 5 years ago

Could you please share the db paths that you specified for each instance?

LEADER /home/hlx-dev/helix/testnet/fork0/testnet-1.0/db /home/hlx-dev/helix/testnet/fork0/testnet-1.0/spent-addresses-db

FOLLOWER /home/hlx-dev/helix/testnet/fork1/testnet-1.0/db /home/hlx-dev/helix/testnet/fork1/testnet-1.0/spent-addresses-db

oracle58 commented 5 years ago

Check this one and the references therein: https://stackoverflow.com/questions/45510290/rocksdb-too-many-sst-files-of-very-small-size. I believe the problem is that RocksDB opens too many files; it can be fixed by fine-tuning the config.

I think that will just fix the specific issue without solving the root issue (causing too many files to open). It is also worth noting that limitations on max_open_files and other RocksDB configs are imposed in order to optimize performance.

I am still curious why Dario and I aren't running into this issue when stress testing. I will take some time and try to reproduce.

dnck commented 5 years ago

I am still curious why Dario and I aren't running into this issue when stress testing. I will take some time and try to reproduce.

Thank you!

oracle58 commented 5 years ago

Could you please share the db paths that you specified for each instance?

LEADER /home/hlx-dev/helix/testnet/fork0/testnet-1.0/db /home/hlx-dev/helix/testnet/fork0/testnet-1.0/spent-addresses-db

FOLLOWER /home/hlx-dev/helix/testnet/fork1/testnet-1.0/db /home/hlx-dev/helix/testnet/fork1/testnet-1.0/spent-addresses-db

Looks fine, but maybe, just for testing purposes, you could try removing the line DB_PATH = db and letting it use the default paths.

oracle58 commented 5 years ago

Btw, were you able to reproduce this error on an IOTA node (with their lib)? That would be quite interesting actually!

dnck commented 5 years ago

Looks fine, but maybe, just for testing purposes, you could try removing the line DB_PATH = db and letting it use the default paths.

Will do.

without solving the root issue (causing too many files to open)

Right, I have just now edited my /etc/systemd/user.conf and /etc/systemd/system.conf to include DefaultLimitNOFILE=20000.
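
Concretely, the lines I added look roughly like this (assuming the standard [Manager] section those files use; 20000 is just the value I picked):

[Manager]
DefaultLimitNOFILE=20000

As far as I know, a re-login (or reboot) is needed before the new default applies to user sessions.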

But this seems like it will only hide the problem, if there is one, with the Java program.

Btw, were you able to reproduce this error on an IOTA node (with their lib)? That would be quite interesting actually!

I will try that today.