General Fabric Guidance for K8s env
General Caliper Guidance
Notes about fabric performance
| Name | Succ | Fail | Send Rate (TPS) | Max Latency (s) | Min Latency (s) | Avg Latency (s) | Throughput (TPS) |
|------|------|------|-----------------|-----------------|-----------------|-----------------|------------------|
| no-op-evaluate | 2400120 | 118 | 19905.1 | 1.53 | 0.00 | 0.30 | 19903.6 |
For a 20000 concurrency limit:
| Name | Succ | Fail | Send Rate (TPS) | Max Latency (s) | Min Latency (s) | Avg Latency (s) | Throughput (TPS) |
|------|------|------|-----------------|-----------------|-----------------|-----------------|------------------|
| no-op-evaluate | 2395775 | 4465 | 19897.2 | 11.28 | 0.00 | 0.93 | 19809.0 |
For blind writes: note the backlog remained stable for this run (i.e. it didn't grow gradually or exponentially, so these TPS figures are sustainable over longer periods of time).
block_cut_time: 1s, block_size: 50, preferred_max_bytes: 512 KB
| Name | Succ | Fail | Send Rate (TPS) | Max Latency (s) | Min Latency (s) | Avg Latency (s) | Throughput (TPS) |
|------|------|------|-----------------|-----------------|-----------------|-----------------|------------------|
| create-asset-100 | 360150 | 0 | 2996.2 | 2.04 | 0.28 | 0.73 | 2983.6 |
Gateway peer: Max CPU 60%, Max Memory 5.16%, Max Disk 51.1 MB/s. Orderer: Max CPU 11%, Max Memory 2.98%, Max Disk 23.1 MB/s. (For disk I/O I see spikes of 80 MB/s which are not captured by Prometheus, I think due to the 5s sampling interval.)
Note that I am deliberately not including any details about the machines these were run on, as these are NOT to be considered any sort of formal benchmark results. I will say that the machines are bare metal, each running a single Fabric process.
What can be said about orderer parameters such as the block cutting timeout and the block triggering thresholds (transaction count, max block size, etc.)? What other parameters could affect a peer/orderer to improve performance, or alter its characteristics to suit a certain kind of load profile?
Transaction throughput is significantly affected by payload size as well as ordering service settings.
You might want to try to configure the ordering service with more transactions per block and longer block cutting times
to see if that helps. We have seen this increase the overall throughput at the cost of additional latency.
Your network throughput might also be a factor, particularly if your peer nodes are not running at very high CPU utilization.
K8s specific
The following three parameters work together to control when a block is cut, based on a combination of setting the maximum number of transactions in a block as well as the block size itself.
Absolute max bytes: Set this value to the largest block size in bytes that can be cut by the orderer. No transaction may be larger than the value of Absolute max bytes. Usually, this setting can safely be two to ten times larger than your Preferred max bytes. Note: the maximum size permitted is 99MB.
Max message count: Set this value to the maximum number of transactions that can be included in a single block.
Preferred max bytes: Set this value to the ideal block size in bytes; it must be less than Absolute max bytes. A minimum transaction size, one that contains no endorsements, is around 1KB. If you add 1KB per required endorsement, a typical transaction size is approximately 3-4KB. Therefore, it is recommended to set the value of Preferred max bytes to be around Max message count * expected average tx size. At run time, whenever possible, blocks will not exceed this size. If a transaction arrives that would cause the block to exceed this size, the block is cut and a new block is created for that transaction. If a single transaction is larger than Preferred max bytes but does not exceed Absolute max bytes, it will still be included, in a block containing only that transaction. Together, these parameters can be configured to optimize throughput of your orderer.
Batch timeout
Set the Timeout value to the amount of time, in seconds, to wait after the first transaction arrives before cutting the block. If you set this value too low, you risk preventing the batches from filling to your preferred size. Setting this value too high can cause the orderer to wait too long before cutting blocks, degrading overall performance. In general, we recommend that you set the value of Batch timeout to be at least max message count / maximum transactions per second.
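For reference, a minimal sketch of where these settings live in the Orderer section of configtx.yaml (the values are illustrative only, and for an existing channel they are changed via a channel configuration update rather than by editing this file):

```yaml
Orderer: &OrdererDefaults
  OrdererType: etcdraft
  # Batch timeout: how long to wait before cutting a block
  BatchTimeout: 2s
  BatchSize:
    # Max message count: maximum number of transactions per block
    MaxMessageCount: 500
    # Absolute max bytes: hard cap on block size (must not exceed 99 MB)
    AbsoluteMaxBytes: 10 MB
    # Preferred max bytes: target block size, roughly MaxMessageCount * average tx size
    PreferredMaxBytes: 2 MB
```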
More information to be consolidated
Ensure persistent storage IOPS is not a bottleneck. Persistent storage IO requests can become a bottleneck in some solutions. 10 IOPS/GB+ is recommended.
Performance generally scales with CPU allocated to peer nodes. Providing each peer and CouchDB (if used) with the maximum CPU capacity is recommended.
Increasing the ordering node sendBufferSize default of 10 can improve ordering service performance (100 is the recommended starting point).
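As a sketch of where this is set, assuming a recent Fabric release where the buffer size sits under General.Cluster in the ordering node's orderer.yaml (worth confirming for your version):

```yaml
General:
  Cluster:
    # raise the per-stream send buffer from the old default of 10
    SendBufferSize: 100
```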
Check CouchDB logs for warnings such as "The number of documents examined is high in proportion to the number of results returned. Consider adding a more specific index to improve this." Indexes for CouchDB rich queries are essential for performance; this warning indicates that too many documents are being scanned (effectively full table scans) for a relatively low number of results returned.
For ordering service nodes, use monitoring to determine load and CPU pressure. Generally, 1 CPU/2GB RAM prevents ordering service nodes from becoming a bottleneck.
A larger block size and timeout could increase throughput (but latency would increase as well). The block size needs to be matched to the transaction arrival rate. For example, if the transaction arrival rate is 100 per second with a block size of 3000 transactions and a 10 second timeout, every block will be cut by the timeout and contain only around 1000 transactions, so the extra block capacity just adds commit latency. If the arrival rate is higher for the same configuration, blocks fill before the timeout and the larger block size starts to pay off in throughput.
A 1-out-of-N endorsement policy (e.g. `OR('Org1MSP.peer','Org2MSP.peer')`) is cheaper than one requiring multiple signatures. To achieve maximum performance, try to reduce the number of sub-policies within an endorsement policy and the number of signatures required. If a complex endorsement policy is a must, run more peers per organization to load balance the endorsement requests, although this might not improve throughput drastically.
Deploying more than one channel would increase parallelism in block processing and hence would improve performance. In general, do not let the number of channels exceed the number of CPU cores. Again, this is not a hard limit: if the load on each channel is not highly correlated (i.e. not every channel is contending for resources at the same time), you could have more channels than CPU cores.
Using goleveldb over CouchDB would improve throughput. If CouchDB is used for its query capability, use a higher block size (this relates to the bulk read/write API provided by CouchDB) to improve performance.
Experiment by assigning different values to GOMAXPROCS (a Go runtime environment variable) and validatorPoolSize in core.yaml. The values are highly correlated with the number of vCPUs; assign a value greater than the number of vCPUs and see whether performance improves. SSD is recommended over HDD for both the ordering service and the peer.
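A minimal sketch of these knobs, with illustrative values (validatorPoolSize defaults to the number of CPUs if unset):

```yaml
# core.yaml (peer)
peer:
  # number of goroutines that validate transactions in parallel
  validatorPoolSize: 24
# GOMAXPROCS is not a core.yaml property; set it in the peer's environment,
# e.g. GOMAXPROCS=24, to experiment with Go scheduler parallelism
```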
Throughput is proportional to the number of vCPUs allocated.
A network bandwidth of 1 Gbps is needed (again dependent on the application)
When using external CouchDB state database, read delays during endorsement and validation phases have historically been a performance bottleneck.
With Fabric v2.0, a new peer cache replaces many of these expensive lookups with fast local cache reads. The cache size can be configured by using the core.yaml property cacheSize.
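As a sketch, the cache setting sits under the CouchDB configuration in core.yaml (the value is in MB; the exact path is worth confirming for your Fabric version):

```yaml
# core.yaml (peer)
ledger:
  state:
    stateDatabase: CouchDB
    couchDBConfig:
      # state cache size in MB (Fabric v2.x)
      cacheSize: 128
```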
Prefer not to use rich queries in chaincode; use an off-chain store for that instead. If you do use them, make sure your queries are optimised and indexed (i.e. avoid queries that cannot use an index).
Review the chaincode and add CouchDB indexes for queries (an example index definition is sketched after this list):
If indexes are used, review the existing indexes and queries and fine-tune the queries to reduce the number of records returned.
Do not issue open ended or "count" queries.
Do not use $regex, $in, $and, etc.
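For illustration, a typical index definition packaged with the chaincode at META-INF/statedb/couchdb/indexes/indexOwner.json; the field names (docType, owner) are placeholders for whatever your queries actually select on:

```json
{
  "index": {
    "fields": ["docType", "owner"]
  },
  "ddoc": "indexOwnerDoc",
  "name": "indexOwner",
  "type": "json"
}
```

Queries then need selectors this index can serve, e.g. {"selector":{"docType":"asset","owner":"tom"}}, optionally with use_index to pin the query to the index.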
State database cache for improved performance on CouchDB - I doubt this will do anything for rich queries (need to check)
A further idea: Fabric does bulk update calls to CouchDB to improve CouchDB performance. This should be exploited, but you may need to increase the bulk size if you have large transaction sizes (although large transaction sizes are a bad idea). "we increased the batch setting so huge blocks that were batched helped"
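A sketch of the relevant knob, assuming the quote refers to the CouchDB bulk update setting in core.yaml:

```yaml
# core.yaml (peer)
ledger:
  state:
    couchDBConfig:
      # maximum number of records in a single CouchDB bulk update batch
      maxBatchUpdateSize: 1000
```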
- re the test: can you explain how I can do these points? 1) use remote workers, don't use local process workers 2) ensure Caliper is running on a different system to the Fabric network under test
eg
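One possible shape for this, as a rough sketch only; the configuration keys below are recalled from Caliper 0.4.x/0.5.x and should be verified against the Caliper runtime configuration docs for the version in use:

```yaml
# Caliper runtime configuration sketch for remote workers
caliper:
  worker:
    # do not fork workers as local child processes of the manager
    remote: true
    communication:
      # remote workers talk to the manager over an MQTT broker
      method: mqtt
      address: mqtt://mqtt-broker.example.com:1883   # hypothetical broker address
```

The manager (`caliper launch manager`) then runs on a machine outside the Fabric cluster, and each worker (`caliper launch worker`) runs on its own machine, so load generation does not compete with the peers and orderers for CPU.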