algorand / conduit

Algorand's data pipeline framework.
MIT License

Pipeline the Pipeline #128

Closed tzaffi closed 1 year ago

tzaffi commented 1 year ago

Description

This PR allows for moderate concurrency in the pipeline without sacrificing its sequential integrity.
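
A minimal sketch of the general idea (illustrative only, not the code added by this PR): each stage runs in its own goroutine and hands rounds to the next stage over a small FIFO channel, so the importer and processors can work a round or two ahead while the exporter still receives rounds strictly in order. The blockData type and channel depth below are assumptions for illustration.

```go
package main

import "fmt"

type blockData struct{ round uint64 }

// importer produces rounds in order and closes its output when done.
func importer(out chan<- blockData, rounds uint64) {
	defer close(out)
	for r := uint64(0); r < rounds; r++ {
		out <- blockData{round: r} // fetch block r (simulated)
	}
}

// processor transforms each round and forwards it, preserving order.
func processor(in <-chan blockData, out chan<- blockData) {
	defer close(out)
	for blk := range in {
		out <- blk // process block (simulated)
	}
}

// exporter consumes rounds; channel FIFO ordering keeps them sequential.
func exporter(in <-chan blockData) {
	for blk := range in {
		fmt.Println("exported round", blk.round)
	}
}

func main() {
	const depth = 1 // small buffer = moderate concurrency between stages
	c1 := make(chan blockData, depth)
	c2 := make(chan blockData, depth)
	go importer(c1, 5)
	go processor(c1, c2)
	exporter(c2)
}
```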

Summary of Changes

Issues

#118

TODO

Testing

E2E

pipeline_bench_test.go

Running a new benchmark test twice each on the original code and the new code, we have the following results. Note that the most pertinent result for the typical indexer DB population use case is exporter_10ms_while_others_1ms:

| Benchmark Name | Original rounds/sec | Pipelining rounds/sec | Pipelining vs. Original (%) |
| --- | --- | --- | --- |
| vanilla_2_procs_without_sleep-size-1-8 | 3077 | 3309.5 | +7% |
| uniform_sleep_of_10ms-size-1-8 | 22.32 | 79.815 | +250% |
| exporter_10ms_while_others_1ms-size-1-8 | 63.405 | 78.565 | +24% |
| importer_10ms_while_others_1ms-size-1-8 | 65.535 | 91.255 | +39% |
| first_processor_10ms_while_others_1ms-size-1-8 | 60.28 | 89.175 | +48% |
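
For orientation, a scenario like exporter_10ms_while_others_1ms can be modeled as a Go benchmark that sleeps a configurable amount per stage and reports rounds/sec. The sketch below only illustrates that measurement style under assumed stage durations; it is not the actual contents of pipeline_bench_test.go.

```go
package pipeline

import (
	"testing"
	"time"
)

// runRound simulates one round flowing through importer, processor, exporter.
func runRound(importerDelay, processorDelay, exporterDelay time.Duration) {
	time.Sleep(importerDelay)  // simulated importer work
	time.Sleep(processorDelay) // simulated processor work
	time.Sleep(exporterDelay)  // simulated exporter work
}

func BenchmarkExporter10msWhileOthers1ms(b *testing.B) {
	start := time.Now()
	for i := 0; i < b.N; i++ {
		runRound(1*time.Millisecond, 1*time.Millisecond, 10*time.Millisecond)
	}
	// Report throughput in rounds/sec, matching the tables above.
	b.ReportMetric(float64(b.N)/time.Since(start).Seconds(), "rounds/sec")
}
```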

Block Generator Results

Running the block generator test with SCENARIO = scenarios/config.allmixed.small.yml for 30s against both the original code and the new code, with 2 experiments per configuration, we have:

| Reset database? | Original rounds/30 sec | Pipelining rounds/30 sec | Pipelining vs. Original (%) |
| --- | --- | --- | --- |
| Reset | 301 | 400 | +33% |
| No Reset | 295 | 418 | +41% |

Local test network 5 minute sprint

I used the Justfile command

❯ just conduit-bootstrap-and-go 300

to bootstrap a local test network and run a PostgreSQL exporter against it for 300 seconds. I ran it several times against both the original pipeline and the new one. Here are the experimental results:

| Log Level | Reps | Original rounds/300 sec (logs/round) | Pipelining rounds/300 sec (logs/round) | Pipelining vs. Original (%) |
| --- | --- | --- | --- | --- |
| TRACE | 3 | 3718 (7.0) | 3509 (14.0) | -5.6% 😢 |
| INFO | 2 | 4578.5 (3.0) | 4423.5 (3.0) | -3.4% 😢 |
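
The only intended difference between the TRACE and INFO runs is the configured logrus level, which changes how many log lines are emitted per round. A minimal sketch of toggling that level is below; the logger wiring here is an assumption for illustration, not Conduit's actual setup.

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	logger := log.New()
	logger.SetLevel(log.InfoLevel) // use log.TraceLevel for the TRACE runs

	logger.Trace("fetched block")  // emitted only when the level is TRACE
	logger.Info("round committed") // emitted at INFO and more verbose levels
}
```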

On EC2 - CLASSIC vs. PIPELINING vs. 30 Second Timeout vs. FINAL

I ran catchup tests for the 4 versions of conduit named in this section's title (CLASSIC, PIPELINING, 30 Second Timeout, and FINAL).

Much more detailed results are available in a Google Sheets document, but the summary is:

SUMMARY

(summary image)
codecov[bot] commented 1 year ago

Codecov Report

Merging #128 (9918073) into master (442791a) will increase coverage by 4.32%. Report is 52 commits behind head on master. The diff coverage is 81.89%.

@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   67.66%   71.98%   +4.32%     
==========================================
  Files          32       36       +4     
  Lines        1976     2695     +719     
==========================================
+ Hits         1337     1940     +603     
- Misses        570      657      +87     
- Partials       69       98      +29     
| Files Changed | Coverage Δ |
| --- | --- |
| conduit/data/block_export_data.go | 100.00% <ø> (+92.30%) :arrow_up: |
| conduit/metrics/metrics.go | 100.00% <ø> (ø) |
| conduit/pipeline/metadata.go | 69.11% <ø> (ø) |
| conduit/plugins/config.go | 100.00% <ø> (ø) |
| ...duit/plugins/exporters/filewriter/file_exporter.go | 81.63% <ø> (-1.06%) :arrow_down: |
| conduit/plugins/importers/algod/metrics.go | 100.00% <ø> (ø) |
| ...gins/processors/filterprocessor/fields/searcher.go | 77.50% <ø> (ø) |
| ...ins/processors/filterprocessor/filter_processor.go | 83.82% <ø> (+3.54%) :arrow_up: |
| ...plugins/processors/filterprocessor/gen/generate.go | 34.28% <ø> (ø) |
| conduit/plugins/processors/noop/noop_processor.go | 64.70% <ø> (+6.81%) :arrow_up: |

... and 20 more


Eric-Warehime commented 1 year ago

If it's easy for you to run, it might be interesting to repeat the "Local test network 5 minute sprint" using different log levels, to show how logrus/logging in general is impacting processing times.

tzaffi commented 1 year ago

> If it's easy for you to run, it might be interesting to repeat the "Local test network 5 minute sprint" using different log levels, to show how logrus/logging in general is impacting processing times.

Can we add this as a task for #131?