CDCgov / IDWA

Intelligent Data Workflow Automation
Apache License 2.0
Ericbuckley/idwa 71 analyze rl performance #83

Closed ericbuckley closed 3 months ago

ericbuckley commented 3 months ago

Pull Request

Description

Adding additional tools and code optimizations that were needed to analyze performance bottlenecks in the record linkage API.

Related Issues

closes #71

Additional Notes

For more context, please take a look at the accompanying videos on how the optimizations were found.

seed data

A new script, seed_db.sh, was added to preload the database with existing records before running the performance tests. The performance tests can load these records themselves, but a seed file greatly shortens the run time when you want to test how the API performs with existing records in the MPI.

pgbadger

pgbadger was added to help analyze query locking and performance during the tests. To facilitate this analysis, some tooling was added and the postgres configuration was changed to capture the necessary data.
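The description above does not list which postgres settings were changed. As a rough sketch, the snippet below renders the logging parameters that pgbadger guidance commonly recommends into a postgresql.conf fragment. The helper name `render_pgbadger_conf` and the chosen values are assumptions for illustration, not the configuration used in this PR.

```python
# Hypothetical sketch: the values follow common pgbadger guidance and are
# NOT taken from this PR's actual postgres configuration.
PGBADGER_LOG_SETTINGS = {
    "log_min_duration_statement": "0",  # log every statement with its duration
    "log_line_prefix": "'%t [%p]: user=%u,db=%d,app=%a,client=%h '",
    "log_checkpoints": "on",
    "log_connections": "on",
    "log_disconnections": "on",
    "log_lock_waits": "on",             # surfaces waits on locks
    "log_temp_files": "0",              # log every temporary file
    "log_autovacuum_min_duration": "0",
    "lc_messages": "'C'",               # pgbadger parses English log messages
}


def render_pgbadger_conf(settings=PGBADGER_LOG_SETTINGS):
    """Return the settings as postgresql.conf lines, one `name = value` each."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(render_pgbadger_conf(), end="")
```

Logging every statement (`log_min_duration_statement = 0`) is expensive, so a configuration like this is usually reserved for test environments.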

code optimizations

Changes were made to dal.py, mpi.py and link.py, toggled by an environment variable flag, to test optimizations of potential bottlenecks. Additionally, the analyze_trace_timings.sh script was added to analyze performance test traces exported from Jaeger. The changes have also been put into a phdi PR for the DIBBs team to review.
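To illustrate the kind of analysis a trace-timing script can do, here is a minimal sketch that groups span durations by operation name. It assumes the traces were exported from Jaeger as JSON, with a top-level `data` list of traces whose `spans` carry `operationName` and a `duration` in microseconds. The function name `summarize_spans` and the sample operation names are hypothetical, and this is not the logic of analyze_trace_timings.sh itself.

```python
import statistics
from collections import defaultdict


def summarize_spans(export):
    """Group span durations by operation name.

    Returns {operation: {"count": int, "mean_ms": float, "max_ms": float}}.
    """
    durations_ms = defaultdict(list)
    for trace in export.get("data", []):
        for span in trace.get("spans", []):
            # Assumption: Jaeger reports span durations in microseconds.
            durations_ms[span["operationName"]].append(span["duration"] / 1000.0)
    return {
        op: {
            "count": len(values),
            "mean_ms": statistics.fmean(values),
            "max_ms": max(values),
        }
        for op, values in durations_ms.items()
    }


if __name__ == "__main__":
    # Tiny made-up export; a real one would be loaded with json.load().
    sample = {
        "data": [
            {"spans": [
                {"operationName": "link_record", "duration": 12000},
                {"operationName": "get_block_data", "duration": 9000},
            ]},
        ]
    }
    summary = summarize_spans(sample)
    # Slowest operations first.
    for op, stats in sorted(summary.items(), key=lambda kv: -kv[1]["mean_ms"]):
        print(f"{op}: n={stats['count']} mean={stats['mean_ms']:.2f}ms")
```

Sorting by mean duration puts the most expensive operations at the top, which is usually where a bottleneck search starts.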

custom API health check

Added the api_health_check.sh script to reduce the number of GET requests made to the API during the test.
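How api_health_check.sh limits requests is not described here. One plausible approach, sketched below under that assumption, is to poll only until the first successful response and then stop, so no health-check traffic is mixed into the test run. The names `http_probe` and `wait_until_healthy`, the defaults, and the URL in the usage comment are all hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, max_attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` until it succeeds and return the number of attempts used.

    Raises TimeoutError if the service never reports healthy.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt  # stop polling as soon as the API is up
        if attempt < max_attempts:
            sleep(delay)
    raise TimeoutError(f"service not healthy after {max_attempts} attempts")


# Example usage (URL is a placeholder, not the project's real endpoint):
# wait_until_healthy(lambda: http_probe("http://localhost:8080/"))
```

Passing the probe and the sleep function as arguments keeps the polling loop easy to test without a running service.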

synthea split option

Added an optional parameter for splitting Synthea encounters into multiple files. This lets us send multiple API requests for a patient when Synthea generates more than one encounter.
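As a sketch of what such a split might look like, the function below divides one FHIR-style bundle into one bundle per Encounter. It assumes each entry has a `resource` with a `resourceType`, and that related resources point at their encounter through an `encounter.reference` field. The function name is hypothetical, resources without an encounter reference (other than the Patient) are simply dropped, and the PR's actual implementation may differ.

```python
def _encounter_ref(entry):
    """Return the entry's encounter reference string, or None if absent."""
    enc = entry["resource"].get("encounter")
    return enc.get("reference") if isinstance(enc, dict) else None


def split_bundle_by_encounter(bundle):
    """Return one bundle per Encounter found in `bundle`.

    Each result holds the Patient entries, one Encounter, and the entries
    that reference that Encounter. Bundles with at most one Encounter are
    returned unchanged as a single-item list.
    """
    entries = bundle.get("entry", [])
    patients = [e for e in entries if e["resource"]["resourceType"] == "Patient"]
    encounters = [e for e in entries if e["resource"]["resourceType"] == "Encounter"]
    if len(encounters) <= 1:
        return [bundle]

    # Keep bundle-level fields such as "resourceType" and "type".
    header = {k: v for k, v in bundle.items() if k != "entry"}
    result = []
    for enc in encounters:
        # Accept both "urn:uuid:..." full URLs and "Encounter/<id>" references.
        refs = set()
        if enc.get("fullUrl"):
            refs.add(enc["fullUrl"])
        if enc["resource"].get("id"):
            refs.add("Encounter/" + enc["resource"]["id"])
        linked = [e for e in entries if _encounter_ref(e) in refs]
        result.append({**header, "entry": patients + [enc] + linked})
    return result
```

Each returned bundle can then be written to its own file and sent as a separate API request for the same patient.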

ericbuckley commented 3 months ago

LGTM!

As someone who is newer to some of these tools, I appreciate the clear documentation and resources throughout.

It might be helpful to explicitly define, somewhere in your docs, which "performance" you are testing in this work. In the context of record linkage, performance could mean linkage performance (how accurately the algorithm links two records together) or computational performance (run time and compute resources).

Good idea @alhayward, I'll update the README