ericbuckley commented 5 months ago

Pull Request

Description

Use Synthea a synthetic dataset for running performance tests.

Related Issues

closes #10

Additional Notes

This was much simpler than I thought, only needed a couple of changes here to set up the parameters and run the synthea command. My questions for the reviewers are, should we parameterize any more options for running performance tests?

Checklist

Please review and complete the following checklist before submitting your pull request:

[x] I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
[x] I have reviewed my changes to ensure they are clear, concise, and well-documented.
[x] I have updated the documentation, if applicable.
[ ] I have added or updated test cases to cover my changes, if applicable.
[ ] I have minimized the number of reviewers to include only those essential for the review.
[ ] I have notified teammates in the review thread to build awareness.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

[ ] The code follows best practices and conventions.
[ ] The changes implement the desired functionality or fix the reported issue.
[ ] The tests cover the new changes and pass successfully.
[ ] Any potential edge cases or error scenarios have been considered.

ericbuckley commented 5 months ago

cc @derekadombek @cbrinson-rise8

ericbuckley commented 5 months ago

@alhayward apologies for the delay, I missed this comment the other day.

Thanks for this! Couple questions:

* What parameters did the DIBBs team use in their final synthetic dataset before gaining LAC production data? We should solicit their recommendations/feedback. (Curious also if we have access to any previous LAC data.)

I'm not :100: sure on that, but I did use the same parameters that Marcelle and Daniel setup in their Synthea generate and load script. I'm assuming they used parameters similar to this, if not exactly the same, but I agree it be worth asking them to confirm in this PR (will do shortly).

* Timeboxing with the knowledge that production data may look significantly different, the question of whether we should include additional parameters when generating the synthetic patient population should be driven by: What is the aim of this performance testing? Some considerations:

  * Do we intend to vary any parameters beyond size? Are we interested in testing the scalability of the algorithm in any other characteristics of patient population (e.g., vary number of duplicates, vary population distributions of fields used for linking, etc.)?
  * If we are focused on **load testing**, tuning population parameters may be less important, as long as the distributions remain representative of the test population as N increases.

My initial intentions were to focus on database scalability, specifically the relationship of API latency (i.e. how long the client waits for a response from the API) to the number of Patient records.

  * By default, a synthetic dataset for MA will be generated. Should we specify LAC (and any other parameters the DIBBs team used) to replicate the synthetic data used previously?

I think it makes sense to add an option for the location, so I'll make that a variable. My initial research is telling me that there are only --location options for US States plus about 18 other countries. Did you see any documentation that would allow us to specify a locale smaller than a State?

  * Do we want to make use of the `exporter.split_records` or `exporter.split_records.duplicate_data` [properties](https://github.com/synthetichealth/synthea/blob/master/src/main/resources/synthea.properties#L12-L13), to simulate duplicate data?

I'm guessing the thought here is so that some of our API requests result in matches? If so, I think including that option makes sense.

  * Should we consider `generate.append_numbers_to_person_names = false`, since the DIBBs algorithm expects name strings with no digits?

I've run a few tests so far with that set to true and it seems to work fine. Did you read something that indicates the algorithm won't work with numbers in names?

Down to chat further!

ericbuckley commented 5 months ago

  * By default, a synthetic dataset for MA will be generated. Should we specify LAC (and any other parameters the DIBBs team used) to replicate the synthetic data used previously?
I think it makes sense to add an option for the location, so I'll make that a variable. My initial research is telling me that there are only --location options for US States plus about 18 other countries. Did you see any documentation that would allow us to specify a locale smaller than a State?

Ignore this, I see the path forward, it's not an option it's an argument!

ericbuckley commented 5 months ago

I'm going to close this PR. As I started working on #11 and #12, a fair amount changed. So I'm just going to combine #10, #11 & #12 into 1 PR.

CDCgov / IDWA

RLPT generate synthetic data #36

Pull Request

Description

Related Issues

Additional Notes

Checklist

Checklist for Reviewers