Medical-Event-Data-Standard / meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard
Apache License 2.0
16 stars 3 forks source link

Is shuffling the tasks in the OMOP Flat to MEDS ETL necessary? #25

Closed scottfleming closed 1 month ago

scottfleming commented 1 month ago

Is the shuffling here really necessary?

If so, can we not set the seed for the user?

I'm generally against silently setting the seed and instead am in favor of the user setting it on their end if they want determinism. Setting the seed silently like this can result in nasty bugs when a function like this is embedded in a larger program that assumes the random seed isn't reset every time the function is called.

https://github.com/Medical-Event-Data-Standard/meds_etl/blob/9ec2f3e20ddbc01ceab999ccc9f4782e7485e73a/src/meds_etl/omop.py#L726

EthanSteinberg commented 1 month ago

The shuffling is helpful for making sure that if the code fails, it fails quickly.

I agree this is invalid, we should be using our own Python RNG object to avoid contamination global seed state. My apologies for the bug.

EthanSteinberg commented 1 month ago

@scottfleming https://github.com/Medical-Event-Data-Standard/meds_etl/pull/26 is a PR that will probably work, but I don't have the time to test it now.