GEMINI-Medicine / Rgemini

A custom R package that provides a variety of functions to perform data analyses with GEMINI data
https://gemini-medicine.github.io/Rgemini/
Other
3 stars 0 forks source link

Using temp tables and EXISTS in functions that query the database #129

Open wankiewiczm-smh opened 1 month ago

wankiewiczm-smh commented 1 month ago

Enhancement for functions that query the database

In functions that query the database, it's more efficient to use a temp table along with EXISTS, rather than pasting a list of genc_id and querying the database.

The current method used in functions like readmission pastes a list of genc_id to query from the database, which can be slow and cause memory issues for postgres.

An example of what a more efficient query looks like is:

SELECT genc_id, diagnosis_code, diagnosis_type
FROM [dynamic_table_name]
WHERE EXISTS (
    SELECT 1
    FROM temp_readmission_data trd
    WHERE trd.genc_id = [dynamic_table_name].genc_id
);

This query is more efficient because the EXISTS clause only checks for the existence of a row, rather than needing to fetch and return the actual data.

Functions that can be updated to use more efficient queries include:

loffleraSMH commented 1 month ago

Just talked to @Trong-SMH about this too. He suggested to also apply chunking to query data separately for individual hospitals. Just keep in mind that this should be done by hospital_num so it will work for HPC users as well.