GEMINI-Medicine / Rgemini

A custom R package that provides a variety of functions to perform data analyses with GEMINI data
https://gemini-medicine.github.io/Rgemini/
Other
3 stars 0 forks source link

Migrate mlaps #40

Closed vaakesan-SMH closed 9 months ago

vaakesan-SMH commented 9 months ago

Needs to be checked against the code review checklist.

Handling of suppressWarnings for min/max(c()) Inf/-Inf should ideally also be improved.

vaakesan-SMH commented 9 months ago

Note that although I've added a utility function (return_hospital_field()) to find the hospital field name, this is a hard check on the admdad table.

Ideally we'd use find_db_tablename() to find the name of the admdad table (some HPC datacuts use a different name for this table). However, I tried this implementation and tested it in report_db_v3, but there were multiple matches (admdad and admdad_temp3333), which caused an error.

Since this function will be primarily used with report_db, I've fixed the table to admdad. However, I wonder if we could make a change to find_db_tablename() so we can take a particular table if multiple are returned (haven't thought about how this would actually work)? For now this solution will be fine because HPC users will not necessarily need loop_mlaps(). Let me know your thoughts @gemini-wenb @loffleraSMH

gemini-wenb commented 9 months ago

Good point. The current find_db_tablename doesn't handle multiple matches when the matches are unique from each other.

@loffleraSMH do you know whether /how often that HPC users would have multiple datasets with names starting on the same string (i.e. admdad and admdad_temp3333)?

I wonder if multiple matches only happens when we search on the entire non-HPC dbs (e.g. report_db_v3). In this case, we can add a flag to find_db_tablename() for internal use to always return admdad.

This wouldn't work if HPC users can have multiple admdads because we won't be able to hard code which one to return without knowing the specifics diff in their naming variations.

loffleraSMH commented 9 months ago

I checked with Kieran and he confirmed that the only alternative for admdad on HPC is admdad_subset, so I think the best option is to hard-code those 2 options in find_db_tablename (similarly to what we do for the lab and transfusion tables). I just pushed a commit with the change I'm suggesting. Does this solve the issue?

vaakesan-SMH commented 9 months ago

I checked with Kieran and he confirmed that the only alternative for admdad on HPC is admdad_subset, so I think the best option is to hard-code those 2 options in find_db_tablename (similarly to what we do for the lab and transfusion tables). I just pushed a commit with the change I'm suggesting. Does this solve the issue?

Yup that resolves it. Pushed a commit to use find_db_tablename() now.

vaakesan-SMH commented 9 months ago

Thanks @loffleraSMH for the review. Pushed a commit with all the suggestions implemented. Please have a look and let me know if anything was missed (and re-test the function if possible).

vaakesan-SMH commented 9 months ago

I saw your note about "Handling of suppressWarnings for min/max(c()) Inf/-Inf should ideally also be improved" in the pull request. Let me know if you want me to help out with/review any additional changes you might make to this. More generally, the function currently returns a lot of warnings. Could it be useful to suppress all the NAs introduced by coercion warnings to avoid too much clutter in the output?

Missed this comment in my commit. I think that all coercion to NA warnings remaining should be coming from laps_assign_test(), which more generally needs a bit of a cleanup/refactor - but I am leaning towards moving that piece to a separate issue. @gemini-wenb thoughts?

gemini-wenb commented 9 months ago

I saw your note about "Handling of suppressWarnings for min/max(c()) Inf/-Inf should ideally also be improved" in the pull request. Let me know if you want me to help out with/review any additional changes you might make to this. More generally, the function currently returns a lot of warnings. Could it be useful to suppress all the NAs introduced by coercion warnings to avoid too much clutter in the output?

Missed this comment in my commit. I think that all coercion to NA warnings remaining should be coming from laps_assign_test(), which more generally needs a bit of a cleanup/refactor - but I am leaning towards moving that piece to a separate issue. @gemini-wenb thoughts?

Yes, the handling of supresswarning for min/max in the original comment has been handled. Remaining warnings is not due to min/max. I also think we should move laps_assign_test() cleanup as a separate issue.

loffleraSMH commented 9 months ago

I saw your note about "Handling of suppressWarnings for min/max(c()) Inf/-Inf should ideally also be improved" in the pull request. Let me know if you want me to help out with/review any additional changes you might make to this. More generally, the function currently returns a lot of warnings. Could it be useful to suppress all the NAs introduced by coercion warnings to avoid too much clutter in the output?

Missed this comment in my commit. I think that all coercion to NA warnings remaining should be coming from laps_assign_test(), which more generally needs a bit of a cleanup/refactor - but I am leaning towards moving that piece to a separate issue. @gemini-wenb thoughts?

Yes, the handling of supresswarning for min/max in the original comment has been handled. Remaining warnings is not due to min/max. I also think we should move laps_assign_test() cleanup as a separate issue.

Agree this can be moved to a separate (low-priority) issue. I'll take another quick look at the changes you made and will let you know once it's ready to be merged.

vaakesan-SMH commented 9 months ago

Agree this can be moved to a separate (low-priority) issue. I'll take another quick look at the changes you made and will let you know once it's ready to be merged.

Created a separate issue #55

loffleraSMH commented 9 months ago

Thanks @loffleraSMH for the review. Pushed a commit with all the suggestions implemented. Please have a look and let me know if anything was missed (and re-test the function if possible).

Thanks @vaakesan-SMH for making all those changes. Looks great! I just made a small fix to return_hospital_field() (see above). The only other thing I noticed is that the function takes forever to run on drm_cleandb_v2_1_1 (likely due to FDW, it's fine on drm_cleandb_v1_2_2), but we can leave it as is for now as that shouldn't be an issue in the future.

I approved the pull request so feel free to go ahead and merge now.

vaakesan-SMH commented 9 months ago

Thanks @vaakesan-SMH for making all those changes. Looks great! I just made a small fix to return_hospital_field() (see above). The only other thing I noticed is that the function takes forever to run on drm_cleandb_v2_1_1 (likely due to FDW, it's fine on drm_cleandb_v1_2_2), but we can leave it as is for now as that shouldn't be an issue in the future.

I approved the pull request so feel free to go ahead and merge now.

Right that's a good point. We actually had this discussion with Trong. mlaps will still be calculated by the DB team and available in derived_variables, and will generally be run on report_db before creating a drm_cleandb. For HPC datacuts I believe the FDW isn't an issue, is that correct?

The only issue would be if someone wants to run a custom configuration of mlaps (component_wise or hours_after_admission) on their cohort, but really in that case they could just use the mlaps function directly instead of the loop_mlaps wrapper. The wrapper was basically designed for the DB team to run the function on the entirety of report_db.

loffleraSMH commented 9 months ago

Thanks @vaakesan-SMH for making all those changes. Looks great! I just made a small fix to return_hospital_field() (see above). The only other thing I noticed is that the function takes forever to run on drm_cleandb_v2_1_1 (likely due to FDW, it's fine on drm_cleandb_v1_2_2), but we can leave it as is for now as that shouldn't be an issue in the future. I approved the pull request so feel free to go ahead and merge now.

Right that's a good point. We actually had this discussion with Trong. mlaps will still be calculated by the DB team and available in derived_variables, and will generally be run on report_db before creating a drm_cleandb. For HPC datacuts I believe the FDW isn't an issue, is that correct?

The only issue would be if someone wants to run a custom configuration of mlaps (component_wise or hours_after_admission) on their cohort, but really in that case they could just use the mlaps function directly instead of the loop_mlaps wrapper. The wrapper was basically designed for the DB team to run the function on the entirety of report_db.

I'm honestly not sure about HPC (I think the datacuts use FDWs too?), but yes, great point about running mlaps function directly without wrapper when re-deriving mLAPS for smaller cohorts. So yes, I think that's good for now!