BRANCHlab / metasnf

Scalable subtyping with similarity network fusion
https://branchlab.github.io/metasnf/
Other
5 stars 0 forks source link

outcome data format in vignette #4

Closed apdlbalb closed 10 months ago

apdlbalb commented 10 months ago

Hey Prashanth!

It's a unclear in the vignette that there needs to be a UID column in your outcome list data frames.

Also! It seems that extend_om assumes your UID column is called subjectkey. generate_outcome_list allows me to specify the UID column, but extend_om doesn't run unless there is a column called subjectkey?

pvelayudhan commented 10 months ago

Thanks for flagging this! I never ended up reaching a way to keep track of which observation was which that I was particularly happy with.

There is no real reason for subjectkey to be a default - it was just the UID column in the dataset I was originally working with. I'll work on having the relevant functions always request a "UID" parameter to be specified and the vignette appropriately updated.

Some day, this code will get rewritten with OOP and distinguishing the UID from everything else won't need to be so ugly.

pvelayudhan commented 10 months ago

It looks like the usage of "subjectkey" as the UID column is pretty deeply intertwined with the rest of the code's functionality, so making any UID column name allowed without potentially botching things may be a more appropriate task to work on in tandem with an OOP rewrite. This is also a consequence of the package not really having any well written tests at this time. Lots to do!

I have noticed that generate_outcome_list does not have any particular parameter for UID. Initializing an outcome list with data that doesn't have subjectkey columns just yields an error:

library(metasnf)

# Replacing UID column name with "patient" instead of "subjectkey"
abcd_anxiety <- abcd_anxiety |> dplyr::rename("patient" = "subjectkey")
abcd_depress <- abcd_depress |> dplyr::rename("patient" = "subjectkey")

# Error:
outcome_list <- generate_outcome_list(
    list(abcd_anxiety, "anxiety", "ordinal"),
    list(abcd_depress, "depressed", "ordinal")
)

Were you perhaps referring to generate_data_list? Which does have a parameter old_uid. This parameter is meant to convert the data's original UID colum name into "subjectkey", but I realize this is not very clear from the function's documentation or the vignette.

For now, instead of allowing flexible UID column names I'm going to update the major functions and vignette to make it more clear that at this time, the UID column name will need to be "subjectkey". The points during the analyses where new data are introduced (generate_data_list and generate_outcome_list will have functionality to convert any non-subjectkey UIDs to being titled "subjectkey" and will be more explicit about this behaviour.

Allowing the user to have their own UID column names remain undisturbed will be a high priority issue but one that may take a bit more time.

pvelayudhan commented 10 months ago

I'm tentatively marking this as resolved after the last commit 965e143. It should be more explicit now what the generating functions request from the user: specify the old_uid parameter anytime data is being supplied that isn't already using "subjectkey" as the UID column. The sample data used to go through the vignette now no longer starts off with having "subjectkey" as the UID, so working through the vignette will give a more realistic experience outlining how to deal with data that doesn't have "subjectkey" as UID by default.

The time it would take to remove dependency on having one fixed UID column name seems a little bit excessive right now, and I think will be best implemented during an OOP rewrite (which may be soon?)

Let me know if you prefer an alternate solution!