castoredc / castoRedc

R wrapper to Castor EDC API
MIT License
4 stars 8 forks source link

getStudyData prints random information #27

Closed reiniervlinschoten closed 9 months ago

reiniervlinschoten commented 9 months ago
castor_api$getStudyData(example_study)

Prints

study_data <- castor_api$getStudyData(creds$example_study, repeating_data_instances = T, survey_instances = T)

https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/CastorData.R#L303

reiniervlinschoten commented 9 months ago

@martijnkersloot, the getStudyData function throws warnings when adjusting types to R. Specifically this piece of code: https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/CastorData.R#L718-L723

When fields with user missings (e.g. "##USER_MISSING_98##") are cast to numeric, they become NA. Do you want to: 1) Keep this transformation of USER_MISSING to NA, and keep the warning 2) Keep the transformation of USER_MISSING to NA, and drop the warning 3) Create a new category of missings in R to explicitly handle user missings? I use -99 to -95 in the Python package

reiniervlinschoten commented 9 months ago

@martijnkersloot , the decision was made to be verbose in some functions, and it is unclear to me why. For example here https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/CastorData.R#L755

and here https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/utils.R#L6-L9

For now I have commented this out, because I do not think this is relevant information for the user. Is that okay?

reiniervlinschoten commented 9 months ago

@martijnkersloot, repeating data instances and survey instances are now saved as an attribute of the studyData that is returned by getStudyData

study_data <- castor_api$getStudyData(creds$example_study, repeating_data_instances = T, survey_instances = T)

https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/CastorData.R#L649

https://github.com/castoredc/castoRedc/blob/c0bf07b929e220683491f2584db5de5df0a6a2f7/R/CastorData.R#L661

Moreover, all repeating data instances are combined in the same dataframe, meaning that each repeating data instance has a column for each field that exists in any repeating data instance. So if you have 5 different repeating data forms, each row has its own fields, but also columns for the fields of the four others. This is quite unclear and leads people to have to clean up this dataframe themselves. Is it okay if I refactor this to a list of lists, as I have done in the R wrapper for my Python package?

Resulting in

list("StudyData" = study_data_df,
       "repeatingInstanceData" = list("Name1" = repeating_instance_one_df, "Name2" = repeating_instance_two_df,
       "surveyInstanceData" = list("Name1" = survey_instance_one_df, "Name2" = survey_instance_two_df)
martijnkersloot commented 9 months ago

Hi @reiniervlinschoten,


When fields with user missings (e.g. "##USER_MISSING_98##") are cast to numeric, they become NA. Do you want to:

  1. Keep this transformation of USER_MISSING to NA, and keep the warning
  2. Keep the transformation of USER_MISSING to NA, and drop the warning
  3. Create a new category of missings in R to explicitly handle user missings? I use -99 to -95 in the Python package

How would option no. 3 look like? I would say that that is the most elegant solution.


For now I have commented this out, because I do not think this is relevant information for the user. Is that okay?

Yes, of course! I think these were only used for debugging purposes.


Is it okay if I refactor this to a list of lists, as I have done in the R wrapper for my Python package?

Absolutely, that is how it was supposed to work from the beginning! Please make sure that this is listed as a breaking change though.

reiniervlinschoten commented 9 months ago

Hi @reiniervlinschoten,

When fields with user missings (e.g. "##USER_MISSING_98##") are cast to numeric, they become NA. Do you want to:

  1. Keep this transformation of USER_MISSING to NA, and keep the warning
  2. Keep the transformation of USER_MISSING to NA, and drop the warning
  3. Create a new category of missings in R to explicitly handle user missings? I use -99 to -95 in the Python package

How would option no. 3 look like? I would say that that is the most elegant solution.

For numeric variables, they would just take have values -99 to -95, and I would add to the manual that those are the user missing values. For character values, I would just keep them as a level of the factor.

reiniervlinschoten commented 9 months ago

Is fixed in v2.0.0