ellessenne / comorbidity

An R package for computing comorbidity scores.
https://ellessenne.github.io/comorbidity/
GNU General Public License v3.0
78 stars 21 forks source link

Allow spaces in id column name #25

Closed salmasian closed 4 years ago

salmasian commented 4 years ago

The way #20 was fixed does not allow for the id column's name to contain spaces.

dt <- data.frame(
  `Enc ID` = 1234,
  DxCode = 'N390'
)

comorbidity(dt, id = 'Enc ID', code = 'DxCode', icd = 'icd10', score = 'charlson', assign0 = F)

What you get is an error message:

Error in comorbidity(dt, id = "Enc ID", code = "DxCode", icd = "icd10",  : 
  1 assertions failed:
 * Variable 'id': Must be a subset of {'Enc.ID','DxCode'}, but is {'Enc ID'}.
ellessenne commented 4 years ago

Hi, Enc ID is not really a valid name for a data.frame (although it can be used with backticks), hence the error. I am not sure this should be fixed, might be worth adding a note to the documentation though?

salmasian commented 4 years ago

Well, technically, R does allow variable names with spaces :) but I agree with you that this may not be worth "fixing" here as much as it is worth updating the documentation.

ellessenne commented 4 years ago

The issue with that is that data.frame is called by default with check.names = TRUE, which calls the make.names function to make syntactically valid names:

data.frame(
  `Enc ID` = 1234,
  DxCode = 'N390')
#>   Enc.ID DxCode
#> 1   1234   N390
data.frame(
  `Enc ID` = 1234,
  DxCode = 'N390',
  check.names = FALSE)
#>   Enc ID DxCode
#> 1   1234   N390

Created on 2019-12-13 by the reprex package (v0.3.0)

The error you see is that the name of the input data.frame is not matching the name passed to comorbidity. However:

dt <- data.frame(
  `Enc ID` = 1234,
  DxCode = 'N390',
  check.names = FALSE)
comorbidity(dt, id = 'Enc ID', code = 'DxCode', icd = 'icd10', score = 'charlson', assign0 = F)
#> Error in parse(text = x, keep.source = FALSE): <text>:1:5: unexpected symbol
#> 1: Enc ID
#>         ^

Created on 2019-12-13 by the reprex package (v0.3.0)

This still does not work.

I now call the make.names function on both input dataset and column names to force them to be syntactically valid. The result is the following:

dt <- data.frame(
  `Enc ID` = 1234,
  DxCode = 'N390'
)
comorbidity(dt, id = 'Enc ID', code = 'DxCode', icd = 'icd10', score = 'charlson', assign0 = F)
#> Warning: The input 'id' string has been modified by make.names(). See ?
#> make.names() for more details.
#>   Enc.ID ami chf pvd cevd dementia copd rheumd pud mld diab diabwc hp rend canc
#> 1   1234   0   0   0    0        0    0      0   0   0    0      0  0    0    0
#>   msld metacanc aids score index wscore windex
#> 1    0        0    0     0     0      0      0

Created on 2019-12-13 by the reprex package (v0.3.0)

It now works!

As you can see, when names are changed a warning is triggered:

#> Warning: The input 'id' string has been modified by make.names(). See ?
#> make.names() for more details.

What do you think about all of the above @salmasian?

salmasian commented 4 years ago

This is similar to the normal R behavior (i.e. in other places, R also cleans up variable names) so I think it is is a worthwhile effort and I support that. The warning should not be suppressed.

ellessenne commented 4 years ago

Sounds good to me - I am actually the one triggering the warning 😃

I am closing this, and will submit to CRAN by the end of the year. Thanks again @salmasian!