OHDSI / PhenotypeLibrary

A repository to store, organize and maintain the content of the OHDSI Phenotype library. OHDSI Forum post https://forums.ohdsi.org/t/ohdsi-phenotype-library-announcements/16910
https://ohdsi.github.io/PhenotypeLibrary/
Other
41 stars 16 forks source link

Some json files are not valid UTF8 #56

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

Are the json files in the Phenotype library supposed to be valid UTF-8? If not what encoding is used?

library(PhenotypeLibrary)

listPhenotypes()$cohortId |>
  getPlCohortDefinitionSet() |>
  dplyr::filter(!validUTF8(json)) 
#> # A tibble: 5 × 4
#>   cohortId cohortName                                              json    sql  
#>      <dbl> <chr>                                                   <chr>   <chr>
#> 1        6 [P] Fever (3Pe, 30Era)                                  "{\n\t… "CRE…
#> 2       16 [P] Exposure to SARS-Cov 2 and coronavirus (7Pe, 30Era) "{\n\t… "CRE…
#> 3       29 [W] Autoimmune condition (FP)                           "{\n\t… "CRE…
#> 4       64 [P] Flu-like symptoms (3P, 30Era)                       "{\n\t… "CRE…
#> 5       73 [W] Pregnancy (270P, 0Era)                              "{\n\t… "CRE…

Created on 2023-06-20 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.2 (2022-10-31) #> os macOS Big Sur ... 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/London #> date 2023-06-20 #> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0) #> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0) #> checkmate 2.2.0 2023-04-27 [1] CRAN (R 4.2.0) #> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0) #> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0) #> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0) #> dplyr 1.1.2 2023-04-20 [1] CRAN (R 4.2.0) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.2.0) #> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0) #> fs 1.6.2 2023-04-25 [1] CRAN (R 4.2.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> hms 1.1.3 2023-03-21 [1] CRAN (R 4.2.0) #> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.0) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.2.0) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> PhenotypeLibrary * 3.2.0 2023-06-20 [1] local #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> readr 2.1.4 2023-02-10 [1] CRAN (R 4.2.0) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0) #> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0) #> rmarkdown 2.22 2023-06-01 [1] CRAN (R 4.2.0) #> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> styler 1.10.1 2023-06-05 [1] CRAN (R 4.2.0) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0) #> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.2.0) #> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0) #> vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.2.0) #> vroom 1.6.3 2023-04-28 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.39 2023-04-20 [1] CRAN (R 4.2.0) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
gowthamrao commented 1 year ago

Hi @ablack3 i am not an expert on this topic, so i will need some help from you to understand.

All it is doing is sourcing the cohort json from a security enabled atlas (atlas-phenotype.ohdsi.org) using this script https://github.com/OHDSI/PhenotypeLibrary/blob/main/extras/UpdatePhenotypes.R

So whatever format the ROhdsiWebapi gives it from that atlas - is saved.

Maybe using this information, you can help me understand what is UTF-8

Zhimin-arya commented 2 months ago

I encountered the same issue as well when I run generateCohortSet through all the cohorts, the R console threw the error message: invalid UTF-8 input in readChar(). My workaround is to manually open the corresponding cohorts json file with txt editor and save it again with UTF-8 encoding. The current cohorts which have utf-8 encoding issue are: 1029, 1215, 1229, 235, 29, 504, 633, 645. The PhenotypeLibrary version I am using is ‘3.32.0’.