MoH-Malaysia / covid19-public

Official data on the COVID-19 epidemic in Malaysia. Powered by CPRC, CPRC Hospital System, MKAK, and MySejahtera.

linelist_deaths.csv: Possible data entry error for age #175

Closed: thamron closed this issue 2 years ago

thamron commented 2 years ago
library(tidyverse)
filename <- "https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv"
mydata <- read_csv(file = filename)
#> Rows: 21124 Columns: 11
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr  (2): vaxtype, state
#> dbl  (5): age, male, bid, malaysian, comorb
#> date (4): date, date_positive, date_dose1, date_dose2
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
mydata
#> # A tibble: 21,124 x 11
#>    date       date_positive date_dose1 date_dose2 vaxtype state        age  male
#>    <date>     <date>        <date>     <date>     <chr>   <chr>      <dbl> <dbl>
#>  1 2020-03-17 2020-03-12    NA         NA         <NA>    Johor         34     1
#>  2 2020-03-17 2020-03-14    NA         NA         <NA>    Sarawak       60     1
#>  3 2020-03-20 2020-03-11    NA         NA         <NA>    Sabah         58     1
#>  4 2020-03-21 2020-03-17    NA         NA         <NA>    Kelantan      69     1
#>  5 2020-03-21 2020-03-13    NA         NA         <NA>    Melaka        50     1
#>  6 2020-03-21 2020-03-21    NA         NA         <NA>    Sarawak       39     0
#>  7 2020-03-21 2020-03-14    NA         NA         <NA>    W.P. Kual~    57     1
#>  8 2020-03-22 2020-03-18    NA         NA         <NA>    Perlis        48     1
#>  9 2020-03-22 2020-03-14    NA         NA         <NA>    Pulau Pin~    73     1
#> 10 2020-03-22 2020-03-20    NA         NA         <NA>    Sarawak       80     0
#> # ... with 21,114 more rows, and 3 more variables: bid <dbl>, malaysian <dbl>,
#> #   comorb <dbl>
ggplot(data = mydata, aes(x = age)) +
  geom_histogram(binwidth = 5)

mydata %>% 
  count(age) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 108 x 3
#>      age     n      prop
#>    <dbl> <int>     <dbl>
#>  1    -1     2 0.0000947
#>  2     0     3 0.000142 
#>  3     1     6 0.000284 
#>  4     2     5 0.000237 
#>  5     3     3 0.000142 
#>  6     4     3 0.000142 
#>  7     5     2 0.0000947
#>  8     6     1 0.0000473
#>  9     7     2 0.0000947
#> 10     8     5 0.000237 
#> # ... with 98 more rows
mydata %>% 
  mdsr::skim(age)

Variable type: numeric

var       n  na   mean    sd  p0  p25  p50  p75  p100
age   21124   0  60.51  15.8  -1   50   61   72   130

Created on 2021-09-14 by the reprex package (v2.0.1)
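
The summary above shows a minimum age of -1 and a maximum of 130, which is what prompted this report. As a rough sketch of how such rows could be pulled out for review, the snippet below filters on plausibility bounds. The toy tibble, the bounds of 0 and 120, and the helper name `flag_implausible_age` are all assumptions made for illustration and are not part of the MoH data or documentation. It is also possible that -1 is a placeholder for an unknown age rather than a typo, which this sketch does not try to decide.

```r
library(dplyr)

# Hypothetical toy data standing in for linelist_deaths.csv.
# The rows are made up; only the column names mirror the real file.
toy <- tibble(
  date  = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04")),
  state = c("Johor", "Sabah", "Sarawak", "Melaka"),
  age   = c(34, -1, 130, 61)
)

# Assumed plausibility bounds, not an official rule.
min_age <- 0
max_age <- 120

# Hypothetical helper: return rows whose age is missing or outside the bounds.
flag_implausible_age <- function(data, lower = min_age, upper = max_age) {
  data %>%
    filter(is.na(age) | age < lower | age > upper)
}

suspect <- flag_implausible_age(toy)
suspect
```

On the toy data this keeps the two rows with ages -1 and 130. Pointing the same helper at `mydata` from the reprex should list the corresponding rows in the real file, assuming the `age` column is still numeric.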

MoH-Malaysia commented 2 years ago

Thank you - the issue is fixed.