Closed by mohabazzi 11 months ago
@mohabazzi something like this could be used to write these two additional tables to the projects bucket:
library(dplyr) # mutate(), select()
library(purrr) # pmap()
people <-
swecris::swecris_kth |>
mutate(ip = pmap(list(InvolvedPeople), swecris::parse_involved_people, .progress = TRUE)) |>
select(ProjectId, ip)
swecris_projects_people <-
people |> tidyr::unnest_longer("ip", simplify = TRUE) |> tidyr::unnest("ip")
scb_codes <-
swecris::swecris_kth |>
mutate(codes = pmap(list(Scbs), swecris::parse_scb_codes, .progress = TRUE)) |>
select(ProjectId, codes)
swecris_projects_codes <-
scb_codes |>
tidyr::unnest_longer("codes", indices_to = "id", simplify = TRUE) |>
tidyr::unnest("codes")
# unfortunately we cannot split the combined Swedish/English label like this,
# as ", " is used not only as the separator but also occurs inside the labels:
# scbs |> tidyr::separate("scb_sv_en", sep = ", ", into = c("scb_sv", "scb_en"))
# upload these two additional tables in parquet format
arrow::write_parquet(swecris_projects_codes, "/tmp/swecris_projects_codes.parquet")
arrow::write_parquet(swecris_projects_people, "/tmp/swecris_projects_people.parquet")
minioclient::mc_cp("/tmp/swecris_projects_codes.parquet", "kthb/projects", verbose = TRUE)
minioclient::mc_cp("/tmp/swecris_projects_people.parquet", "kthb/projects", verbose = TRUE)
Note that the swecris_kth data contains these two strangely encoded fields (Scbs and InvolvedPeople). They could be included as-is in a single CSV export, but the strings are difficult to use in an SQL query, for example when building a dashboard: one would have to unpack/unnest the structures on the fly in SQL, which is likely quite complicated, even more so since the fields are not JSON strings or anything like that. The two fields look like this:
> swecris::swecris_kth |> select(ProjectId, Scbs, InvolvedPeople)
# A tibble: 3,137 × 3
ProjectId Scbs InvolvedPeople
<chr> <chr> <chr>
1 2021-00157_VR "¤¤¤ 1: Naturvetenskap, Natural Sciences, 103: Fysik, Physical Sciences, 10399:… ¤¤¤52223¤Pete…
2 2022-00901_VR "¤¤¤ 3: Medicin och hälsovetenskap, Medical and Health Sciences , 301: Medicins… ¤¤¤65485¤Ceci…
3 2022-01079_VR "¤¤¤ 1: Naturvetenskap, Natural Sciences, 102: Data- och informationsvetenskap … ¤¤¤54477¤Erik…
4 2022-01624_Vinnova "¤¤¤ 1: Naturvetenskap, Natural Sciences, 106: Biologi (Medicinska tillämpninga… NA
5 2022-01905_VR "¤¤¤ 5: Samhällsvetenskap, Social Sciences, 502: Ekonomi och näringsliv, Econom… ¤¤¤53112¤Per …
6 2022-02413_Vinnova "¤¤¤ 2: Teknik, Engineering and Technology, 202: Elektroteknik och elektronik, … ¤¤¤46352¤Magn…
7 2022-02863_VR "¤¤¤ 1: Naturvetenskap, Natural Sciences, 103: Fysik, Physical Sciences, 10304:… ¤¤¤51079¤Toma…
8 2022-03138_VR "¤¤¤ 2: Teknik, Engineering and Technology, 202: Elektroteknik och elektronik, … ¤¤¤63763¤Sera…
9 2022-02871_VR "¤¤¤ 1: Naturvetenskap, Natural Sciences, 104: Kemi, Chemical Sciences, 10402: … ¤¤¤62885¤Nann…
10 2022-02855_VR "¤¤¤ 1: Naturvetenskap, Natural Sciences, 104: Kemi, Chemical Sciences, 10403: … ¤¤¤65735¤Eric…
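To illustrate why such a field is awkward to query directly, here is a minimal base R sketch that unpacks a string of this kind. The sample string and the assumed layout (each record starts with "¤¤¤", fields are separated by a single "¤", first an id and then a name) are guesses based on the truncated output above, not the documented format, and `parse_records()` is a hypothetical helper. For the real data, `swecris::parse_involved_people` is the function to use.

```r
# Made-up sample: the real InvolvedPeople strings are truncated in the output
# above, so the layout used here (id, then name) is an assumption.
x <- "¤¤¤1¤Ada Example¤¤¤2¤Bo Sample"

# Hypothetical helper, not part of the swecris package.
parse_records <- function(s, record_sep = "¤¤¤", field_sep = "¤") {
  empty <- data.frame(id = character(), name = character(),
                      stringsAsFactors = FALSE)
  if (is.na(s)) return(empty)
  # Split into records and drop the empty piece before the first separator.
  records <- strsplit(s, record_sep, fixed = TRUE)[[1]]
  records <- records[nzchar(records)]
  # Split each record into its fields.
  fields <- strsplit(records, field_sep, fixed = TRUE)
  data.frame(
    id = vapply(fields, `[`, character(1), 1),
    name = vapply(fields, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_records(x)
print(parsed)
```

Doing the same two-level split inside an SQL query would need the database's own string-splitting functions, which is why pre-parsed tables are easier to work with in a dashboard.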
The CSV is a flat format. A nested data frame with involved people can be produced like this:
This can be unnested like this:
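A minimal sketch of both steps on made-up data. It uses `tidyr::unnest()` directly rather than the `unnest_longer()` plus `unnest()` combination from the code at the top of this thread, and the column names `person_id` and `name` are placeholders, not necessarily what `swecris::parse_involved_people` returns.

```r
library(tibble)
library(tidyr)

# Toy stand-in for the nested result: one row per project, with a list-column
# `ip` holding one small tibble of people per project (NULL when none are listed).
# All ids and names are made up.
people <- tibble(
  ProjectId = c("P1", "P2", "P3"),
  ip = list(
    tibble(person_id = c("1", "2"), name = c("Ada Example", "Bo Sample")),
    tibble(person_id = "3", name = "Cy Placeholder"),
    NULL
  )
)

# Unnest to a flat table with one row per project/person pair.
# keep_empty = TRUE keeps projects without people as a single row of NAs.
projects_people <- unnest(people, ip, keep_empty = TRUE)
print(projects_people)
```

Whether to keep projects without people (`keep_empty = TRUE`) or drop them depends on how the table will be joined back to the projects later.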
So in a CSV this could be a separate table (swecris_projects_people.csv) or one could collapse multi-valued fields using a field separator.
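The two options could look roughly like this in base R. The data is made up, the "|" separator is an arbitrary choice (it must not occur in the values themselves), and the file is written to a temporary directory only for illustration.

```r
# Flat project/person table (made-up values), one row per pair.
projects_people <- data.frame(
  ProjectId = c("P1", "P1", "P2"),
  name = c("Ada Example", "Bo Sample", "Cy Placeholder"),
  stringsAsFactors = FALSE
)

# Option 1: write the flat table as its own CSV file.
out <- file.path(tempdir(), "swecris_projects_people.csv")
write.csv(projects_people, out, row.names = FALSE)

# Option 2: collapse to one row per project, joining the names with "|".
collapsed <- aggregate(name ~ ProjectId, data = projects_people,
                       FUN = paste, collapse = "|")
print(collapsed)
```

The separate table keeps each value in its own row, which is easier to join and filter in SQL. The collapsed form keeps everything in one file, but every consumer has to split the field again.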