KTH-Library / kthcorpus

R package to support workflows related to the corpus of publications from KTH
https://kth-library.github.io/kthcorpus
GNU Affero General Public License v3.0

scbs & peopleList fields are empty in swecris. why? #104

Closed mohabazzi closed 11 months ago

mskyttner commented 11 months ago

The CSV is a flat format. A nested data frame with involved people can be produced like this:

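library(dplyr)  # mutate(), select()
library(purrr)  # pmap()

# parse the encoded InvolvedPeople string of each project into a nested table
people <- 
  swecris::swecris_kth |> 
  mutate(ip = pmap(list(InvolvedPeople), swecris::parse_involved_people, .progress = TRUE)) |> 
  select(ProjectId, ip)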
> people
# A tibble: 3,137 × 2
   ProjectId          ip                
   <chr>              <list>            
 1 2021-00157_VR      <spc_tbl_ [1 × 6]>
 2 2022-00901_VR      <spc_tbl_ [1 × 6]>
 3 2022-01079_VR      <spc_tbl_ [2 × 6]>
 4 2022-01624_Vinnova <df [0 × 0]>      
 5 2022-01905_VR      <spc_tbl_ [3 × 6]>
 6 2022-02413_Vinnova <spc_tbl_ [1 × 6]>
 7 2022-02863_VR      <spc_tbl_ [1 × 6]>
 8 2022-03138_VR      <spc_tbl_ [1 × 6]>
 9 2022-02871_VR      <spc_tbl_ [1 × 6]>
10 2022-02855_VR      <spc_tbl_ [1 × 6]>
# ℹ 3,127 more rows
# ℹ Use `print(n = ...)` to see more rows

This can be unnested like this:

> people |> tidyr::unnest_longer("ip", simplify = TRUE)

# A tibble: 3,517 × 2
   ProjectId          ip$personId $fullName           $orcId              $roleEn                $roleSv      $gender
   <chr>                    <dbl> <chr>               <chr>               <chr>                  <chr>        <chr>  
 1 2021-00157_VR            52223 Peter Hedström      0000-0003-1102-4342 Principal Investigator Projektleda… Male   
 2 2022-00901_VR            65485 Cecilia Williams    0000-0002-0602-2062 Principal Investigator Projektleda… Female 
 3 2022-01079_VR            54477 Erik Fransén        0000-0003-0281-9450 Principal Investigator Projektleda… Male   
 4 2022-01079_VR            58070 Seth Grant          0000-0001-8732-8735 Co-Investigator        Projektdelt… Male   
 5 2022-01905_VR            53112 Per Högselius       0000-0001-9687-1940 Principal Investigator Projektleda… Male   
 6 2022-01905_VR            53506 Aliaksandr Piahanau NA                  Co-Investigator        Projektdelt… Male   
 7 2022-01905_VR            64819 Marta Musso         0000-0002-3728-3548 Co-Investigator        Projektdelt… Female 
 8 2022-02413_Vinnova       46352 Magnus Wiktorsson   NA                  Principal Investigator Projektleda… Male   
 9 2022-02863_VR            51079 Tomas Rosén         0000-0002-2346-7063 Principal Investigator Projektleda… Male   
10 2022-03138_VR            63763 Seraina Anne Dual   0000-0001-6867-8270 Principal Investigator Projektleda… Female 
# ℹ 3,507 more rows
# ℹ Use `print(n = ...)` to see more rows

So in a CSV this could be a separate table (swecris_projects_people.csv) or one could collapse multi-valued fields using a field separator.
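
As a rough illustration of the second option (collapsing multi-valued fields with a field separator), here is a minimal sketch on toy data. The table `people_long`, its made-up names and the choice of "|" as separator are assumptions for illustration only; the real unnested table has more columns.

```r
library(dplyr)

# Toy stand-in for the unnested people table (hypothetical values)
people_long <- tibble::tibble(
  ProjectId = c("P1", "P2", "P2"),
  fullName  = c("Ada Example", "Bo Example", "Cy Example"),
  roleEn    = c("Principal Investigator", "Principal Investigator", "Co-Investigator")
)

# Collapse to one row per project, joining multi-valued fields with "|"
people_flat <-
  people_long |>
  group_by(ProjectId) |>
  summarise(
    fullName = paste(fullName, collapse = "|"),
    roleEn   = paste(roleEn, collapse = "|"),
    .groups  = "drop"
  )

people_flat
```

The separator must be a character that never occurs inside the values themselves, otherwise the field cannot be split again reliably.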

mskyttner commented 11 months ago

@mohabazzi something like this could be used to write these two additional tables to the projects bucket:

library(dplyr)  # mutate(), select()
library(purrr)  # pmap()

people <- 
  swecris::swecris_kth |> 
  mutate(ip = pmap(list(InvolvedPeople), swecris::parse_involved_people, .progress = TRUE)) |> 
  select(ProjectId, ip)

swecris_projects_people <- 
  people |> tidyr::unnest_longer("ip", simplify = TRUE) |> tidyr::unnest("ip")

scb_codes <- 
  swecris::swecris_kth |> 
  mutate(codes = pmap(list(Scbs), swecris::parse_scb_codes, .progress = TRUE)) |> 
  select(ProjectId, codes)

swecris_projects_codes <- 
  scb_codes |> 
  # the list column created above is named "codes"
  tidyr::unnest_longer("codes", indices_to = "id", simplify = TRUE) |> 
  tidyr::unnest("codes")

# Unfortunately we cannot do the following, because ", " also occurs inside
# the values and not only as the separator between the two parts:
# scbs |> tidyr::separate("scb_sv_en", sep = ", ", into = c("scb_sv", "scb_en"))

# upload these two additional tables in parquet format

arrow::write_parquet(swecris_projects_codes, "/tmp/swecris_projects_codes.parquet")
arrow::write_parquet(swecris_projects_people, "/tmp/swecris_projects_people.parquet")

minioclient::mc_cp("/tmp/swecris_projects_codes.parquet", "kthb/projects", verbose = TRUE)
minioclient::mc_cp("/tmp/swecris_projects_people.parquet", "kthb/projects", verbose = TRUE)

mskyttner commented 11 months ago

Note that the swecris_kth data already contains these two fields in a strangely encoded form, so they could be part of a single CSV export. However, those strings are difficult to use in an SQL query, for example when building a dashboard: one would have to unpack/unnest the structures on the fly in SQL, which is likely quite complicated, even more so since the strings are not JSON or a similar standard format.

> swecris::swecris_kth |> select(ProjectId, Scbs, InvolvedPeople)
# A tibble: 3,137 × 3
   ProjectId          Scbs                                                                             InvolvedPeople
   <chr>              <chr>                                                                            <chr>         
 1 2021-00157_VR      "¤¤¤ 1: Naturvetenskap, Natural Sciences, 103: Fysik, Physical Sciences, 10399:… ¤¤¤52223¤Pete…
 2 2022-00901_VR      "¤¤¤ 3: Medicin och hälsovetenskap, Medical and Health Sciences , 301: Medicins… ¤¤¤65485¤Ceci…
 3 2022-01079_VR      "¤¤¤ 1: Naturvetenskap, Natural Sciences, 102: Data- och informationsvetenskap … ¤¤¤54477¤Erik…
 4 2022-01624_Vinnova "¤¤¤ 1: Naturvetenskap, Natural Sciences, 106: Biologi (Medicinska tillämpninga… NA            
 5 2022-01905_VR      "¤¤¤ 5: Samhällsvetenskap, Social Sciences, 502: Ekonomi och näringsliv, Econom… ¤¤¤53112¤Per …
 6 2022-02413_Vinnova "¤¤¤ 2: Teknik, Engineering and Technology, 202: Elektroteknik och elektronik, … ¤¤¤46352¤Magn…
 7 2022-02863_VR      "¤¤¤ 1: Naturvetenskap, Natural Sciences, 103: Fysik, Physical Sciences, 10304:… ¤¤¤51079¤Toma…
 8 2022-03138_VR      "¤¤¤ 2: Teknik, Engineering and Technology, 202: Elektroteknik och elektronik, … ¤¤¤63763¤Sera…
 9 2022-02871_VR      "¤¤¤ 1: Naturvetenskap, Natural Sciences, 104: Kemi, Chemical Sciences, 10402: … ¤¤¤62885¤Nann…
10 2022-02855_VR      "¤¤¤ 1: Naturvetenskap, Natural Sciences, 104: Kemi, Chemical Sciences, 10403: … ¤¤¤65735¤Eric…
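
If a single flat export with machine-readable multi-valued fields were preferred, one possible approach (not something the swecris package is known to provide) would be to serialize each nested table to a JSON string. The sketch below uses toy data and assumes the jsonlite package is available; the table `nested` and its values are hypothetical.

```r
library(dplyr)
library(purrr)

# Toy stand-in for a nested table like `people` above (hypothetical values)
nested <- tibble::tibble(
  ProjectId = c("P1", "P2"),
  ip = list(
    tibble::tibble(personId = 1, fullName = "Ada Example"),
    tibble::tibble(personId = c(2, 3), fullName = c("Bo Example", "Cy Example"))
  )
)

# Serialize each nested table to one JSON string per project
with_json <-
  nested |>
  mutate(ip_json = map_chr(ip, \(x) as.character(jsonlite::toJSON(x)))) |>
  select(ProjectId, ip_json)

# Round trip: parse the JSON string for the second project back into a data frame
jsonlite::fromJSON(with_json$ip_json[2])
```

Many SQL engines offer JSON functions that could unpack such a column, though separate tables, as written to parquet above, are probably still simpler to join in a dashboard query.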