Feat/35 pacta sector override table

AlexAxthelm commented 4 months ago

Add function for creating override table to replace pacta.data.preparation::factset_manual_pacta_sector_override

Closes #35

Overall this is working like I would hope, except for one major issue, noted below.

The general plan is that similar to #45 and #39 , we migrate a static table from pacta.data.preparation to this repo, where it get pulled from the database, and we can be aware of any changes in the DB that affect how we use those tables.

In the case of this PR, we're looking at the PACTA sector override table, which we use to manually reclassify companies to. different sector (FS says this company is a financial company, but for PACTA purposes, it's an Automotive).

The function added in this PR takes a table of company names and PACTA sectors, and then interrogates the sym_entity table in the database to get the Factset entity ID, for use in override_sector_by_factset_entity_id(). Nothing too complicated.

The tricky bit here that I wasn't counting on when I went down this path is that I've found 2 companies in that table that have the same name, but different entity IDs

FDS=> select
  md5(factset_entity_id),
  entity_proper_name 
FROM fds.sym_v1_sym_entity
  WHERE entity_proper_name = 'China Huaneng Group Treasury Management (Hong Kong) Ltd.'
;
--               md5                |                    entity_proper_name
------------------------------------+----------------------------------------------------------
--049bb887d39df3d0f6b41f450f0186d3 | China Huaneng Group Treasury Management (Hong Kong) Ltd.
--94861e52fe0e1b3cc7a86d14c71d643b | China Huaneng Group Treasury Management (Hong Kong) Ltd.
--(2 rows)

Now I'm not sure how to proceed with this.

Some ideas I've had:

Include an (optional) md5 of the entity ID to facilitate matching (probably the least bad option)
Only match on md5 of entity ID (complicates debugging, makes the process much slower)
override both companies (probably dangerous)
use the literal entity ID (need to clear with FS)

I'm passing a file over Teams with the latest results. the code below needs the invisible removed, but you'll see that the only difference in relevant content is the company flagged above

suppressPackageStartupMessages(library(dplyr))
foo <- readRDS("timestamp-20230123T000000Z_pulled-20000101T000001_factset_manual_sector_override.rds")
new <- foo %>%
  select(factset_entity_id, pacta_sector_override) %>%
  arrange(factset_entity_id)
pdp <- pacta.data.preparation::factset_manual_pacta_sector_override %>%
  select(factset_entity_id, pacta_sector_override) %>%
  arrange(factset_entity_id)
invisible(waldo::compare(
  x = new, x_arg = "new",
  y = pdp, y_arg = "pdp"
))

^{Created on 2024-02-15 with reprex v2.0.2}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2 #> system aarch64, darwin23.0.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2024-02-15 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.2) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.2) #> blob 1.2.4 2023-03-17 [1] CRAN (R 4.3.1) #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2) #> countrycode 1.5.0 2023-05-30 [1] CRAN (R 4.3.1) #> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1) #> data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.2) #> DBI 1.2.1 2024-01-12 [1] CRAN (R 4.3.2) #> dbplyr 2.4.0 2023-10-26 [1] CRAN (R 4.3.1) #> diffobj 0.3.5 2021-10-05 [1] CRAN (R 4.3.2) #> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.2) #> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2) #> dtplyr 1.3.1 2023-03-22 [1] CRAN (R 4.3.1) #> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2) #> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.2) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.2) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2) #> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1) #> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2) #> jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2) #> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.3.2) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.2) #> pacta.data.preparation 0.1.0.9000 2024-01-18 [1] Github (RMI-PACTA/pacta.data.preparation@9a091f5) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1) #> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> RPostgres 1.4.6 2023-10-22 [1] CRAN (R 4.3.2) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) #> tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.3.2) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1) #> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2) #> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2) #> waldo 0.5.2 2023-11-02 [1] CRAN (R 4.3.2) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2) #> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2) #> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

github-actions[bot] commented 4 months ago

Docker image from this PR (2d843b6705160a718dbf1a08315bb88b85b25d91) created

docker pull ghcr.io/rmi-pacta/workflow.factset:pr47

AlexAxthelm commented 4 months ago

commits worth noting:

be884102a3d8e44df2e1561be0f151ca8c6c7227 fixes company names with bad encodings. Looks like the CSV was exported from Excel
3a2f97891521775a5f48b956d2ce73534cb981a2 Update company names for entity IDs that have changed names in FS database.

cjyetman commented 4 months ago

Only match on md5 of entity ID (complicates debugging, makes the process much slower)

seems like the best option, and because our list of overrides should be small the performance hit shouldn't be too bad, no?

edit: after looking at the diff, I guess the list isn't so small 😱

cjyetman commented 4 months ago

be88410 fixes company names with bad encodings. Looks like the CSV was exported from Excel

LOL... welcome to hell. Glad it's fixed already.

cjyetman commented 4 months ago

3a2f978 Update company names for entity IDs that have changed names in FS database.

These seem reasonable, except for "Hitachi Metals, Ltd." -> "Proterial Ltd.", but who really knows 🤷🏻?

AlexAxthelm commented 4 months ago

Only match on md5 of entity ID (complicates debugging, makes the process much slower)

seems like the best option, and because our list of overrides should be small the performance hit shouldn't be too bad, no?

The big issue here is that the best option would be to change the filter line to something along the line of

dplyr::filter(md5(.data[["entity_proper_name"]]) %in% company_hashes)

which is hugely punative, since that turns a simple equality match on the table, (which is bad enough on a ~20M row table) to a (match time + hash time) * rowcount, and loses all the nice hidden optimizations that postgres is doing under the hood. The name column doesn't have an index, but querying a column on disk will always be faster than trying to query one that's being computed on the fly.

AlexAxthelm commented 4 months ago

After thinking about it a bit today, I'm thinking the best way forward is the first option I laid out, where there's an NULLable column in the overrides table that includes the MD5 of the entity ID we want to keep

pacta_override_mapping <- tibble::tribble(
  ~entity_proper_name, ~pacta_sector, ~entity_md5,
  "Abu Dhabi National Energy Co. PJSC", "Power", NULL,
  "China Huaneng Group Treasury Management (Hong Kong) Ltd.", "Power", "049bb887d39df3d0f6b41f450f0186d3"
  "Hrvatska Elektroprivreda dd", "Power", NULL
)

Then adding some checks for uniqueness in the output table and throw warnings if there are multiple companies found in the table that don't have a disambiguating md5.

This keeps the computation time similar to what it currently is (not too bad), keeps the maintenance burden about the same (We can maintain just the human-readable strings for most cases), but still gets us the behavior we want (a mechanism for uniqueness)

@cjyetman @jdhoffa If no objections, I'll work on implementing this

AlexAxthelm commented 4 months ago

@cjyetman @jdhoffa New file is available via teams, and this is ready for review.

suppressPackageStartupMessages(library(dplyr))
foo <- readRDS("timestamp-20230123T000000Z_pulled-20000101T000002_factset_manual_sector_override.rds")
new <- foo %>%
  select(factset_entity_id, pacta_sector_override) %>%
  arrange(factset_entity_id)
pdp <- pacta.data.preparation::factset_manual_pacta_sector_override %>%
  select(factset_entity_id, pacta_sector_override) %>%
  arrange(factset_entity_id)
waldo::compare(
  x = new, x_arg = "new",
  y = pdp, y_arg = "pdp"
)
#> `class(new)`: "tbl_df"      "tbl"    "data.frame"             
#> `class(pdp)`: "spec_tbl_df" "tbl_df" "tbl"        "data.frame"
#> 
#> `attr(new, 'problems')` is absent
#> `attr(pdp, 'problems')` is a pointer
#> 
#> `attr(new, 'spec')` is absent
#> `attr(pdp, 'spec')` is an S3 object of class <col_spec>, a list

^{Created on 2024-02-16 with reprex v2.0.2}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2 #> system aarch64, darwin23.0.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2024-02-16 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.2) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.2) #> blob 1.2.4 2023-03-17 [1] CRAN (R 4.3.1) #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2) #> countrycode 1.5.0 2023-05-30 [1] CRAN (R 4.3.1) #> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1) #> data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.2) #> DBI 1.2.1 2024-01-12 [1] CRAN (R 4.3.2) #> dbplyr 2.4.0 2023-10-26 [1] CRAN (R 4.3.1) #> diffobj 0.3.5 2021-10-05 [1] CRAN (R 4.3.2) #> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.2) #> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2) #> dtplyr 1.3.1 2023-03-22 [1] CRAN (R 4.3.1) #> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2) #> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.2) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.2) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2) #> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1) #> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2) #> jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2) #> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.3.2) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.2) #> pacta.data.preparation 0.1.0.9000 2024-01-18 [1] Github (RMI-PACTA/pacta.data.preparation@9a091f5) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1) #> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> RPostgres 1.4.6 2023-10-22 [1] CRAN (R 4.3.2) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) #> tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.3.2) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1) #> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2) #> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2) #> waldo 0.5.2 2023-11-02 [1] CRAN (R 4.3.2) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2) #> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2) #> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

RMI-PACTA / workflow.factset

Feat/35 pacta sector override table #47