Closed AlexAxthelm closed 4 months ago
Docker image from this PR (2d843b6705160a718dbf1a08315bb88b85b25d91) created
docker pull ghcr.io/rmi-pacta/workflow.factset:pr47
commits worth noting:
- Only match on md5 of entity ID (complicates debugging, makes the process much slower)
seems like the best option, and because our list of overrides should be small the performance hit shouldn't be too bad, no?
edit: after looking at the diff, I guess the list isn't so small 😱
- be88410 fixes company names with bad encodings. Looks like the CSV was exported from Excel
LOL... welcome to hell. Glad it's fixed already.
- 3a2f978 Update company names for entity IDs that have changed names in FS database.
These seem reasonable, except for "Hitachi Metals, Ltd." -> "Proterial Ltd.", but who really knows 🤷🏻?
- Only match on md5 of entity ID (complicates debugging, makes the process much slower)
seems like the best option, and because our list of overrides should be small the performance hit shouldn't be too bad, no?
The big issue here is that the best option would be to change the filter line to something along the line of
dplyr::filter(md5(.data[["entity_proper_name"]]) %in% company_hashes)
which is hugely punative, since that turns a simple equality match on the table, (which is bad enough on a ~20M row table) to a (match time + hash time) * rowcount
, and loses all the nice hidden optimizations that postgres is doing under the hood. The name column doesn't have an index, but querying a column on disk will always be faster than trying to query one that's being computed on the fly.
After thinking about it a bit today, I'm thinking the best way forward is the first option I laid out, where there's an NULL
able column in the overrides table that includes the MD5 of the entity ID we want to keep
pacta_override_mapping <- tibble::tribble(
~entity_proper_name, ~pacta_sector, ~entity_md5,
"Abu Dhabi National Energy Co. PJSC", "Power", NULL,
"China Huaneng Group Treasury Management (Hong Kong) Ltd.", "Power", "049bb887d39df3d0f6b41f450f0186d3"
"Hrvatska Elektroprivreda dd", "Power", NULL
)
Then adding some checks for uniqueness in the output table and throw warnings if there are multiple companies found in the table that don't have a disambiguating md5.
This keeps the computation time similar to what it currently is (not too bad), keeps the maintenance burden about the same (We can maintain just the human-readable strings for most cases), but still gets us the behavior we want (a mechanism for uniqueness)
@cjyetman @jdhoffa If no objections, I'll work on implementing this
@cjyetman @jdhoffa New file is available via teams, and this is ready for review.
suppressPackageStartupMessages(library(dplyr))
foo <- readRDS("timestamp-20230123T000000Z_pulled-20000101T000002_factset_manual_sector_override.rds")
new <- foo %>%
select(factset_entity_id, pacta_sector_override) %>%
arrange(factset_entity_id)
pdp <- pacta.data.preparation::factset_manual_pacta_sector_override %>%
select(factset_entity_id, pacta_sector_override) %>%
arrange(factset_entity_id)
waldo::compare(
x = new, x_arg = "new",
y = pdp, y_arg = "pdp"
)
#> `class(new)`: "tbl_df" "tbl" "data.frame"
#> `class(pdp)`: "spec_tbl_df" "tbl_df" "tbl" "data.frame"
#>
#> `attr(new, 'problems')` is absent
#> `attr(pdp, 'problems')` is a pointer
#>
#> `attr(new, 'spec')` is absent
#> `attr(pdp, 'spec')` is an S3 object of class <col_spec>, a list
Created on 2024-02-16 with reprex v2.0.2
Add function for creating override table to replace
pacta.data.preparation::factset_manual_pacta_sector_override
Closes #35
Overall this is working like I would hope, except for one major issue, noted below.
The general plan is that similar to #45 and #39 , we migrate a static table from
pacta.data.preparation
to this repo, where it get pulled from the database, and we can be aware of any changes in the DB that affect how we use those tables.In the case of this PR, we're looking at the PACTA sector override table, which we use to manually reclassify companies to. different sector (FS says this company is a financial company, but for PACTA purposes, it's an Automotive).
The function added in this PR takes a table of company names and PACTA sectors, and then interrogates the
sym_entity
table in the database to get the Factset entity ID, for use inoverride_sector_by_factset_entity_id()
. Nothing too complicated.The tricky bit here that I wasn't counting on when I went down this path is that I've found 2 companies in that table that have the same name, but different entity IDs
Now I'm not sure how to proceed with this.
Some ideas I've had:
I'm passing a file over Teams with the latest results. the code below needs the
invisible
removed, but you'll see that the only difference in relevant content is the company flagged aboveCreated on 2024-02-15 with reprex v2.0.2
Session info
``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2 #> system aarch64, darwin23.0.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2024-02-15 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.2) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.2) #> blob 1.2.4 2023-03-17 [1] CRAN (R 4.3.1) #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2) #> countrycode 1.5.0 2023-05-30 [1] CRAN (R 4.3.1) #> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1) #> data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.2) #> DBI 1.2.1 2024-01-12 [1] CRAN (R 4.3.2) #> dbplyr 2.4.0 2023-10-26 [1] CRAN (R 4.3.1) #> diffobj 0.3.5 2021-10-05 [1] CRAN (R 4.3.2) #> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.2) #> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2) #> dtplyr 1.3.1 2023-03-22 [1] CRAN (R 4.3.1) #> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2) #> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.2) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.2) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2) #> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1) #> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2) #> jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2) #> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.3.2) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.2) #> pacta.data.preparation 0.1.0.9000 2024-01-18 [1] Github (RMI-PACTA/pacta.data.preparation@9a091f5) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1) #> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> RPostgres 1.4.6 2023-10-22 [1] CRAN (R 4.3.2) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) #> tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.3.2) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1) #> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2) #> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2) #> waldo 0.5.2 2023-11-02 [1] CRAN (R 4.3.2) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2) #> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2) #> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```