RMI-PACTA / workflow.factset


consider not exporting rows where `issue_type` == `NA` in `get_financial_data()` #37

Open cjyetman opened 7 months ago

cjyetman commented 7 months ago

Given the first two lines of `prepare_financial_data()`: https://github.com/RMI-PACTA/pacta.data.preparation/blob/530af7a154224b6303dcf87d869f550562ac553f/R/prepare_financial_data.R#L17-L18

Maybe the best optimization would be to not export from FactSet any rows where `issue_type` is `NA`?

```r
factset_financial_data_path <- "~/Desktop/dataprep_docker/inputs/timestamp-20221231T000000Z_pulled-20240207T161053Z_factset_financial_data.rds"
financial_data <- readRDS(factset_financial_data_path)

nrow(financial_data)
#> [1] 29852045
format(object.size(financial_data), units = "auto", standard = "SI")
#> [1] "5.5 GB"

financial_data_no_na <- dplyr::filter(financial_data, !is.na(issue_type))
nrow(financial_data_no_na)
#> [1] 1582972
format(object.size(financial_data_no_na), units = "auto", standard = "SI")
#> [1] "305.3 MB"
```
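
In the numbers above, only about 5% of rows (1,582,972 of 29,852,045) have a non-`NA` `issue_type`. A minimal sketch of what dropping the rest before export could look like follows. It assumes the export step has the data as a data frame just before writing it out. The helper name `drop_na_issue_type()`, the toy data, and the output path are all hypothetical and not part of `workflow.factset`. Base R subsetting is used here only to keep the example free of dependencies; `dplyr::filter()` as in the snippet above would work the same way.

```r
# Hypothetical helper (not an existing workflow.factset function):
# keep only rows whose `issue_type` is not NA.
drop_na_issue_type <- function(data) {
  stopifnot(is.data.frame(data), "issue_type" %in% names(data))
  data[!is.na(data$issue_type), , drop = FALSE]
}

# Toy stand-in for the FactSet pull; the values and the `isin` column
# are made up for illustration.
toy_financial_data <- data.frame(
  isin = c("AAA", "BBB", "CCC", "DDD"),
  issue_type = c("EQ", NA, "PF", NA),
  stringsAsFactors = FALSE
)

filtered <- drop_na_issue_type(toy_financial_data)
nrow(filtered)
#> [1] 2

# Write the smaller object (a temporary path is used here as a placeholder).
out_path <- tempfile(fileext = ".rds")
saveRDS(filtered, out_path)
```

Filtering at export time rather than in `prepare_financial_data()` would mean the large `NA` rows never reach disk, at the cost of the exported file no longer being a complete copy of the FactSet pull.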

Originally posted by @cjyetman in https://github.com/RMI-PACTA/pacta.data.preparation/issues/334#issuecomment-1942059663

AlexAxthelm commented 7 months ago

Yeah. That makes sense to me.