Teal-Insights / r-wbids

R package to access and analyze World Bank International Debt Statistics (IDS)

https://teal-insights.github.io/r-wbids/

Other

3 stars 0 forks source link

Add ids_bulk* functions #29

Closed christophscheuch closed 1 week ago

christophscheuch commented 1 week ago

I started out with a draft for the functions first to initialize the review because I'm not 100% where this should go.

ids_bulk_series() fetches the 495 series out of the 572 from ids_list_series() which are apparently available through bulk download - is this as expected?
ids_bulk_files() fetches the 6 available xlsx files.
ids_bulk() downloads the xlsx files and reads them in. Afaik there is no direct way to read xlsx from web, so we need to readxl package. I added it as an optional package because not everybody will use the bulk option. Users get prompted to install readxl before the download begins.

Unfortunately, I cannot read in the xlsx files because my Mac has only 16GB RAM (Error: vector memory limit of 16.0 Gb reached, see mem.maxVSize()). Am I missing something about reading in the xlsx files? @Reu-hub @t-emery

Reu-hub commented 1 week ago

The downloaded Excel files have extra columns prefixed as column 1, column 2, ... which are not necessary. In order to read each file please use:

choosing columns

col_names <- readxl::read_excel(path = "file_name.xlsx", n_max = 0) column_range <- data.frame(names = names(col_names)) %>% mutate(type = ifelse(test = (grepl(pattern = "[0:9]",x = names)), yes = "numeric", no = "text")) %>% dplyr::filter(!grepl(pattern = "column",x = names,ignore.case = TRUE))

reading files

ids_data = readxl::read_excel( path = "file_name.xlsx", range = readxl::cell_cols(1:nrow(column_range)), col_types = column_range$type )

christophscheuch commented 1 week ago

The downloaded Excel files have extra columns prefixed as column 1, column 2, ... which are not necessary. In order to read each file please use:

choosing columns

col_names <- readxl::read_excel(path = "file_name.xlsx", n_max = 0) column_range <- data.frame(names = names(col_names)) %>% mutate(type = ifelse(test = (grepl(pattern = "[0:9]",x = names)), yes = "numeric", no = "text")) %>% dplyr::filter(!grepl(pattern = "column",x = names,ignore.case = TRUE))

reading files

ids_data = readxl::read_excel( path = "file_name.xlsx", range = readxl::cell_cols(1:nrow(column_range)), col_types = column_range$type )

Amazing, thank you this worked! I'll finalize the PR later today.

chriscarrollsmith commented 1 week ago

I checked the licenses of readxl and its dependencies, and fortunately none of them use GPL3.

chriscarrollsmith commented 1 week ago

I see a few issues here, including a few missing imports and timeouts for large downloads. I will just go ahead and push a fix.

chriscarrollsmith commented 1 week ago

Sorry this is taking so long. I built tests and am trying to get them to pass.

chriscarrollsmith commented 1 week ago

@christophscheuch, I added interactive warnings for file sizes >100 MB. Also added a customizable timeout and function arguments to customize these behaviors. Refactored ids_bulk and added a unit test suite. Take a look and see what you think about these design decisions from a UX perspective.

Almost all my test cases are passing, but when I try to read_excel my chosen test file in the final, end-to-end test, I get the following error:

"Error in utils::unzip(zip_path, list = TRUE): error -103 with zipfile in unzGetCurrentFileInfo"

This happens with your last version as well as my updated version, so it may be something unexpected about the file rather than something wrong with the code. I have been testing with "https://datacatalogfiles.worldbank.org/ddh-published/0038015/DR0092201/A_D.xlsx?versionId=2024-10-08T01:35:39.3946879Z". What file were you using in your local testing to verify that the code works?

chriscarrollsmith commented 1 week ago

Ah, I think I solved it! You have to use "wb" mode in download.file with an Excel file, or it will fail for Windows users.

chriscarrollsmith commented 1 week ago

Still one test not passing, but pushed because I'm calling it a night!

christophscheuch commented 1 week ago

Still one test not passing, but pushed because I'm calling it a night!

Great work! I feel inspired to add more tests like these in my own packages in the future :)

I only had to add test_path to debug the last test and I added validation for column names and types.

chriscarrollsmith commented 1 week ago

I only had to add test_path to debug the last test and I added validation for column names and types.

Yes! Thanks for catching that! These are the sorts of things you overlook when you're exhausted because you keep telling yourself you'll go to bed as soon as the tests pass, and they keep not passing. :')

Great work! I feel inspired to add more tests like these in my own packages in the future :)

I know a lot of people who consider test-writing a waste of time, but for me the feedback loop between tests and code is essential to my process. Not only does it save me having to manually check all the edge cases I want the code to handle, but it also documents what those edge cases are and how the code is supposed to handle them, which makes it a lot easier to write docs later!