coderaanalytics / econdatar

R package for uploading and downloading data to/from www.econdata.co.za
MIT License

Password-free large scale tidy data access through R API? #6

Closed SebKrantz closed 1 year ago

SebKrantz commented 1 year ago

Hello, first of all thanks a lot for your effort in pulling together a lot of macroeconomic data for South Africa. I am, however, struggling to use either your R package or the SDMX/Clojure API via GET requests (from Python, mainly due to my insufficient understanding of such APIs and the lack of examples, I guess). Regarding the R package, I want to make 3 points:

All of this combined, I have to conclude that for my purposes (nowcasting GDP in South Africa using large amounts of public data), the API is useless. This is really a pity, as the underlying software engineering and your data model appear to be pretty comprehensive and powerful. As someone with a fair amount of experience in creating and using R API packages, I would like to end by providing some inspiration from my own recent work and the work of others.

I of course don't know what your intentions or business model are, but given your data collection efforts and ostensible commitment to open source software, the creation of powerful open APIs along these lines could be a game changer for everyone doing economic research in South Africa.

byrongibby commented 1 year ago

Hi Sebastian

Just a quick response while I digest the detail of what you have laid out:

It requires login using 3rd party software (XQuartz on Mac) on each request.

My understanding when writing the package was that tcl/tk is cross-platform and should pretty much just work everywhere; this at least seems to be true on Windows. In any case, you can get around this by setting the environment variable ECONDATA_CREDENTIALS="your_username;your_password", for example in your .Renviron file.
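As a minimal sketch of how such a variable can be set and read from plain R: the "username;password" format comes from the comment above, but the helper read_credentials() is hypothetical and only illustrates the idea; how econdatar actually parses the variable internally is an assumption here.

```r
# Hypothetical helper (not part of econdatar): reads "username;password"
# from an environment variable and splits it at the semicolon.
read_credentials <- function(var = "ECONDATA_CREDENTIALS") {
  raw <- Sys.getenv(var, unset = NA_character_)
  if (is.na(raw) || !nzchar(raw)) {
    stop("Environment variable ", var, " is not set")
  }
  parts <- strsplit(raw, ";", fixed = TRUE)[[1]]
  if (length(parts) != 2L) {
    stop(var, " must have the form 'username;password'")
  }
  list(username = parts[1], password = parts[2])
}

# For the current session only; a line such as
#   ECONDATA_CREDENTIALS="your_username;your_password"
# in ~/.Renviron makes the setting persistent across sessions.
Sys.setenv(ECONDATA_CREDENTIALS = "your_username;your_password")
creds <- read_credentials()
```

Keeping the credentials in .Renviron rather than in a script also avoids committing them to version control by accident.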

The package does seem to attach other packages alongside (httr, jsonlite, xml2, tcltk), and does not work if not attached, i.e. econdatar::read_econdata() does not work without a prior call to library(econdatar)

Excuse my ignorance on this point. What is the fix for this, ensuring that each function calls require() for the necessary dependencies?

The package does not seem to provide tidy tabular data

Indeed. I didn't want to include any more dependencies than necessary, but if it would be useful I can create an econdata2tibble() function?

Thanks for taking the time to give feedback.

byrongibby commented 1 year ago

On this point

I of course don't know what your intentions or business model are, but given your data collection efforts and ostensible commitment to open source software, the creation of powerful open APIs along these lines could be a game changer for everyone doing economic research in South Africa.

That is the direction we are aiming in, but of course we are a small company and the platform is foremost meant to serve our and our clients' needs, so it is a slow and often tedious process to get the platform up to a standard that meets user expectations.

That said, a number of academics have expressed interest in a standard data set for the purposes of GDP nowcasting along the lines of (AL)FRED's macroeconomic data set. We are seeking funding for such a project. It would be useful to let us know what variables/sources of data you are considering in your research if possible.

byrongibby commented 1 year ago

I would also like to add that the underlying API is likely to undergo some breaking changes during the course of this year so I would ask that you hold off on using it directly for the time being. While we are looking at simplifying the econdatar interface, I don't expect there will be any breaking changes in the package.

byrongibby commented 1 year ago

I forgot to mention there is a branch that excludes the tcl/tk package that is maintained for environments that don't have graphical capabilities:

require(remotes); install_github('coderaanalytics/econdatar', ref='docker', INSTALL_opts = c('--no-help', '--no-html'))

SebKrantz commented 1 year ago

Hi Sebastian

Just a quick response while I digest the detail of what you have laid out:

It requires login using 3rd party software (XQuartz on Mac) on each request.

My understanding when writing the package was that tcl/tk is cross-platform and should pretty much just work everywhere, this seems to at least be true on Windows. In any case you can get around this by setting the following environment variable ECONDATA_CREDENTIALS="your_username;your_password" in your .Renviron file for example.

Thanks for the reply. I was unaware of the environment variable option. This makes things better already, but on my system it still starts XQuartz in the background. Ideally you want something that only uses R, e.g. requiring an API key that is sent to the server. For example, if I wanted to put this into production, I would like to install the package on a Linux server with only R pre-installed.

The package does seem to attach other packages alongside (httr, jsonlite, xml2, tcltk), and does not work if not attached, i.e. econdatar::read_econdata() does not work without a prior call to library(econdatar)

Excuse my ignorance on this point. What is the fix for this, ensuring that each function calls require() for the necessary dependencies?

So in general, you should not load or attach other packages from package code at all. The best way is to use importFrom directives in the NAMESPACE file, e.g. importFrom("jsonlite", "fromJSON"). Another option is to call jsonlite::fromJSON in the package code, but this incurs the overhead of ::. You can see here for some examples of importFrom. In general, R CMD check should warn about this; have you run R CMD check on the package yet?
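To illustrate the two options, here is a minimal sketch. It uses tools::file_ext only because the tools package ships with R and is not attached by default, so the snippet runs without extra installs; in econdatar the imports would presumably come from httr, jsonlite and xml2. The function name file_extension() is hypothetical.

```r
# Option 1 (inside a package): declare the import once in NAMESPACE, e.g.
#   importFrom(tools, file_ext)
# or, with roxygen2, put the tag
#   #' @importFrom tools file_ext
# above a function, and then call file_ext() unqualified in package code.

# Option 2: qualify the call with ::, which works without attaching
# the other package via library() or require().
file_extension <- function(path) {
  tools::file_ext(path)
}

file_extension("gdp_data.csv")
```

Either way, the dependency also has to be listed under Imports in the DESCRIPTION file so that it is installed together with the package.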

The package does not seem to provide tidy tabular data

Indeed. I didn't want to include any more dependencies than necessary, but if it would be useful I can create an econdata2tibble() function?

Thanks for taking the time to give feedback.

By tidy data I don't mean a particular object (like data.frame, data.table or tibble), but data that is organized in rows and columns and thus easy to process further. The definition of tidy data is that each variable/characteristic is a column and each observational unit is a row. In other words, tidy APIs return a single data-frame-like object. How this is best organized depends on the purpose. In my APIs I provide the option of both long and wide data return (plus reshaping functions to be called ex post). DBnomics provides a standardized long output format which can accommodate very heterogeneous data. Some examples:

###########
# Tidy APIs
###########

# Most APIs provide long format data: perfect for mixed frequency and ggplotting
POE_tidy_long <- ugatsdb::get_data("MOF_POE", wide = FALSE) # Performance of the economy dataset
tail(POE_tidy_long)
#>          Date Series                                         Label    Value
#>        <Date> <char>                                        <char>    <num>
#> 1: 2022-07-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 54.48846
#> 2: 2022-08-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 49.78164
#> 3: 2022-09-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 50.94488
#> 4: 2022-10-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 46.36710
#> 5: 2022-11-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 47.02394
#> 6: 2022-12-01 TB_COD Trade Balance with Congo (D.R.) (US$ Million) 48.52958

# The default in my APIs is wide data: ready for time series analysis
POE_tidy_wide <- ugatsdb::get_data("MOF_POE")
tail(POE_tidy_wide)[, 1:10]
#> Key: <Date>
#>          Date   CPI_16 CPI_CORE_16 CPI_FOOD_16 CPI_EFU_16 CPI_09 CPI_CORE_09 CPI_FOOD_09 CPI_EFU_09    INF_16
#>        <Date>    <num>       <num>       <num>      <num>  <num>       <num>       <num>      <num>     <num>
#> 1: 2022-08-01 123.2445    122.7355    115.4992   141.0762     NA          NA          NA         NA  9.003460
#> 2: 2022-09-01 125.1029    124.2073    121.9511   141.2153     NA          NA          NA         NA  9.993973
#> 3: 2022-10-01 126.1317    125.1876    126.3131   138.0120     NA          NA          NA         NA 10.710683
#> 4: 2022-11-01 126.2145    125.4758    126.1997   135.7387     NA          NA          NA         NA 10.583868
#> 5: 2022-12-01 126.3849    125.7720    126.6209   133.9239     NA          NA          NA         NA 10.232480
#> 6: 2023-01-01 126.1925    125.6436    126.1484   133.3166     NA          NA          NA         NA 10.404955

# This format has variable labels attached (can be displayed in the RStudio viewer)
collapse::namlab(POE_tidy_wide)[1:10, ]
#>       Variable                                                                                                Label
#> 1         Date                                                                                                 <NA>
#> 2       CPI_16                        Consumer Price Index (CPI), (2016/17 = 100): All Items Index (Weight = 10000)
#> 3  CPI_CORE_16                            Consumer Price Index (CPI), (2016/17 = 100): Core Index (Weight = 8396.2)
#> 4  CPI_FOOD_16    Consumer Price Index (CPI), (2016/17 = 100): Food Crops and Related Items Index (Weight = 951.05)
#> 5   CPI_EFU_16 Consumer Price Index (CPI), (2016/17 = 100): Energy Fuel and Utilities (EFU) Index (Weight = 652.75)
#> 6       CPI_09                         Consumer Price Index (CPI), (2009/10 = 100): All Items Index (Weight = 1000)
#> 7  CPI_CORE_09                                  Consumer Price Index (CPI), (2009/10 = 100): Core (Weight = 823.94)
#> 8  CPI_FOOD_09           Consumer Price Index (CPI), (2009/10 = 100): Food Crops and Related Items (Weight = 101.6)
#> 9   CPI_EFU_09              Consumer Price Index (CPI), (2009/10 = 100): Energy Fuel and Utilities (Weight = 74.46)
#> 10      INF_16                                   Annual (YoY) Inflation (2016/17): All Items Index (Weight = 10000)

# E.g.
attr(POE_tidy_wide$CPI_16, "label")
#> [1] "Consumer Price Index (CPI), (2016/17 = 100): All Items Index (Weight = 10000)"

# Examples with panel data
# Getting growth and inflation for the EAC countries
africamonitor::am_data(ctry = c("UGA", "KEN", "TZA", "RWA", "BDI", "SSD"),
        series = c("NGDP_RPCH", "PCPIPCH"))
#> Key: <ISO3, Date>
#>        ISO3       Date NGDP_RPCH PCPIPCH
#>      <char>     <Date>     <num>   <num>
#>   1:    BDI 1980-01-01    -6.825   1.200
#>   2:    BDI 1981-01-01    12.164  12.167
#>   3:    BDI 1982-01-01    -1.054   5.868
#>   4:    BDI 1983-01-01     3.715   8.151
#>   5:    BDI 1984-01-01     0.155  14.317
#>  ---                                    
#> 252:    UGA 2023-01-01     5.898   6.397
#> 253:    UGA 2024-01-01     5.996   5.707
#> 254:    UGA 2025-01-01     7.500   4.987
#> 255:    UGA 2026-01-01     6.815   5.023
#> 256:    UGA 2027-01-01     6.805   5.000

africamonitor::am_data(ctry = c("UGA", "KEN", "TZA", "RWA", "BDI", "SSD"),
                       series = c("NGDP_RPCH", "PCPIPCH"), wide = FALSE)
#>        ISO3       Date    Series                                                   Label  Value
#>      <char>     <Date>    <char>                                                  <char>  <num>
#>   1:    BDI 1980-01-01 NGDP_RPCH Gross Domestic Product, Constant Prices: Percent Change -6.825
#>   2:    BDI 1981-01-01 NGDP_RPCH Gross Domestic Product, Constant Prices: Percent Change 12.164
#>   3:    BDI 1982-01-01 NGDP_RPCH Gross Domestic Product, Constant Prices: Percent Change -1.054
#>   4:    BDI 1983-01-01 NGDP_RPCH Gross Domestic Product, Constant Prices: Percent Change  3.715
#>   5:    BDI 1984-01-01 NGDP_RPCH Gross Domestic Product, Constant Prices: Percent Change  0.155
#>  ---                                                                                           
#> 508:    UGA 2023-01-01   PCPIPCH      Inflation, Average Consumer Prices: Percent Change  6.397
#> 509:    UGA 2024-01-01   PCPIPCH      Inflation, Average Consumer Prices: Percent Change  5.707
#> 510:    UGA 2025-01-01   PCPIPCH      Inflation, Average Consumer Prices: Percent Change  4.987
#> 511:    UGA 2026-01-01   PCPIPCH      Inflation, Average Consumer Prices: Percent Change  5.023
#> 512:    UGA 2027-01-01   PCPIPCH      Inflation, Average Consumer Prices: Percent Change  5.000

# DBnomics: standardized long format allowing very heterogeneous data
rdbnomics::rdb("WB", "WGI", dimensions = list(country = "ZAF"))
#>      @frequency country dataset_code                    dataset_name frequency          indexed_at  indicator original_period original_value     period provider_code      series_code
#>          <char>  <char>       <char>                          <char>    <char>              <POSc>     <char>          <char>         <char>     <Date>        <char>           <char>
#>   1:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09     CC.EST            1996    0.732927382 1996-01-01            WB     A-CC.EST-ZAF
#>   2:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09     CC.EST            1998   0.6388086081 1998-01-01            WB     A-CC.EST-ZAF
#>   3:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09     CC.EST            2000   0.5502697229 2000-01-01            WB     A-CC.EST-ZAF
#>   4:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09     CC.EST            2002   0.3329015076 2002-01-01            WB     A-CC.EST-ZAF
#>   5:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09     CC.EST            2003   0.2755408287 2003-01-01            WB     A-CC.EST-ZAF
#>  ---                                                                                                                                                                                  
#> 824:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09 VA.STD.ERR            2017    0.119611159 2017-01-01            WB A-VA.STD.ERR-ZAF
#> 825:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09 VA.STD.ERR            2018   0.1232636794 2018-01-01            WB A-VA.STD.ERR-ZAF
#> 826:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09 VA.STD.ERR            2019   0.1180535406 2019-01-01            WB A-VA.STD.ERR-ZAF
#> 827:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09 VA.STD.ERR            2020   0.1205894873 2020-01-01            WB A-VA.STD.ERR-ZAF
#> 828:     annual     ZAF          WGI Worldwide Governance Indicators         A 2022-11-08 12:49:09 VA.STD.ERR            2021   0.1212761179 2021-01-01            WB A-VA.STD.ERR-ZAF
#>                                                           series_name     value
#>                                                                <char>     <num>
#>   1:          Annual – Control of Corruption: Estimate – South Africa 0.7329274
#>   2:          Annual – Control of Corruption: Estimate – South Africa 0.6388086
#>   3:          Annual – Control of Corruption: Estimate – South Africa 0.5502697
#>   4:          Annual – Control of Corruption: Estimate – South Africa 0.3329015
#>   5:          Annual – Control of Corruption: Estimate – South Africa 0.2755408
#>  ---                                                                           
#> 824: Annual – Voice and Accountability: Standard Error – South Africa 0.1196112
#> 825: Annual – Voice and Accountability: Standard Error – South Africa 0.1232637
#> 826: Annual – Voice and Accountability: Standard Error – South Africa 0.1180535
#> 827: Annual – Voice and Accountability: Standard Error – South Africa 0.1205895
#> 828: Annual – Voice and Accountability: Standard Error – South Africa 0.1212761

Created on 2023-02-28 by the reprex package (v2.0.1)
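For readers without these packages installed, the long/wide distinction can be sketched in base R. The numbers and series names below are made up purely for illustration and do not come from any of the APIs above.

```r
# Long format: one row per (Date, Series) pair, values in a single column.
long <- data.frame(
  Date   = as.Date(rep(c("2022-01-01", "2022-02-01"), times = 2)),
  Series = rep(c("CPI", "GDP"), each = 2),
  Value  = c(100, 101, 50, 52)
)

# Wide format: one row per Date, one column per series.
wide <- reshape(long, idvar = "Date", timevar = "Series", direction = "wide")

# reshape() names the new columns "Value.CPI" and "Value.GDP";
# strip the prefix to get plain series codes as column names.
names(wide) <- sub("^Value\\.", "", names(wide))

print(wide)
```

Long output copes better with mixed frequencies and heterogeneous series, while wide output is usually more convenient for time series modelling, which is why offering both can be useful.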

SebKrantz commented 1 year ago

That said, a number of academics have expressed interest in a standard data set for the purposes of GDP nowcasting along the lines of (AL)FRED's macroeconomic data set. We are seeking funding for such a project. It would be useful to let us know what variables/sources of data you are considering in your research if possible.

I can share the variables with you once I have a model running and a working paper.

byrongibby commented 1 year ago

The issues you have with the tcl/tk package and with econdatar attaching packages should be resolved in release 1.1.6:

install_github("coderaanalytics/econdatar", ref = "1.1.6")

You can comment on issues #7 and #8; otherwise I will assume I can close them.

I will add a feature for read_econdata to produce "tidy" data sometime, hopefully soon.

SebKrantz commented 1 year ago

Thanks a lot, this looks good so far. I can confirm that in 1.1.6 the package works without being attached. I have no further comments on the other issues if you can resolve them.

A final comment on the API is that it would be good if it could be self-contained, i.e. there could be functions to get information about the available data and codes (like the datasources(), datasets() and series() functions in ugatsdb, or rdb_providers(), rdb_datasets() and rdb_dimensions() in rdbnomics), so that the data can be explored and retrieved without having to browse your website. It should also be very clear from the package documentation how the codes returned by such auxiliary functions are to be used in the main package function (get_data() in ugatsdb, rdb() in rdbnomics, or read_econdata() in your case). As far as I understand it, read_econdata() can be used to query both data and metadata, but this complicates things, and it is already necessary to know something about your data model to even get the metadata. In general, it is better to have a package with a few streamlined functions where it is clear how they are to be used and what they return than a single function that does everything.
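A rough sketch of what such a split could look like is below. Everything in it is hypothetical: the function names list_datasets(), list_series() and get_series() are not part of econdatar, and the in-memory tables with made-up values merely stand in for calls to the server.

```r
# Toy metadata catalog standing in for a metadata endpoint (made-up entries).
catalog <- data.frame(
  dataset = c("PRICES", "PRICES", "OUTPUT"),
  series  = c("CPI_ALL", "CPI_CORE", "GDP_REAL"),
  label   = c("All items CPI", "Core CPI", "Real GDP"),
  stringsAsFactors = FALSE
)

# Toy observations standing in for a data endpoint (made-up values).
observations <- data.frame(
  dataset = c("PRICES", "PRICES", "PRICES", "OUTPUT"),
  series  = c("CPI_ALL", "CPI_ALL", "CPI_CORE", "GDP_REAL"),
  date    = as.Date(c("2022-01-01", "2022-02-01", "2022-01-01", "2022-01-01")),
  value   = c(100, 101, 99, 50),
  stringsAsFactors = FALSE
)

# Discovery step 1: which datasets exist?
list_datasets <- function() unique(catalog$dataset)

# Discovery step 2: which series does a dataset contain, with labels?
list_series <- function(dataset) {
  catalog[catalog$dataset == dataset, c("series", "label")]
}

# Retrieval: takes exactly the codes returned by the two functions above
# and returns a single long-format data frame.
get_series <- function(dataset, series) {
  keep <- observations$dataset == dataset & observations$series %in% series
  observations[keep, c("date", "series", "value")]
}

list_datasets()
list_series("PRICES")
get_series("PRICES", "CPI_ALL")
```

The point of the design is that each function has one job and one return type, and the output of the discovery functions feeds directly into the arguments of the retrieval function.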

Regarding package development, I suggest setting up some form of CI on GitHub soon using usethis::use_github_action("check-standard") (R CMD check), and usethis::use_github_action("test-coverage") once you have started to add tests using the testthat package. Also feel free to reach out to me with other issues/questions. You can close this issue or leave it open as you require.

byrongibby commented 1 year ago

A final comment on the API is that it would be good if the API could be self contained.

Only a small part of the web API is included in econdatar. I would like a one-to-one mapping, but that development is not that high on the priority list at the moment. I will add it to the list of issues.

Regarding package development, I suggest setting...

A more sophisticated development workflow will follow, but resources for this particular project are somewhat constrained for the moment.

Thanks for all the feedback. Please feel free to post issues as necessary while you are using EconData.