coderaanalytics / econdatar

R package for uploading and downloading data to/from www.econdata.co.za
MIT License
6 stars 2 forks source link

Tidy API options, making package R CMD Check compliant + CI Workflow #13

Closed SebKrantz closed 1 year ago

SebKrantz commented 1 year ago

Hi, I implemented the tidy API along the lines you envisioned in #9. I also did minor things in the code and documentation that were flagged by R CMD Check, but that do not change in any way the way this package works. The tidy API option is also completely non-destructive. The package could now be sent to CRAN, although they will probably still tell you to improve some things. One thing to wonder about is the LICENSE. I saw you put GPL-2 in the DESCRIPTION, but the actual license is EPL-2. This is a non-standard open source license and CRAN might complain about that. Also note that the standard way now to specify GPL-2 in the DESCRIPTION is GPL-3 + file LICENSE (as GPL-2 is stronger than GPL-3, R-CMD check warns about other specifications). I have also done this for you, noting of course that your LICENSE file is EPL-2, which may however be also be understood as additional requirements on top of GPL-3.

byrongibby commented 1 year ago

Hi, thanks for the contribution, it is much appreciated. Please give me some time to review the pull request.

byrongibby commented 1 year ago

Thanks again for the useful contribution. I am happy with most of the design decisions you made and I think the documentation is great.

Some issues that I feel need to be addressed:

  1. There is no guarantee that any time series will have label metadata or a source code (the BA data is an example).
  2. The switches you employ to create more human friendly metadata are inherently brittle as they do not link dynamically to the data in the data catalogue. It is fairly straight forward to retrieve this information from the data registry, but it comes with some considerations: The non-tidy/original data format is intentionally left in a more raw form to aid in the serialisation/deserialisation of data between the web server and R and so I don't want to change that "upstream" of your functions. read_econdata can reach out to the data catalogue and provide this data to econdata_tidy, but then econdata_tidy can't stand on its own. Or econdata_tidy can reach out to the server itself for the information. The latter option has the disadvantage of making econdata_tidy impure - it is strongly preferred that the IO be kept seperate from the logic wherever possible. In either case the interface you have designed will need modification. I will write the code that fetches the information from the server and you can incorporate it once you have decided on a way forward.
  3. Any observation may have in addition to TIME_PERIOD and OBS_VALUE other attributes (extra columns), for example OBS_STATUS indicating whether the value is an actual or a forecast observation (I can provide an example if you need).
  4. I would prefer that you use "datakey" instead of "code" and "period" instead of "date" as identifiers for those concepts so as to provide consistency across the EconData ecosystem.
  5. Does your code, specifically the wide format, allow for series of different frequencies in a single data set? See the FISCAL_SECTOR data set as an example.
SebKrantz commented 1 year ago

Thanks for the feedback. Regarding the points you have raised:

  1. Sure I can also make the label conditional. It would be better though to alter the database and ensure all series have a label. I also wondered about the banking data: how am I supposed to understand what those series are without labels?

  2. I’d rather avoid the switches all together and return the metadata as is, rather than doing additional calls to the web-server inside econdata_tidy(). Reason being it just slows down things, is complicated and impractical. Note however that ‘x’ is passed in again at the end of the switch, so if codes are not matched in the switch the code is just returned. I think it should be possible to create a switch the accommodates most common cases in your database, and as it is now it doesn’t break anything if the code is not matched in the switch. Making the data more readable in that way is worth it in my opinion, but I wouldn’t go for additional API calls just to accommodate 100% of cases, rather get rid of it and put an additional function in the package to get an updated reference for metadata codes.

  3. This additional data values are accommodated if wide = FALSE, as you may verify. I should still add a tolower() somewhere to ensure that they are also lowercase. E.g OBS_VALUE -> obs_value in line with the rest of the tidy data. I see now sensible way to include them in the wide format. It could be done of course, creating two or more columns for each series, but that would be very impractical for time series analysis. So I would keep it as is.

  4. Sure.

  5. Yes, wide data allows mixed frequency. Lower frequency series will have gaps, according to the way your database codes them (first or last day of the period).

On Sun 12. Mar 2023 at 14:47, Byron @.***> wrote:

Thanks again for the useful contribution. I am happy with most of the design decisions you made and I think the documentation is great.

Some issues that I feel need to be addressed:

  1. There is no guarantee that any time series will have label metadata or a source code (the BA data is an example).
  2. The switches you employ to create more human friendly metadata are inherently brittle as they do not link dynamically to the data in the data catalogue. It is fairly straight forward to retrieve this information from the data registry, but it comes with some considerations: The non-tidy/original data format is intentionally left in a more raw form to aid in the serialisation/deserialisation of data between the web server and R and so I don't want to change that "upstream" of your functions. read_econdata can reach out to the data catalogue and provide this data to econdata_tidy, but then econdata_tidy can't stand on its own. Or econdata_tidy can reach out to the server itself for the information. The latter option has the disadvantage of making econdata_tidy impure - it is strongly preferred that the IO be kept seperate from the logic wherever possible. In either case the interface you have designed will need modification. I will write the code that fetches the information from the server and you can incorporate it once you have decided on a way forward.
  3. Any observation may have in addition to TIME_PERIOD and OBS_VALUE other attributes (extra columns), for example OBS_STATUS indicating whether the value is an actual or a forecast observation (I can provide an example if you need).
  4. I would prefer that you use "datakey" instead of "code" and "period" instead of "date" as identifiers for those concepts so as to provide consistency across the EconData ecosystem.
  5. Does your code, specifically the wide format, allow for series of different frequencies in a single data set? See the FISCAL_SECTOR data set as an example.

— Reply to this email directly, view it on GitHub https://github.com/coderaanalytics/econdatar/pull/13#issuecomment-1465188744, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALOT4UVCLASEJEEUG5TOLHTW3XAWTANCNFSM6AAAAAAVVHAL5I . You are receiving this because you authored the thread.Message ID: @.***>

byrongibby commented 1 year ago
  1. The individual parts that make up the "datakey" describe the series. This can be unwieldy at times which is why I created labels for most of the series, it's just too much of a time drain to go back and add labels where they don't exist for the time being. For the BA data looking at the source tables is the easiest way to orient yourself to the data, the fact that that causes the data registry not to be self-contained is an issue, but not one I am going to fix short-term.

  2. I have created a function that fetches and compiles the metadata into a structure that I think should be easily incorporated into your code (created a PR on your fork). I take your point of it slowing things down, but if you make it optional I think that provides the best of both. I am not keen on creating switches for common cases as this makes it difficult for the user to reason about and adds to the maintenance surface of the code base (most common use cases are not static).

  3. Agreed.

  4. Sounds perfect.

byrongibby commented 1 year ago

Going to merge and make the changes still required (if any).