2DegreesInvesting / tiltPlot

Plots for the TILT project
https://2degreesinvesting.github.io/tiltPlot/
GNU General Public License v3.0
0 stars 0 forks source link

Review code for workflow Bundesbank #37

Closed AnneSchoenauer closed 10 months ago

AnneSchoenauer commented 10 months ago

Dear Mauro,

Please find here a project that describes the work flow with the Bundesbank. You can see a toy data set with which you can run the two R-scripts for the financial_analysis and for the financial_visualisation. I wanted to show you how much I need to prepare the dataset to be able to do some financial_visualisations. It would also be great to see if the code that I created (with the help of Linda) is correct.

Workflow_with_toy_data.zip

I would like you to do the following:

  1. Please open the R-Script_work_flow_financial_analysis and the R-Script_work_flow_financial_visualisation. It should work on your computer. (Please note I am not allowed to use any GitHub at the Bundesbank, therefore, I created a project with toy-data and send it to my 2DII Email adress, I checked it and it worked fine on my computer).
  2. Could you review the functions that I wrote and let me know if you understand what I am trying to do. If you look into the toy dataset you can see from sheet to sheet how I want to transform the data. I always checked after running the functions if the data that I created with my R-code produces the same toy data set in XLS that I created beforehand. I double checked and the code should be fine.
  3. The functions should be able to work for any benchmark ((upward) emission profiles) and any scenario/year ((upwards) sector profiles). Would be great if you could create functions for any kind of them. For now they only work for the former PCTR.
  4. For the research paper we would also need plots for time_series and box_plots: I am currently working on it and would update this after I did this.
  5. I also need to prepare the dataset even one step further. To do some analysis we need a dataset with only periode and company. This is indicated in the sheet "05d_company_periode". Who do you think is best to help me with this task? I think it is quite complicated but maybe it is also only using pivot_longer? Happy to hear you thoughts on this.
  6. Could you also please think about the "bigger" infrastructure of this workflow?

Several notes:

  1. Please note that this code is only for central banks. The difference is that for central banks you always will have many banks (bank_groups) included. That means that the whole step 5 - averaging across companies would be missing as this step averages across different banking groups. How should we handle this? Maybe we can leave it like this for now for central banks and work on functions for banks later? Shall I also create a toy dataset how a loanbook would look like for a bank? How do we best proceed?
  2. Please note that there is the column "periode" included. Central banks are not required to have this column but I think most of them will also have dataset over time. I thought it is better to include it as if central banks only have one periode, it would just group according to one period. So should be fine right?
  3. Please also note that unfortunately I wasn't able to run Linda's code for the sankey plots on this dataset. I don't know if I did something wrong or if there is a mistake. So would be great to have a special look at the sankey plots. I saved the outputs that I derived in the Output file in the "financial_visualiation" file so that you can have a look on this. Weirdly on my own computer they look fine... don't understand it to be honest.
  4. I also needed to manipulate Linda's code a bit so that I was able to run it on the data that I have from the central bank. I don't know if it is correctly manipulated. I also saw some mistakes in the Lindas' code, for example she averaged one time across groups, but it should be a sum. I will create tickets for this.
  5. I tried to be transparent as possible of what I want to achieve with the code. Behind this are some methodological decisions. I don't know how much you would need to understand about those methodological decisions but at any time, please let me know if there are questions!
  6. There are still many to dos that I indicated within the scripts. Do you think it make sense that I still work on those or should I wait until you come up with a new infrastructure?

Okay, this was a lot of text and maybe you want to create many tickets for this. Please also note that I am not quite sure if this would belong to the tilt.plot repo... So feel free to move this ticket, edit it with your own words.

Please let me know if anything is unclear!

All the best Anne

maurolepore commented 10 months ago

Thanks @AnneSchoenauer for your effort writing the code and this issue. I divide my answer in the sections entitled to reflect your questions/comments.

Understand what this code does

Here I explain what I understand. Correct me if necessary.

https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html * The code in the `*visualisation.R` files is downstream from the code in the `*analysis.R` files. And I think it's already implemented in tiltPlot, except for minor differences that should be submitted as issues to tiltPlot (reminded below). * I do NOT understand how a user would get the data that is the input to the `*analysis.R` files. It doesn't seem to be the output of the tiltIndicator (reminded as a question below). ### Assess if the code is correct I can't know without unit tests, but almost certainly there are many bugs. For reference, tiltIndicator has 279 thoughtful tests, and we still continue to find bugs. ### Review functions Are you looking for feedback about anything in particular? In general it's a great start: * I think I understand what the code does, * I can run the code successfully, and * I identified important code that we need to incorporate in the tilt ecosystem. To incorporate the useful code into our packages, I suggest we do what we successfully did with each of the 4 tilt indicators, and with the code that now belongs to tiltIndicatorBefore and tiltIndicatorAfter. Each author submitted all their code in a project similar to the one here, along with a single README.Rmd file analog to the `*work_flow*.R` files here. They used [this guide](https://bit.ly/tilt-mvp-guide). In [this slack-message](https://2investinginitiative.slack.com/archives/C02NDH6SD3P/p1687039917932199) I link to a ds.incubator where I explain how I use that project-structure to then evolve it into an R package. ### Adapt the code so that it works with all indicators Any enhancement must wait until we first have turned this MVP into production code. We need to first test what we have so that we can change it safely to do new things. Tests avoid breaking old things while we add new things. ### Develop a dataset with two columns: `periode` and `company` * Where do those columns come from? This will help me think what's the best place for that dataset. * What do we need to do with those columns? Why is it hard? For the record, [here is gist](https://gist.github.com/maurolepore/ae8c684694b03f49abbe4a946a2a64c9) of what the data looks like. ### Where does this code fit in the bigger tilt "infrastructure"? * The code in the `*visualisation.R` files seems to bleong to tiltPlot and I believe it's already. What I see here seems to be copy-pasted from tiltPlot. * At least some code in `*analysis.R` belongs to either a new package or to tiltIndicator (see "entities" in the figure above). * Then there is some code that deals with wrangling data. I'm still not sure if it belongs to the tilt tool (e.g. tiltIndicator) or to data-management (e.g. tiltData, tiltIndicatorBefore). This will become clearer once we go through the process of turning this MVP into production code. ### How can we deal with large data? We already deal with long-running processes when we run tiltIndicator with real data (see this [article](https://2degreesinvesting.github.io/tiltIndicator/articles/handling-long-runtime.html)). But for this code in particular, it's still to early to worry about performance. The order is this: 1. Make it work, 2. Make it right, 3. Make it fast (performance optimization comes last). ### Develop two additional types of plots: time-series and box-plots Discuss it with Linda so we can include them in tiltPlot. ### TODO: Submit issues to tiltPlot * Can you show Linda the bug in `mode`? A [reprex](https://reprex.tidyverse.org/) would be great. * Does the tiltPlot package need to handle `periode`? What exactly do we need to do with that column? From what I see here it seems that the code simply picks a specific value of `periode` with something like `filter(data, periode == "xyz")`. If that's all, then that wouldn't belong to tiltPlot. Filtering data and plotting data are two very distinct operations, and `filter()` already does an excellent job, so the user can use `filter()` directly. ### QUESTIONS 1. Can you clarify if this code is useful only for the bundesbank, only for central banks, or partially for all banks? I'm trying to understand if we need special packages for a sub-group of users. 2. I see the `*analysis.R` code uses datasets that don't seem to be the output of tiltIndicator. Then, where would they come from? Do we need any code to prepare those datasets? The datasets I refer to here are these ones: - Toy_data_sets.xlsx - bank_specifics.xlsx - company_specifics.xlsx
maurolepore commented 10 months ago

Here was the bulk of my input. I think it's best to close this issue now and follow up in focused issues wherever it's relevant. Else this issue will remain open indefinitely.