Breakthrough-Energy / PreREISE

Generate input data for scenario framework
https://breakthrough-energy.github.io/docs/
MIT License

feat: initial files for DOE flexibility data #282

Closed dongqi-wu closed 2 years ago

dongqi-wu commented 2 years ago

Pull Request doc

Purpose

Add support for downloading, cleaning, preparing, and formatting DOE demand flexibility data. This PR serves as an intermediate step to make part of a work-in-progress codebase available for review and comments.

What the code is doing

1. Download DOE flexibility data and perform preliminary cleanup.
2. Collect geographical information for buses in the synthetic grid.
3. Collect and prepare population, FIPS, and ZIP data to be used in mapping flexibility data to the synthetic grid (WIP).

Testing

The functions are tested individually, but a more integrated and comprehensive test will be needed in the future.

Where to look

1. Python scripts under the doe and geo folders.
2. Cached pickle files under the cache folder.

Time estimate

45 minutes

lanesmith commented 2 years ago

Before I have the chance to review doe/doe_data.py and geo/geo_data.py more thoroughly, I have a few initial questions and comments:

  • Have the data files in raw gone through the data-intake process?
  • I understand that the files in raw are needed for this process, but do they need to be stored in the repository? Is it possible for these files to just be downloaded for use in the creation of the flexibility profiles (i.e., downloaded and deleted, or downloaded elsewhere)? If the files should be stored locally, are we certain that this data is not already present elsewhere in PreREISE (i.e., to avoid redundancy)?
  • Similarly, is there a reason that the .pkl files in cache need to be saved within the repository? I'm sure the files will be useful for future profile creation, and the .pkl files should be faster than accessing the .csv files each time, but is there a reason that such an intermediate step will be needed (i.e., if the .csv files are loaded in memory and a user is creating their desired flexibility profiles, will the .pkl files be needed)?
  • Be sure to check out the conventions we use for writing docstrings in your Python files (i.e., doe/doe_data.py and geo/geo_data.py). Other files within this repository should probably serve as a good example.

The second and third bullet points are meant to generate more of a discussion as we think about implementation.

dongqi-wu commented 2 years ago

Responding to the points above:

  1. No.
  2. No.
  3. Yes. Running the code to get that data from the APIs can take 10+ hours, and the server that runs those APIs is not very reliable itself (a sketch of the resulting cache-or-fetch pattern is below).
  4. Ah, I copied the format from the Julia code in REISE; it seems you have a different format for Python. I will change it later.
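
Since the slow API queries are what motivate the .pkl caches, here is a minimal sketch of the cache-or-fetch pattern being described; the file path and builder function are placeholders, not the PR's actual code:

```python
import os
import pickle


def load_or_build(cache_path, build_func):
    """Return cached data if present; otherwise build it and cache the result.

    :param str cache_path: location of the .pkl cache file.
    :param callable build_func: function that performs the slow API queries.
    :return: (*object*) -- cached or freshly built data.
    """
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build_func()  # can take 10+ hours when it hits the external APIs
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```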
danielolsen commented 2 years ago

@dongqi-wu the best reference for the Python docstring conventions is https://breakthrough-energy.github.io/docs/dev/docstring_guide.html.
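
For illustration only, a short docstring in the Sphinx field-list style that the guide covers; the function and its parameters here are hypothetical:

```python
def get_bus_fips(bus_coords, timeout=10):
    """Look up the FIPS county code for each bus location.

    :param pandas.DataFrame bus_coords: data frame with columns 'lat' and 'lon'.
    :param int timeout: seconds to wait for each geocoding API response.
    :return: (*dict*) -- mapping of bus index to FIPS code.
    :raises ValueError: if ``bus_coords`` is missing a required column.
    """
```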

rouille commented 2 years ago

@dongqi-wu, you can create a data-intake issue. There is a template under New Issue. Thanks.

danielolsen commented 2 years ago

If we want to go through the entire workflow from source files to outputs, what's the order in which the functions should be called? I see test.py which goes through the downloading and aggregation of the DOE data, but nothing about the order in which the geo_data.py functions should be called.

dongqi-wu commented 2 years ago

There will be separate files later. I was not sure which repository this code should go in, so I did not upload them. They will be like the test.py functions that do the work from downloading the data to saving the .pkl cache files.

danielolsen commented 2 years ago

It seems like this workflow would only need as input an arbitrary list of lat/lon pairs, right? The format could either be a dataframe with two columns, or an iterable of tuples. To do this step manually for the whole grid, we could pass the list of substation coordinates from PowerSimData, or we could just pass a couple locations for a quick test of all the relevant functionality.

dongqi-wu commented 2 years ago

Yes. I have another intermediate file that I did not upload that contains a dataframe of bus id and lat/lon pairs for all buses. If one just wants to test with a few buses, they can use a small dataframe with a few rows.
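
To make the two input formats concrete, a quick test could be driven by something like the following; the coordinates, index name, and column names are assumptions for illustration, not the PR's actual schema:

```python
import pandas as pd

# A handful of bus locations for a quick end-to-end test; the index is the bus id
# and the two columns are latitude and longitude in decimal degrees.
test_buses = pd.DataFrame(
    {"lat": [47.61, 34.05, 29.76], "lon": [-122.33, -118.24, -95.37]},
    index=pd.Index([101, 102, 103], name="bus_id"),
)

# The same locations as an iterable of (lat, lon) tuples, the other format mentioned.
coord_pairs = list(test_buses.itertuples(index=False, name=None))
```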

rouille commented 2 years ago

There are a lot of / and \\ separators in the os.path and os.isfile calls. We should be careful and make sure the code is platform independent.
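
A small sketch of what platform-independent path handling could look like, using os.path.join or pathlib instead of hard-coded separators; the directory and file names below are placeholders:

```python
import os
from pathlib import Path

# Portable: let the library choose the separator instead of writing "cache\\bus_fips.pkl".
cache_file = os.path.join("flexibilitydata", "cache", "bus_fips.pkl")
if os.path.isfile(cache_file):
    print(f"found cache at {cache_file}")

# pathlib equivalent; Path handles the separator on every platform.
cache_path = Path("flexibilitydata") / "cache" / "bus_fips.pkl"
if cache_path.is_file():
    print(f"found cache at {cache_path}")
```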

dongqi-wu commented 2 years ago

The new commit should have fixed the style and formatting comments (at least according to my local flake8/black). I have also added example scripts that contain the function calls under each subfolder, and a test function to test the geopy/FCC APIs.

dongqi-wu commented 2 years ago

Overall, it looks good to me! I left some comments, but they are pretty minor. Once these comments are addressed, I think this PR can be merged. I'm not sure if others wanted to look it over before merging, so maybe double check with @BainanXia first.

Hi Lanes, thanks for the review. I will update the PR soon.

dongqi-wu commented 2 years ago

Sorry for the delay. I went through the code and it looks good in general. I appreciate the effort in addressing all the comments so far. I left some comments thinking out loud here, so @dongqi-wu please don't be scared if any of the suggestions potentially induces a large refactor; we can save that for future PRs, which is on us :) In addition to those inline comments, I have the following questions to discuss with the team:

  1. Given the directory is named "flexibilitydata", we might have other flexibility workflows integrated here as well. Hence, I would suggest having the doe subfolder be the root directory of everything in this PR (except for the readme.md, which is supposed to be at the same level as doe), then having all the modules (bus_data.py, geo_data.py, doe_data.py) and the corresponding examples there (removing the subfolders bus and geo). Also, put the tests folder (renamed from test) under doe, following the convention (a rough sketch of this layout appears below).
  2. We will need to create an empty init file under "flexibilitydata", similar to "hydrodata", etc.
  3. Could anyone remind me where the population-related info is used in the two workflows of assigning eia_id to buses and aggregating demand flexibility percentage by LSE based on the DOE data?

Responding to the points above:

  1. I guess this can be quite easily addressed.
  2. Same.
  3. Actually, this traces back to the version several iterations ago, when the approach was to extract ZIP-level flexibility from the DOE raw data, find the ZIP/FIPS of each bus, and match using the geo codes. The way to produce FIPS-level flexibility (the DOE data + LSE service areas only directly give ZIP flexibility) was a weighted average of the flexibilities of all ZIP codes belonging to the FIPS county, with population data used as the weight. Then we decided to assign an LSE directly based on proximity and that part got removed. I think this version does not use the population part of the data; it just uses the ZIP <--> FIPS mapping contained there.
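
The layout suggested in point 1 might look roughly like this; the parent location and exact file names are assumptions based on the discussion (e.g., sitting alongside hydrodata), not a prescribed structure:

```
flexibilitydata/
├── __init__.py          # empty, per point 2
├── readme.md            # stays at this level
└── doe/
    ├── bus_data.py
    ├── geo_data.py
    ├── doe_data.py
    ├── ...example scripts...
    └── tests/           # renamed from "test"
```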
dongqi-wu commented 2 years ago

@BainanXia I have fixed the code for most of the comments except the sub.csv one. I read the code, and there would be a lot of changes from inputs to caches to processing scripts if we switch from querying by bus to querying by substation. All the buses have already been processed anyway, and there is no need to run this entire thing again unless a completely new synthetic grid comes along, so repetitive work is probably not a concern. However, for adding new buses to PowerSimData, I wonder how it is done currently? Does the user need to specify a sub_id for every new bus in the system? If so, can users also create new substations? Depending on how adding a bus in PowerSimData works, it might be worth doing the refactor to keep things consistent.

BainanXia commented 2 years ago

Thanks for taking care of those comments.

I understand some of them might induce a major refactor, and we can leave that for future refactors. There will be a completely new synthetic grid based on the HIFLD database rolling out from us in the near future, as you might be aware of ;) Also, I just want to make sure you looped through all the comments above, given some of them are hidden by the GitHub UI and I don't see your feedback on or changes addressing those, starting from this to here.

Regarding the current logic of adding new buses: we don't ask the user to specify a sub_id during the process. Instead, the user is required to specify the coordinates (lat, lon) of the new bus; we then check whether there is already a substation at that location. If so, the new bus is added to the existing substation in the bus2sub table; otherwise, a new substation is created at the new location. The corresponding implementation can be found in bus.py and transform_grid.py.
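
A hedged sketch of the lookup logic described above, not the actual PowerSimData implementation in bus.py or transform_grid.py; the function name, table layout, and tolerance are illustrative:

```python
import pandas as pd


def find_or_create_substation(lat, lon, sub, tol=1e-4):
    """Return the id of an existing substation at (lat, lon), or a fresh id.

    :param float lat: latitude of the new bus.
    :param float lon: longitude of the new bus.
    :param pandas.DataFrame sub: substation table with 'lat' and 'lon' columns,
        indexed by substation id.
    :param float tol: coordinate tolerance for treating locations as identical.
    :return: (*tuple*) -- (substation id, True if it already existed).
    """
    match = sub[(sub["lat"].sub(lat).abs() < tol) & (sub["lon"].sub(lon).abs() < tol)]
    if not match.empty:
        # Reuse the existing substation; the new bus maps to this id in bus2sub.
        return match.index[0], True
    # No substation at this location; the caller creates one with a fresh id.
    return sub.index.max() + 1, False


# Tiny usage example with two existing substations.
sub = pd.DataFrame({"lat": [47.61, 34.05], "lon": [-122.33, -118.24]}, index=[1, 2])
print(find_or_create_substation(47.61, -122.33, sub))  # -> (1, True)
print(find_or_create_substation(29.76, -95.37, sub))   # -> (3, False)
```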

dongqi-wu commented 2 years ago

On a side note, maybe Dr. Xie has told you, but I was diagnosed with COVID-19 three days after my plane landed, and that's why I had to cancel the meeting on the 19th. I am feeling better now (at least the fever is gone), so I can start working on something, and I hope I can finish this version soon. I am still stuck in Hong Kong, and I hope I will be able to enter the mainland several days before the next meeting, so we can try to find a better time slot for future meetings.

BainanXia commented 2 years ago

Hope you have a speedy recovery 🙏 In the meantime, I've marked the comments that are resolved as "resolved". We can put the sub.csv-related idea aside for now. Let's clean up the remaining concerns once you're back online.

lanesmith commented 2 years ago

As for the failing pytests, it looks like there is a problem with the test_get_bus_zip test in test_bus_data.py.

dongqi-wu commented 2 years ago

The new commit should have fixed the tests. I did not fix them earlier because I was taking a flight from HK to mainland China two days ago and was quite busy with everything. However, it seems the commit history is messed up. I updated the readme.md file in the browser, which created a commit that was not on my local PC when I corrected the problems causing the tests to fail, and my GitHub client merged the two commits together. I tried to clean it up using git in PowerShell, but it cannot seem to work through my VPN correctly, so I couldn't push from PowerShell; pushing through the GitHub client creates those merge commits. I am still trying to find a way around this. Meanwhile, the code itself should be fine and should pass the tests, as I ran pytest and flake8 on my local folder.

BainanXia commented 2 years ago

Sounds good, and thanks for the effort. I think this is good to go once the minor docstrings are fixed and the commit history is cleaned up. Feel free to let us know if you need help with the latter.

dongqi-wu commented 2 years ago

It seems to be a network connectivity issue (see screenshot). I am going to clean the history, then edit the docstrings and rebase.

Edit: I have fixed the history and the docstring typos. On second thought, maybe it's better to run the tests before I rebase, in case anything comes up.

lanesmith commented 2 years ago

Thanks for taking care of this! It looks like the tests passed, so once you're done with any rebasing you wanted to do, it should be good to merge.

dongqi-wu commented 2 years ago

@lanesmith @BainanXia I have just rebased, please see the latest commit.

BainanXia commented 2 years ago

Thanks! It's not necessary to squash everything into one commit though.