Closed dongqi-wu closed 2 years ago
Before I have the chance to review `doe/doe_data.py` and `geo/geo_data.py` more thoroughly, I have a few initial questions and comments:

- Have the data files in `raw` gone through the data-intake process?
- I understand that the files in `raw` are needed for this process, but do they need to be stored in the repository? Is it possible for these files to just be downloaded for use in the creation of the flexibility profiles (i.e., downloaded and deleted, or downloaded elsewhere)? If the files should be stored locally, are we certain that this data is not already present elsewhere in PreREISE (i.e., to avoid redundancy)?
- Similarly, is there a reason that the .pkl files in `cache` need to be saved within the repository? I'm sure the files will be useful for future profile creation and the .pkl files should be faster than accessing the .csv files each time, but is there a reason that such an intermediate process will be needed (i.e., if the .csv files are loaded in memory and a user is creating their desired flexibility profiles, will the .pkl files be needed)?
- Be sure to check out the conventions we use for writing docstrings in your Python files (i.e., `doe/doe_data.py` and `geo/geo_data.py`). Other files within this repository should probably serve as a good example.

The second and third bullet points are meant to generate more of a discussion as we think about implementation.
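On the .pkl question: a minimal sketch of the .csv-to-.pkl caching pattern being discussed, assuming hypothetical file names and a stdlib-only loader (the actual doe/geo modules presumably use pandas, so this is illustrative only). If the .pkl files stay out of the repository, a helper like this can regenerate them on first use, which sidesteps the question of committing them:

```python
import csv
import pickle
from pathlib import Path


def load_with_cache(csv_path, cache_path):
    """Load rows from csv_path, reusing a pickled cache when present.

    Hypothetical helper; the real doe/geo modules may differ.
    """
    csv_path, cache_path = Path(csv_path), Path(cache_path)
    if cache_path.is_file():
        # Fast path: deserialize the previously cached rows.
        return pickle.loads(cache_path.read_bytes())
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    # Save the parsed rows so later runs skip the CSV parse.
    cache_path.write_bytes(pickle.dumps(rows))
    return rows
```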
@dongqi-wu the best reference for the Python docstring conventions is https://breakthrough-energy.github.io/docs/dev/docstring_guide.html.
@dongqi-wu, you can create a data-intake issue. There is a template under "New Issue". Thanks.
If we want to go through the entire workflow from source files to outputs, what's the order in which the functions should be called? I see `test.py`, which goes through the downloading and aggregation of the DOE data, but nothing about the order in which the `geo_data.py` functions should be called.
There will be separate files later. I was not sure which repository this code should go in, so I did not upload them. They will be like the `test.py` functions that do the work from downloading the data to saving the .pkl cache files.
It seems like this workflow would only need as input an arbitrary list of lat/lon pairs, right? The format could either be a dataframe with two columns, or an iterable of tuples. To do this step manually for the whole grid, we could pass the list of substation coordinates from PowerSimData, or we could just pass a couple locations for a quick test of all the relevant functionality.
Yes. I have another intermediate file that I did not upload that contains a dataframe of bus id and lat/lon pairs for all buses. If one just wants to test with a few buses, they can use a small dataframe with a few rows.
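The dual input contract above (a two-column dataframe or an iterable of tuples) could be normalized with a small adapter; the helper name and the `lat`/`lon` attribute names are assumptions for illustration, not part of the PR:

```python
def normalize_coords(coords):
    """Return a list of (lat, lon) float pairs.

    Accepts either an object exposing .itertuples() with 'lat'/'lon'
    attributes (e.g. a two-column pandas DataFrame) or any plain
    iterable of (lat, lon) tuples.  Hypothetical helper only.
    """
    if hasattr(coords, "itertuples"):
        # DataFrame-like input: pull lat/lon off each row tuple.
        return [(float(r.lat), float(r.lon)) for r in coords.itertuples()]
    # Plain iterable of (lat, lon) pairs.
    return [(float(lat), float(lon)) for lat, lon in coords]
```

With this in place, a quick test could pass just a couple of tuples, while the full-grid run could pass the substation-coordinate dataframe.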
There are a lot of `/` and `\\` separators in the `os.path` and `os.isfile` calls. We should be careful and make sure the code is platform independent.
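For portability, hard-coded separators can be replaced with `os.path.join` or `pathlib`, which pick the correct separator for the host OS (the file names below are illustrative):

```python
import os
from pathlib import Path

# Hard-coded separators are not portable:
windows_only = "cache\\doe_data.pkl"  # backslashes break on Linux/macOS

# os.path.join inserts the separator appropriate for the host OS:
portable = os.path.join("cache", "doe_data.pkl")

# pathlib composes paths portably as well, and is often clearer:
cache_file = Path("cache") / "doe_data.pkl"
assert str(cache_file) == portable
```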
The new commit should have fixed the style and formatting comments (at least according to my local flake8/black). I have also added example scripts that contain the function calls under each subfolder, and test functions to test the geopy/FCC APIs.
Overall, it looks good to me! I left some comments, but they are pretty minor. Once these comments are addressed, I think this PR can be merged. I'm not sure if others wanted to look it over before merging, so maybe double check with @BainanXia first.
Hi Lanes, thanks for the review. I will update the PR soon.
Sorry for the delay. I went through the code and it looks good in general. I appreciate the effort addressing all the comments so far. I left some comments by thinking out loud here, so @dongqi-wu please don't be scared if any of the suggestions potentially induces a large refactor; we can save it for future PRs, which is on us :) In addition to those inline comments, I have the following questions to discuss with the team:
- Given the directory named "flexibilitydata", we might have other flexibility workflows integrated here as well. Hence, I would suggest having the `doe` subfolder be the root directory of everything in this PR (except for the `readme.md`, which is supposed to be at the same level as `doe`), then have all the modules, `bus_data.py`, `geo_data.py`, `doe_data.py`, and the corresponding examples there (remove the subfolders `bus` and `geo`). Also, put the `tests` folder (rename `test` to `tests`) under `doe` following the convention.
- We will need to create an empty init file under "flexibilitydata", similarly to "hydrodata", etc.
- Could anyone remind me where the population-related info is used in the two workflows of assigning `eia_id` to buses and aggregating demand flexibility percentage by LSE based on the DOE data?
@BainanXia I have fixed the code on most comments except the sub.csv one. I read the code, and there would be a lot of changes, from inputs to caches to processing scripts, if we switch from querying by bus to querying by substation. All the buses have already been processed anyway, and there is no need to run this entire thing again unless a completely new synthetic grid is coming, so repetitive work is probably not a concern. However, for adding new buses to PowerSimData, I wonder how it is done currently? Does the user need to specify a sub id for every new bus in the system? If so, can users also create new substations? Depending on how adding a bus in PowerSimData works, it might be worth doing the refactor to keep things consistent.
Thanks for taking care of those comments.
I understand some of them might induce a major refactor, and we can leave it for future refactors. There will be a completely new synthetic grid based on the HIFLD database rolling out in the near future, as you might be aware ;) Also, just want to make sure you looped through all the comments above, given some of them are hidden by the GitHub UI and I don't see your feedback on, or changes addressing, those, starting from this to here.
Regarding the current logic of adding new buses, we don't ask the user to specify a `sub_id` during the process; instead, the user is required to specify the coordinates (lat, lon) of the new bus. We then check whether there is already a substation at that location; if so, the new bus is added to the existing substation in the `bus2sub` table; otherwise, a new substation will be created at the new location. The corresponding implementation can be found in bus.py and transform_grid.py.
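The lookup described above can be sketched roughly as follows; all names here (`add_bus`, and `sub_coords`/`bus2sub` as plain dicts) are simplifications for illustration, not the actual PowerSimData tables in bus.py/transform_grid.py:

```python
def add_bus(lat, lon, sub_coords, bus2sub, tol=1e-9):
    """Attach a new bus at (lat, lon) to an existing substation if one
    is already at that location, otherwise create a new substation.

    sub_coords: dict of sub_id -> (lat, lon)
    bus2sub:    dict of bus_id -> sub_id
    Simplified sketch of the described logic; names are hypothetical.
    """
    for sub_id, (slat, slon) in sub_coords.items():
        if abs(slat - lat) < tol and abs(slon - lon) < tol:
            break  # reuse the substation already at this location
    else:
        # No substation here yet: create one with the next free id.
        sub_id = max(sub_coords, default=0) + 1
        sub_coords[sub_id] = (lat, lon)
    bus_id = max(bus2sub, default=0) + 1
    bus2sub[bus_id] = sub_id
    return bus_id, sub_id
```

If the flexibility workflow ever switches to querying by substation, a mapping like this would let new buses inherit the cached data of their substation.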
On a side note, maybe Dr. Xie has told you, but I was diagnosed with COVID-19 three days after my plane landed, and that's why I had to cancel the meeting on the 19th. I am feeling better now (at least my fever has subsided), so I can start working on something, and I hope I can finish this version soon. I am still stuck in Hong Kong now, and I hope I will be able to enter the mainland several days before the next meeting so we can try to find a better time slot for future meetings.
Hope you have a speedy recovery 🙏 In the meantime, I've marked the comments that are resolved as "resolved". We can put aside the `sub.csv`-related idea for now. Let's clean up the remaining concerns once you're back online.
As for the failing pytests, it looks like there is a problem with the `test_get_bus_zip` test in `test_bus_data.py`.
The new commit should have fixed the tests. I did not fix them earlier because I was taking a flight from HK to mainland China two days ago and was quite busy with all the stuff. However, it seems the commit history is messed up. I updated the readme.md file in the browser, which created a commit that was not on my local PC when I corrected the problems that were causing the tests to fail, and my GitHub client merged the two commits together. I tried to clean it up using git in PowerShell, but it seems it cannot work through my VPN correctly, so I couldn't push from PowerShell, while pushing through the GitHub client creates those merge commits. I am still trying to find a way around this. Meanwhile, the code itself should be fine and should pass the tests, as I ran pytest and flake8 on my local folder.
Sounds good, and thanks for the effort. I think this is good to go once some minor docstrings are fixed and the commit history is cleaned up. Feel free to let us know if you need help with the latter.
It seems to be a network connectivity issue (see screenshot). I am going to clean the history, then edit the docstring and rebase.
Edit: I have fixed the history and the docstring typos. On second thought, maybe it's better to run the tests before I rebase, in case anything comes up.
Thanks for taking care of this! It looks like the tests passed, so once you're done with any rebasing you wanted to do, it looks like it should be good to merge.
@lanesmith @BainanXia I have just rebased, please see the latest commit.
Thanks! It's not necessary to squash everything into one commit though.
Pull Request doc
Purpose
Add support for downloading, cleaning, preparing, and formatting DOE demand flexibility data. This PR serves as an intermediate step to make part of WIP code available for review and comments.
What the code is doing
1) Download DOE flexibility data and perform preliminary cleanup.
2) Collect geographical information for buses in the synthetic grid.
3) Collect and prepare population, FIPS, and ZIP data to be used in mapping flexibility data to the synthetic grid (WIP).
Testing
The functions are tested individually, but a more integrated and comprehensive test will be needed in the future.
Where to look
1) Python scripts under the folders `doe` and `geo`
2) Cached pickle files under the folder `cache`
Time estimate
45 minutes