arfc / pride

(P)lan for (R)ap(I)d (DE)carbonization
BSD 3-Clause "New" or "Revised" License

EIA Hospital Market Analysis #78

Closed datw0258 closed 4 years ago

datw0258 commented 4 years ago

Updated eia.py, the associated education_market_analysis.ipynb, and the tests to reflect the addition of the medical sector. Also added a notebook (.ipynb) that shows the medical data.

pep8speaks commented 4 years ago

Hello @datw0258! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-07-24 18:26:05 UTC

samgdotson commented 4 years ago

To be clear: I am asking you to change pd.read_csv(usecols=['column1', 'column2']) to pd.read_csv(usecols=[1, 2]).

To save you some time, the column numbers are:

usecols=[4, 8, 10, 13, 14, 95]

datw0258 commented 4 years ago

Thanks for making the changes to the relative path. There are a couple of other things that need to be addressed, however.

The notebooks will never run on a Unix machine, as written. Why is that?

  • Whenever a file is saved on a Windows machine, it is written with DOS line endings, "\r\n" (Windows = DOS machine), rather than the Unix "\n". A column header that looks like "Reported Primary Mover" on Windows looks like "Reported Primary\n Mover" on a Linux/Unix machine. So a Windows machine can look the column up by name, because it ignores the '\n', but a Unix machine cannot.
  • There are a couple of solutions. One is to change the column names in eia.py to reflect the DOS line endings (that should be okay on Windows too). The better solution is to use column numbers rather than the names themselves. You can find the column numbers with the following, which prints each number and column name pair:
df = pd.read_csv(filename)
for key in enumerate(df.keys()):
    print(key)

Then, you will be able to read a csv file as

df = pd.read_csv(filename, header=5, usecols=[4, 8, 10])

Word of caution: These column numbers refer to the column's position in the spreadsheet, counting from zero. Once the data is loaded, the dataframe only holds the selected columns, so their positions in the dataframe will be different. As long as the order is consistent, that should be okay. Column numbers are superior to names because the code will then run on any machine.
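As a rough illustration of why positions are safer than names, here is a minimal sketch. The three-column CSV text is made up for the example (it is not one of the EIA files), and the header cell with the embedded line break only imitates the problem described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for an EIA sheet: the second header cell contains
# an embedded line break, so its exact name depends on how the file was saved.
raw = (
    'Plant Name,"Reported Primary\nMover",Net Generation\n'
    "Alpha,ST,100\n"
    "Beta,GT,250\n"
)

# Select columns by their zero-based position in the file, not by name.
df = pd.read_csv(io.StringIO(raw), usecols=[0, 2])

print(list(df.columns))  # ['Plant Name', 'Net Generation']

# Inside the dataframe the positions are renumbered: file column 2 is now column 1.
net_generation = df.iloc[:, 1].tolist()
```

Because the selection is positional, the line break inside the header never has to be matched, and file column 2 becomes dataframe column 1, which is the renumbering the word of caution refers to.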

Finally, some thoughts:

  • Windows can read Unix line endings, but not the other way around. That is, Unix can read Windows files, but every line break will still carry the extra "\r" from "\r\n".
  • Shorter column names will fix this too (but you'd have to go into the files themselves and change them, which is tedious and unnecessary). If you find yourself writing a dataframe to export, use short column names.
  • If you need to write an input file for SERPENT, you should use
with open(filename, 'wb') as file:
[...]

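because 'wb' opens the file in binary mode, so Python does not translate "\n" into "\r\n" on Windows and the file keeps Unix line endings. Note that in binary mode you have to write bytes rather than str.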
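For reference, a minimal sketch of two ways to force Unix line endings from Python on any platform. The file name and the two placeholder lines are made up for the example and are not real SERPENT input.

```python
import os
import tempfile

# Placeholder content, not real SERPENT syntax.
lines = ["first line", "second line"]

# Hypothetical output path in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "example_input.txt")

# Option 1: text mode with newline="\n", which turns off the translation
# of "\n" into the platform default ("\r\n" on Windows).
with open(path, "w", newline="\n") as f:
    for line in lines:
        f.write(line + "\n")

# Option 2: binary mode, as suggested above. Nothing is translated,
# but the data must be encoded to bytes first.
with open(path, "wb") as f:
    f.write(("\n".join(lines) + "\n").encode("utf-8"))

# Read the raw bytes back to confirm there is no carriage return.
with open(path, "rb") as f:
    data = f.read()

print(b"\r" in data)  # False
```

Either option gives the same bytes on Windows and Unix. The text-mode variant is usually more convenient because it still accepts ordinary strings.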

I believe I fixed this issue by changing the .csv files in the box to remove the \n. Is this a separate issue?

samgdotson commented 4 years ago

I downloaded the files in box today and the issue remains.

Edit: It looks like you changed only the 2018 file. Since you reference other files, those need to be changed as well.