Rationalize VMA dataframe

DentonGentry commented 4 years ago

We implemented VMA support to first take a CSV file with columns defined as they are in Excel: ['Conversion calculation', 'Original Units', 'Raw Data Input', 'Weight', 'Exclude Data?', 'World / Drawdown Region', 'Thermal-Moisture Regime']

we then have model/vma.py:_read_csv simplify and rename these columns to: ['Value', 'Units', 'Raw', 'Weight', 'Exclude?', 'Region', 'Main Region', 'TMR']

This will not make much sense once Excel is retired. We should move the logic which manipulates columns into their final form out of model/vma.py and into tools/vma_xls_extract.py, which will run while pulling the VMA values out of an Excel file and can write the CSVs in their final form.

This will also mean removing model/vma.py::VMA.source_data.

rifkinni commented 4 years ago

@FranzEricSchneider and I are going to take a stab at this one!

FranzEricSchneider commented 4 years ago

Question - are there Excel files somewhere we could see that tools/solution_xls_extract.py is meant to be run on? It would be good to see it running as it's meant to.

DentonGentry commented 4 years ago

The public versions of the Excel files are checked into the repository. I removed them once tests/test_excel_integration.py evolved to the point of not needing them any more, as they were quite large, but they are still in the history. They are stored using git LFS so they don't take up local disk space for people who check out HEAD.

git revert dcad0e37763807602e1220c9477dd24667127aa4

will restore the deleted Excel files. Look in solution/*/testdata/*.xlsm to find them. If you do revert this to get the Excel files back, please make sure not to include it in a future pull request as this would substantially increases the local disk space needed for others who check out the repo.

DentonGentry commented 4 years ago

One other thing to consider: https://github.com/ProjectDrawdown/solutions/issues/2 is about supporting Excel files for VMA, not just CSV as we currently do.

It is worth considering whether we can take the VMA Data sheet from the existing Drawdown Excel files, exactly as it is, and use it with the Python models. The research team has been comfortably using the Excel files, it might help acceptance to let them format the inputs in the same way as they are used to.

FranzEricSchneider commented 4 years ago

@DentonGentry the git revert command that you posted did bring back a number of .xlsm files, but I was unable to process the excel files into CSVs. Am I calling the wrong script here, or is there another way to turn the Excel files into CSVs?

(solutions) solutions $ python3 tools/solution_xls_extract.py --outputdir temp --excelfile  solution/peatlands/testdata/Drawdown-Peatland\ Protection_BioS.Prot_v1.1_3Jan2019_PUBLIC.xlsm 
Traceback (most recent call last):
  File "tools/solution_xls_extract.py", line 1666, in <module>
    output_solution_python_file(outputdir=outputdirpath, xl_filename=args.excelfile)
  File "tools/solution_xls_extract.py", line 1426, in output_solution_python_file
    wb = xlrd.open_workbook(filename=xl_filename)
  File "/home/eric/.local/share/virtualenvs/solutions-BFrQ741t/lib/python3.8/site-packages/xlrd/__init__.py", line 148, in open_workbook
    bk = book.open_workbook_xls(
  File "/home/eric/.local/share/virtualenvs/solutions-BFrQ741t/lib/python3.8/site-packages/xlrd/book.py", line 92, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "/home/eric/.local/share/virtualenvs/solutions-BFrQ741t/lib/python3.8/site-packages/xlrd/book.py", line 1278, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "/home/eric/.local/share/virtualenvs/solutions-BFrQ741t/lib/python3.8/site-packages/xlrd/book.py", line 1272, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'version '

FranzEricSchneider commented 4 years ago

Also, another clarifying question on what you expect. Inside of VMA._read_csv columns are renamed but there are also operations done on certain columns. Some examples:

self.df['Raw'] = xl_df['Raw Data Input'].apply(convert_percentages)
self.df['Exclude?'] = xl_df['Exclude Data?'].fillna(False)

Is this issue intended to bring only the column renaming into vma_xls_extract.py? Or should these operations also move into vma_xls_extract.py to produce consistent CSVs?

DentonGentry commented 4 years ago

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'version '

That looks like git LFS. The xlsm files are big enough that large file storage would be handling them. The usual git heap only holds a small record indicating where LFS would go to find the file. I think using "git lfs install", then removing your reverted commit and doing the revert again (to let LFS have a chance to get involved) would resolve it.

We started using LFS because the git object heap had grown to ~1 GByte due to all of the big files, which weren't even in the current HEAD but had to be kept in the history.

I typically use: python3 tools/solution_xls_extract.py --outputdir solution/peatlands If not given a specific --excelfile argument, it will look in ${outputdir}/testdata to find one.

DentonGentry commented 4 years ago

Is this issue intended to bring only the column renaming into vma_xls_extract.py? Or should these operations also move into vma_xls_extract.py to produce consistent CSVs?

The issue as originally filed would be to move both the renaming and column operations into vma_xls_extract.py

However, once you're to the point of being able to look at the XLSM file and the format of the "VMA Data" sheet, it is worth considering whether to instead implement issue #2. We would:

make a new file in solution/\<name>/vma_data/VMA.xlsx or similar name. The spreadsheet model you're working with is XLSM because it uses macros, but the VMAs alone shouldn't use macros.
copy the VMA Data sheet from the original Excel file into solution/\<name>/vma_data/VMA.xlsx
make model/vma.py find the VMA it is looking for in VMA.xlsx

I bring this up because I'm finding that even having the files be CSV is posing an issue with acceptance among researchers trying to use this. Allowing them to work with Excel files, where things like column widths or colors in cells will be preserved, can honestly help.

We don't want to require that a copy of Excel be purchased, but at this point the XLSX format is supported widely by other desktop software and by Google Docs/Zoho/Office 365 Free/etc.

FranzEricSchneider commented 4 years ago

That git lfs install step was perfect, thanks. It makes sense that you're using git lfs, I just didn't realize it had that startup step. The Excel files look good after that.

That's good to know, it sounds like #2 would be preferable to just this ticket alone. We'll take a look at it!

ProjectDrawdown / solutions

Rationalize VMA dataframe #7