catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

ferc1_to_sqlite not finding F1_PUB.DBC #749

Closed grgmiller closed 3 years ago

grgmiller commented 3 years ago

Describe the bug

I am running ferc1_to_sqlite settings/ferc1_to_sqlite.yml --sandbox using the development setup (pudl based on current dev branch) and am getting the following error:

2020-09-11 11:35:18 [    INFO] pudl.extract.ferc1:456 Dropping the old FERC Form 1 SQLite DB if it exists.
2020-09-11 11:35:18 [    INFO] pudl.extract.ferc1:469 Creating a new database schema based on 2018.
Traceback (most recent call last):
  File "c:\users\greg\github\pudl\src\pudl\extract\ferc1.py", line 111, in get_file
    f = z.open(dbc_path)
  File "C:\Users\Greg\Anaconda3\envs\pudl-dev\lib\zipfile.py", line 1514, in open
    zinfo = self.getinfo(name)
  File "C:\Users\Greg\Anaconda3\envs\pudl-dev\lib\zipfile.py", line 1441, in getinfo
    raise KeyError(
KeyError: "There is no item named 'UPLOADERS\\\\FORM1\\\\working\\\\F1_PUB.DBC' in the archive"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Greg\Anaconda3\envs\pudl-dev\Scripts\ferc1_to_sqlite-script.py", line 33, in <module>
    sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'ferc1_to_sqlite')())
  File "c:\users\greg\github\pudl\src\pudl\convert\ferc1_to_sqlite.py", line 108, in main
    pudl.extract.ferc1.dbf2sqlite(
  File "c:\users\greg\github\pudl\src\pudl\extract\ferc1.py", line 475, in dbf2sqlite
    dbc_map = get_dbc_map(ds, refyear)
  File "c:\users\greg\github\pudl\src\pudl\extract\ferc1.py", line 286, in get_dbc_map
    dbc = ds.get_file(year, "F1_PUB.DBC")
  File "c:\users\greg\github\pudl\src\pudl\extract\ferc1.py", line 113, in get_file
    raise KeyError(f"{dbc_path} is not available in {year} archive.")
KeyError: 'UPLOADERS\\FORM1\\working\\F1_PUB.DBC is not available in 2018 archive.'

To Reproduce

I downloaded the datastore locally using pudl_datastore --sandbox --verbose --loglevel DEBUG, and the ferc1 data is stored in a folder called "10.5072-zenodo.656695".

Text doc containing yml file info: ferc1_to_sqlite.txt

Additional context

Zane's Guess: My guess is that the real problem here is that your machine (Windows) uses one path separator (backslash), but the paths inside the zipfile are not Windows filesystem paths, and the archive expects a different separator (forward slash), or something like that.
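To illustrate the guess above, here is a minimal sketch using only the Python standard library. It is not PUDL's actual code or fix. It assumes the archive stores the member as UPLOADERS/FORM1/working/F1_PUB.DBC with forward slashes (inferred from the traceback, not verified against the real FERC archive), and the helper names zip_member_name and build_demo_archive are hypothetical. It shows that zipfile looks up member names literally, so a backslash-joined name raises KeyError while a forward-slash name works on any OS.

```python
import io
import zipfile
from pathlib import PurePosixPath, PureWindowsPath


def zip_member_name(*parts: str) -> str:
    """Join path parts with forward slashes, regardless of the host OS.

    Hypothetical helper for illustration; not part of PUDL.
    """
    return str(PurePosixPath(*parts))


def build_demo_archive() -> zipfile.ZipFile:
    """Build an in-memory zip with one member, mimicking the ASSUMED layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as z:
        z.writestr("UPLOADERS/FORM1/working/F1_PUB.DBC", b"placeholder bytes")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


archive = build_demo_archive()

# A name joined with backslashes (what a Windows-style join would produce)
# does not match any member, because zipfile compares names literally.
windows_style = "UPLOADERS\\FORM1\\working\\F1_PUB.DBC"
try:
    archive.getinfo(windows_style)
    backslash_lookup_failed = False
except KeyError:
    backslash_lookup_failed = True

# Building the member name with forward slashes works on every platform.
portable_name = zip_member_name("UPLOADERS", "FORM1", "working", "F1_PUB.DBC")
with archive.open(portable_name) as f:
    payload = f.read()

# An existing Windows-style string can also be converted after the fact.
converted = PureWindowsPath(windows_style).as_posix()
```

If this is indeed the cause, building the in-archive path with PurePosixPath (or plain "/" joins) rather than an OS-dependent join would keep the lookup independent of the host filesystem.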

zaneselvans commented 3 years ago

Hey @grgmiller I've been looking into how we could get these path-related issues working on Windows, and more generally how we could run all of the automated tests on Windows through GitHub Actions, and it's not looking like a particularly straightforward process, given the kinds of packages that PUDL depends on.

Would you be willing to try installing the Windows Subsystem for Linux (WSL2 with Ubuntu version 20.04) and the Windows Terminal (to provide a command line interface to the WSL) and setting up PUDL within that environment? I suspect this will give us the most uniform environment and make debugging issues easier, since the underlying PyData tools (and our development experience) are fairly centered on a Unix-like OS environment, and we would be able to walk you through the setup process there in great detail.

For folks who are just users of PUDL, I think we're moving toward using Docker to standardize the software/OS environment, but I don't think this arrangement is going to work very well if you're also doing a bunch of development, editing the code, etc. Though maybe @rousik who is helping us work out the containerization stuff would know better.