catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

data_to_sqlite not outputting complete sql db #487

Closed grgmiller closed 4 years ago

grgmiller commented 4 years ago

Describe the bug

When I run pudl.convert.datapkg_to_sqlite.pkg_to_sqlite_db(), the module starts creating a new pudl.sqlite file in my sqlite folder. Over 20ish minutes, I see the size of the pudl.sqlite file growing, to over 500MB. However, as soon as the module finishes executing, the size of the file drops back down to 127 kB, and appears to be empty.

Bug Severity

How badly is this bug affecting you? High:

To Reproduce

I have already run the ETL process to create a file ("2017_data" in my datapackage folder.

code: pudl_settings = pudl.workspace.setup.get_defaults() pudl.convert.datapkg_to_sqlite.pkg_to_sqlite_db(pudl_settings,'2017_data')

Expected behavior

I expect that a complete pudl.sqllite db file to be created

Software Environment?

Additional context

Add any other context about the problem here.

grgmiller commented 4 years ago

Note: this happened both when I ran datapkg_to_sqlite --pkg_bundle_name 2017_data in the command line, and when I ran the above code in python.

zaneselvans commented 4 years ago

Hmm, I haven't seen anything like that. Can you post the input YML file you're using that specifies which data sources and years to load? Are you loading the CEMS data? There was a bug at some point that only ends up loading the last partition of the CEMS, but it wouldn't end up giving you a totally empty DB. Something like this happens occasionally in the testing -- where the system has been told to clean up after itself after running the tests and drops the database if everything is successful, but that shouldn't happen when using the scripts.

grgmiller commented 4 years ago

Here's the YML code copied to a text file: 2017_data.txt

I am loading an entire year's worth of CEMS data.

Upon further inspecting the pudl.sqlite database, it looks like it retains all of the table names and column names, but erases all of the data... running epa_cems = pd.read_sql("""SELECT * FROM hourly_emissions_epacems""", pudl_engine) Gives me an empty dataframe with 19 columns.

zaneselvans commented 4 years ago

Hmmm... @cmgosnell this doesn't look like the CEMS last-partition-only bug, if the other requested tables also aren't getting loaded or retained at all. Does it behave differently if you only ask it to load the 2017 EIA860 data too? Having different years in different datasets (923 and 860) isn't something we've ever really tested.

# This file controls the PUDL ETL process, and is used as input to pudl_etl

pkg_bundle_name: 2017_data
pkg_bundle_settings:
  - name: epacems
    title: EPA Continuous Emissions Monitoring System Hourly
    description: Hourly emissions, power output, heat rates, and other data for most US fossil fuel plants.
    datasets:
      - eia:
          eia923_tables:
            - generation_fuel_eia923
            - boiler_fuel_eia923
            - generation_eia923
            - coalmine_eia923
            - fuel_receipts_costs_eia923
          eia923_years: [2017]
          eia860_tables:
            - boiler_generator_assn_eia860
            - utilities_eia860
            - plants_eia860
            - generators_eia860
            - ownership_eia860
          eia860_years: [2011,2012,2013,2014,2015,2016,2017]
      - epacems:
          epacems_years: [2017]
          epacems_states: [ALL]
          partition:
            hourly_emissions_epacems: epacems_years
zaneselvans commented 4 years ago

In any case @grgmiller I suspect that there's a good chance that whatever this issue is, it's been fixed in the current development version of PUDL.

grgmiller commented 4 years ago

When I ask it to load only the 2017 data for 860 in the YML file, I ran into this issue: https://github.com/catalyst-cooperative/pudl/issues/467 which is why I had been asking it to load all of the past 860 data.

When do you think that the current development version will be available?

zaneselvans commented 4 years ago

Ahh, right. It just happens to work on the Idaho data that we use in the CI tests because there are so few plants and they are all identifiable from the 2017 data.

We are going to do another software release with all of the updates before the end of the year.

zaneselvans commented 4 years ago

I'm thinking that this is all working for you now with the most recent release so Imma close this issue.