Ethan-Russell commented 1 year ago

Determine what is relevant data to output, and output to a helpful data structure.

Ethan-Russell commented 1 year ago

Here is a benchmark of a few common data structures, as well as the file used to make the benchmarks:

name	Read Time	Write Time	Filter Time	File Size
SQLite.jl	0.46877	0.27787475	0.11413145	2056192
Arrow.jl	0.00214215	0.0360178	0.1890002	1661834
XLSX.jl	1.3234326	17.8287769	1.5751236	2572148
CSV.jl	0.0234176	0.0542643	0.0393511	3888093
Serialization.jl	0.0011739	0.0211353	0.012326	1627562

using DataFrames
using BenchmarkTools
using Test
import Arrow
import CSV
import XLSX
import JSON
import BSON
import YAML
import SQLite
import SQLite.DBInterface
using Serialization
using Query

# Create a mix of test data to save.
# mat = rand(2000, 200)
strs = ["nuclear", "geothermal", "natural_gas", "direct_air_capture"]
nr = 2000
nc = 100
df = DataFrame("gentype"=>rand(strs, nr), "idx"=>1:nr, ["x_$n"=>rand(nr) for n in 1:nc]...)

# in_file = "L:/Project-Gurobi/Workspace3/E4ST/Output/Results/prj_aet_mc/aet_init/us_can/result_aet_init_us_can_Y2029_gen_res.xlsx"
# df = XLSX.readtable(in_file, "annual")

res = DataFrame(
    "name"=>String[],
    "Read Time"=>Float64[],
    "Write Time"=>Float64[],
    "Filter Time"=>Float64[],
    "File Size"=>Int64[]
)

@testset "Test Data Saving" begin
    n = "SQLite.jl"
    @testset "$n" begin
        filename = "testfile.sqlite"
        println("$n\n", "#"^80)
        println("Saving Table")
        bm_write = @benchmark (db = SQLite.DB($filename); $df |> SQLite.load!(db, "mytable"); close(db)) setup=(rm($filename, force=true))

        println("Loading Table")
        bm_read = @benchmark (db=SQLite.DB($filename); DBInterface.execute(db, "SELECT * FROM mytable") |> DataFrame; close(db))
        db = SQLite.DB(filename)
        dfnew = DBInterface.execute(db, "SELECT * FROM mytable") |> DataFrame
        close(db)

        bm_filter = @benchmark (db = SQLite.DB("testfile.sqlite"); DBInterface.execute(db, "SELECT * FROM mytable WHERE gentype = 'direct_air_capture' AND x_1 > 0.5 AND x_2 < 0.5") |> DataFrame; close(db))

        @test dfnew == df
        println("Filesize: $(filesize(filename))")
        push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
    end

    n = "Arrow.jl"
    @testset "$n" begin
        filename = "testfile.arrow"
        println("$n\n", "#"^80)
        println("Saving Table")
        bm_write = @benchmark Arrow.write($filename, $df) setup=(rm($filename, force=true)) evals=1

        println("Loading Table")
        bm_read = @benchmark Arrow.Table($filename) |> DataFrame
        dfnew = Arrow.Table(filename) |> DataFrame

        bm_filter = @benchmark Tables.datavaluerows(Arrow.Table("testfile.arrow")) |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame

        @test dfnew == df
        println("Filesize: $(filesize(filename))")
        push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
    end

    n = "XLSX.jl"
    @testset "$n" begin
        filename = "testfile.xlsx"
        println("$n:\n", "#"^80)
        println("Saving Table")
        bm_write = @benchmark XLSX.writetable($filename, "SHEET1"=>$df) setup=(rm($filename, force=true)) evals=1

        println("Loading Table")
        bm_read = @benchmark XLSX.readtable($filename, "SHEET1") |> DataFrame
        dfnew = XLSX.readtable(filename, "SHEET1") |> DataFrame

        bm_filter = @benchmark XLSX.readtable($filename, "SHEET1") |> DataFrame |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame
        @test dfnew == df

        println("Filesize: $(filesize(filename))")
        push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
    end

    n = "CSV.jl"
    @testset "$n" begin
        filename = "testfile.csv"
        println("$n:\n", "#"^80)
        println("Saving Table")
        bm_write = @benchmark CSV.write($filename, $df) setup=(rm($filename, force=true)) evals=1

        println("Loading Table")
        bm_read = @benchmark CSV.File($filename) |> DataFrame
        dfnew = CSV.File(filename) |> DataFrame

        bm_filter = @benchmark CSV.File("testfile.csv") |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame

        @test dfnew == df
        println("Filesize: $(filesize(filename))")
        push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
    end
    n = "Serialization.jl"
    @testset "$n" begin
        filename = "testfile.jls"
        println("$n\n", "#"^80)
        println("Saving Table")
        bm_write = @benchmark serialize($filename, $df) setup=(rm($filename, force=true)) evals=1

        println("Loading Table")
        bm_read = @benchmark deserialize($filename)
        dfnew = deserialize(filename)

        bm_filter = @benchmark deserialize($filename) |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame

        @test dfnew == df
        println("Filesize: $(filesize(filename))")
        push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
    end
end

transform!(res, 
    "Write Time"=>(x->x./1e9)=>"Write Time",
    "Read Time"=>(x->x./1e9)=>"Read Time",
    "Filter Time"=>(x->x./1e9)=>"Filter Time"
    )
CSV.write("res.csv", res)

Ethan-Russell commented 1 year ago

If we wanted to go the database route, we would want something relational, like SQL.

For DB client, look into:

SQLite
PostGres

For viewers, here are a couple web apps that we could set up and get views into the data:

Looker
Amplitude

A database would allow us to have all results stored in the same place, then query it to get the information that we need. I could imagine us having a query file that we run that retrieves all the typical results we'd want for a set of simulations, or for a single run, etc.

Ethan-Russell commented 1 year ago

I discovered that Serialization.jl (part of julia's standard library) is extremely efficient at saving and loading arbitrary julia objects (including custom things like Modifications). This means, we can save our things at low cost throughout the process. As long as the structs don't change, this should work out of the box! I added it to the comparison above.

It is an opaque file format, meaning we can't view the data unless we deserialize it in julia. That means we may not want to save it to look at, but rather just for manipulating the full results/data after the fact. We'd want to test this first, but, we should be able to even have access to the JuMP model afterwards!

Ethan-Russell commented 1 year ago

In summary, for saving tabular outputs that we want to be able to quickly reference in a human-readable way, we should probably use CSV.jl. For saving arbitrary outputs (i.e. full results/data) that we want to be able to load in again but don't care about opening in excel, we should definitely use Serialization.jl.

sallyrobson commented 1 year ago

Discussion with Molly (12/21/22) (full notes, exerts in other issues as well)

Check that things are going correct in the model, more important internally to E4ST
Standard: Electricity demand, price, capacity, generation mix
Map output with subsections, emissions(?), mapping types of generation but need to be clear about accuracy of generator location
Standardize comparison between multiple runs. Difference in comparison between over time vs between cases. More built in differences.
Table of fixed assumptions good for people who are interested in the results. Built in notes or reasoning for why certain assumptions might be that way (for instance if a type of generation is constrained or not represented, note that it is so that they don't interpret the results wrong). Deviations from base (this will require getting clear on what the base is).
Standard output for the average price and distribution of prices. Time of use or price over time, maximum price (and time?)
Is load curtailed and how much, two axis graph of curtailment and electricity price.
Stacked generation mix over time. Differences between runs, additional capacity.
Generally difference between time series results and case comparison results.
Total and per MW or MWh as easy options
Easy state level representation which can be aggregated up, most people are interested in state or federal policy level. More granular would be mapping.
Good standard bar graph, pie chart, line chart (emissions, costs), tables
Focus on things that we are confident they won't be misinterpreted (try to not include things that are not modeled in the run, ex: don't have DAC in the gen mix graph if it's not turned on, filter out unused things)
Excel is nice for the flexibility to filter and do some analysis with it. Report ready results could be in fixed table/graph but working things should go in a flexible format.
Excel sheets can be intermediate ways to view things and processing can still be done in the code instead.

Ethan-Russell commented 1 year ago

Different uses of outputs:

Sanity check
- Table of fixed assumptions of E4ST
Standard results
- electricity demand, price, generation mix
- Built-in note on zero'd out outputs - i.e. DAC is 0
Map output!!!
Standardized way to compare runs
Distribution of prices
Curtailment by region
Curtailment vs LMP
generation mix over time

Ethan-Russell commented 1 year ago

This is mostly implemented in #50.

[ ] Still need to save the solved model

Ethan-Russell commented 1 year ago

In working on this, I'm making some observations.

Raw Results

The model variable has a ton of information stored inside, and it is pretty slow to access it. All the below examples are with the 3-bus model

julia> @btime value.($model[:egen_gen]);
  17.100 μs (531 allocations: 10.22 KiB)

The model is also very space-intensive. If we extract all the values and shadow prices of all the variables, expressions, and constraints and dump them into a dictionary, raw, that could be accessed in the same way as model, we get the following sizes:

julia> Base.summarysize(model)
420295
julia> Base.summarysize(raw)
10576
julia> Base.summarysize(data)
13300

That shows us that the raw solution is much smaller than the size of the model. That's a big difference. We may wish to save those raw results directly first, rather than saving the entire model. I would propose we save those raw results by default. Then, we should have an easy way to process those results without the model. Maybe a helper function like:

function process_results!(config)
    # Loads the data and raw results from file locations in the config file.
    # Computes aggregated results from config, data, raw, saving things to CSV's where applicable
    # Serializes aggregated results to file.
end

Aggregation

We need some way to aggregate the way results are saved, both in space and in time. In Matlab-based E4ST, we mostly aggregated annually, and had different sheets that aggregated in different regions. Since we now have multiple years, we may have different ways that we want to aggregate, and it will be critical to give the user easy controls to specify how that should work.

Regional aggregation use cases:
- I could see needing to aggregate some information by region and displaying each region
- I could also see the need to only report the aggregated data for a single region
- Aggregate by generator type, or load type, etc.
Temporal aggregation use cases:
- Total - aggregate across all time
- Annual - give the yearly aggregation
- Hourly - Give every time block for each year
- Seasonal - Aggregate across chunks of hours - months, season, etc. for each year

Different types of data need to be aggregated in different ways

Simple sum. Things with quantities in terms of "dollars", "short tons emitted", etc.
Simple average. Not too many of these things that I can think of.
weighted sum. Say I have an emission rate and want to compute overall emissions. I would sum the emission rate weighted by energy generation. I may need to weight by energy generation, power capacity, (anything else?).
weighted average. Say I have an emission rate and want to compute average emission rate. I would sum the emission rate weighted by energy generation, and divide by the total generation. I may need to weight by energy generation, power capacity, (anything else?).

Ethan-Russell commented 1 year ago

Standard outputs for RFF's usage of E4ST (not necessarily everywhere though)

Spatial
- Often reported by state
Temporal
- Not very often that we have needed to aggregate temporally
- For multiyear simulations, probably want to aggregate by year
- May want to aggregate ozone vs non-ozone season
We may want to allow for subsets of the data. For example, aggregating by season, but only want to see winter and summer.

Ethan-Russell commented 1 year ago

I'm thinking about what sort of structure we'd need to represent this.

struct AggregateResult
filename::String
table # to specify which table we are aggregating results for.  Could be bus, branch, or gen.
area # to specify how we would like to aggregate results spatially.  Could be "country", "state", "gentype", or left blank for every index
subarea # to specify which subarea(s) we would like to show the results for.  Default could be "", which would mean all subareas.  Could be a single subarea or vector of subareas.
temporal # to specify how to aggregate temporally.  Could be "total", "yearly", "hourly", or one of the columns of the hours table.
properties # An OrderedDict of properties to compute.
end

sallyrobson commented 1 year ago

We talked about how to deal with capex for existing generators. In current E4ST, CAP_COST and capex for existing gens is set to 0 and only new gens see a CAP_COST (which gets added into the offer price in the capacity expansion). This works for the optimization because existing capacity is constrained to only retire and not grow. The decision to retire a generator doesn't involve capex because that is theoretically still being paid and so only FOM matter for the change in capacity. In the optimization this means that it doesn't need to see capex when deciding to change capacity for existing gens.

For our results processing, we need to decide how we want to handle capex. To me it seems important to have the capex of the start year documented somewhere although I'm not sure how it would be used in the results. One idea is to have another column in the gen table called :capex_obj which is the capex used in the opimization/objective function. For newgens this would be the capex, and for existing/already built gens this would be capex of the sim year (which is also their start year).

Ethan-Russell commented 1 year ago

We talked about how to deal with capex for existing generators. In current E4ST, CAP_COST and capex for existing gens is set to 0 and only new gens see a CAP_COST (which gets added into the offer price in the capacity expansion). This works for the optimization because existing capacity is constrained to only retire and not grow. The decision to retire a generator doesn't involve capex because that is theoretically still being paid and so only FOM matter for the change in capacity. In the optimization this means that it doesn't need to see capex when deciding to change capacity for existing gens.

For our results processing, we need to decide how we want to handle capex. To me it seems important to have the capex of the start year documented somewhere although I'm not sure how it would be used in the results. One idea is to have another column in the gen table called :capex_obj which is the capex used in the opimization/objective function. For newgens this would be the capex, and for existing/already built gens this would be capex of the sim year (which is also their start year).

This would be in welfare/accounting, not necessarily the standard outputs, so I don't think we need it figured out for milestone 1. Another option is to:

Have the capex price per MW capacity, and economic lifetime as inputs to E4ST (rather than hourized annual capex cost)
optimization - The annual capex cost per MW capacity would get calculated using the econ lifetime and discount rate (will be an input already, for the objective function). This is what would get used by unbuilt generators in the objective function. The same value for all years.
accounting - The accounted-for capex cost could be the capex price per MW capacity times the invested capacity. So this would be a single lump-sum payment for the generator, paid in the investment year.

Ethan-Russell commented 1 year ago

This is pretty much finished, a bulky issue, not very helpful to keep around.

e4st-dev / E4ST.jl

Save Output Data #34

Discussion with Molly (12/21/22) (full notes, exerts in other issues as well)

Different uses of outputs:

Raw Results

Aggregation