Closed Ethan-Russell closed 1 year ago
Here is a benchmark of a few common data structures, as well as the file used to make the benchmarks:
name | Read Time | Write Time | Filter Time | File Size |
---|---|---|---|---|
SQLite.jl | 0.46877 | 0.27787475 | 0.11413145 | 2056192 |
Arrow.jl | 0.00214215 | 0.0360178 | 0.1890002 | 1661834 |
XLSX.jl | 1.3234326 | 17.8287769 | 1.5751236 | 2572148 |
CSV.jl | 0.0234176 | 0.0542643 | 0.0393511 | 3888093 |
Serialization.jl | 0.0011739 | 0.0211353 | 0.012326 | 1627562 |
using DataFrames
using BenchmarkTools
using Test
import Arrow
import CSV
import XLSX
import JSON
import BSON
import YAML
import SQLite
import SQLite.DBInterface
using Serialization
using Query
# Create a mix of test data to save.
# mat = rand(2000, 200)
strs = ["nuclear", "geothermal", "natural_gas", "direct_air_capture"]
nr = 2000
nc = 100
df = DataFrame("gentype"=>rand(strs, nr), "idx"=>1:nr, ["x_$n"=>rand(nr) for n in 1:nc]...)
# in_file = "L:/Project-Gurobi/Workspace3/E4ST/Output/Results/prj_aet_mc/aet_init/us_can/result_aet_init_us_can_Y2029_gen_res.xlsx"
# df = XLSX.readtable(in_file, "annual")
res = DataFrame(
"name"=>String[],
"Read Time"=>Float64[],
"Write Time"=>Float64[],
"Filter Time"=>Float64[],
"File Size"=>Int64[]
)
@testset "Test Data Saving" begin
n = "SQLite.jl"
@testset "$n" begin
filename = "testfile.sqlite"
println("$n\n", "#"^80)
println("Saving Table")
bm_write = @benchmark (db = SQLite.DB($filename); $df |> SQLite.load!(db, "mytable"); close(db)) setup=(rm($filename, force=true))
println("Loading Table")
bm_read = @benchmark (db=SQLite.DB($filename); DBInterface.execute(db, "SELECT * FROM mytable") |> DataFrame; close(db))
db = SQLite.DB(filename)
dfnew = DBInterface.execute(db, "SELECT * FROM mytable") |> DataFrame
close(db)
bm_filter = @benchmark (db = SQLite.DB("testfile.sqlite"); DBInterface.execute(db, "SELECT * FROM mytable WHERE gentype = 'direct_air_capture' AND x_1 > 0.5 AND x_2 < 0.5") |> DataFrame; close(db))
@test dfnew == df
println("Filesize: $(filesize(filename))")
push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
end
n = "Arrow.jl"
@testset "$n" begin
filename = "testfile.arrow"
println("$n\n", "#"^80)
println("Saving Table")
bm_write = @benchmark Arrow.write($filename, $df) setup=(rm($filename, force=true)) evals=1
println("Loading Table")
bm_read = @benchmark Arrow.Table($filename) |> DataFrame
dfnew = Arrow.Table(filename) |> DataFrame
bm_filter = @benchmark Tables.datavaluerows(Arrow.Table("testfile.arrow")) |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame
@test dfnew == df
println("Filesize: $(filesize(filename))")
push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
end
n = "XLSX.jl"
@testset "$n" begin
filename = "testfile.xlsx"
println("$n:\n", "#"^80)
println("Saving Table")
bm_write = @benchmark XLSX.writetable($filename, "SHEET1"=>$df) setup=(rm($filename, force=true)) evals=1
println("Loading Table")
bm_read = @benchmark XLSX.readtable($filename, "SHEET1") |> DataFrame
dfnew = XLSX.readtable(filename, "SHEET1") |> DataFrame
bm_filter = @benchmark XLSX.readtable($filename, "SHEET1") |> DataFrame |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame
@test dfnew == df
println("Filesize: $(filesize(filename))")
push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
end
n = "CSV.jl"
@testset "$n" begin
filename = "testfile.csv"
println("$n:\n", "#"^80)
println("Saving Table")
bm_write = @benchmark CSV.write($filename, $df) setup=(rm($filename, force=true)) evals=1
println("Loading Table")
bm_read = @benchmark CSV.File($filename) |> DataFrame
dfnew = CSV.File(filename) |> DataFrame
bm_filter = @benchmark CSV.File("testfile.csv") |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame
@test dfnew == df
println("Filesize: $(filesize(filename))")
push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
end
n = "Serialization.jl"
@testset "$n" begin
filename = "testfile.jls"
println("$n\n", "#"^80)
println("Saving Table")
bm_write = @benchmark serialize($filename, $df) setup=(rm($filename, force=true)) evals=1
println("Loading Table")
bm_read = @benchmark deserialize($filename)
dfnew = deserialize(filename)
bm_filter = @benchmark deserialize($filename) |> @filter(_.gentype == "direct_air_capture" && _.x_1 > 0.5 && _.x_2 < 0.5) |> DataFrame
@test dfnew == df
println("Filesize: $(filesize(filename))")
push!(res, (n, median(bm_read.times), median(bm_write.times), median(bm_filter.times), filesize(filename)))
end
end
transform!(res,
"Write Time"=>(x->x./1e9)=>"Write Time",
"Read Time"=>(x->x./1e9)=>"Read Time",
"Filter Time"=>(x->x./1e9)=>"Filter Time"
)
CSV.write("res.csv", res)
If we wanted to go the database route, we would want something relational, like SQL.
For DB client, look into:
For viewers, here are a couple web apps that we could set up and get views into the data:
A database would allow us to have all results stored in the same place, then query it to get the information that we need. I could imagine us having a query file that we run that retrieves all the typical results we'd want for a set of simulations, or for a single run, etc.
I discovered that Serialization.jl (part of julia's standard library) is extremely efficient at saving and loading arbitrary julia objects (including custom things like Modifications). This means, we can save our things at low cost throughout the process. As long as the structs don't change, this should work out of the box! I added it to the comparison above.
It is an opaque file format, meaning we can't view the data unless we deserialize it in julia. That means we may not want to save it to look at, but rather just for manipulating the full results/data after the fact. We'd want to test this first, but, we should be able to even have access to the JuMP model afterwards!
In summary, for saving tabular outputs that we want to be able to quickly reference in a human-readable way, we should probably use CSV.jl. For saving arbitrary outputs (i.e. full results/data) that we want to be able to load in again but don't care about opening in excel, we should definitely use Serialization.jl.
This is mostly implemented in #50.
In working on this, I'm making some observations.
The model
variable has a ton of information stored inside, and it is pretty slow to access it. All the below examples are with the 3-bus model
julia> @btime value.($model[:egen_gen]);
17.100 μs (531 allocations: 10.22 KiB)
The model
is also very space-intensive. If we extract all the values and shadow prices of all the variables, expressions, and constraints and dump them into a dictionary, raw
, that could be accessed in the same way as model
, we get the following sizes:
julia> Base.summarysize(model)
420295
julia> Base.summarysize(raw)
10576
julia> Base.summarysize(data)
13300
That shows us that the raw solution is much smaller than the size of the model. That's a big difference. We may wish to save those raw results directly first, rather than saving the entire model. I would propose we save those raw results by default. Then, we should have an easy way to process those results without the model. Maybe a helper function like:
function process_results!(config)
# Loads the data and raw results from file locations in the config file.
# Computes aggregated results from config, data, raw, saving things to CSV's where applicable
# Serializes aggregated results to file.
end
We need some way to aggregate the way results are saved, both in space and in time. In Matlab-based E4ST, we mostly aggregated annually, and had different sheets that aggregated in different regions. Since we now have multiple years, we may have different ways that we want to aggregate, and it will be critical to give the user easy controls to specify how that should work.
Different types of data need to be aggregated in different ways
Standard outputs for RFF's usage of E4ST (not necessarily everywhere though)
I'm thinking about what sort of structure we'd need to represent this.
struct AggregateResult
filename::String
table # to specify which table we are aggregating results for. Could be bus, branch, or gen.
area # to specify how we would like to aggregate results spatially. Could be "country", "state", "gentype", or left blank for every index
subarea # to specify which subarea(s) we would like to show the results for. Default could be "", which would mean all subareas. Could be a single subarea or vector of subareas.
temporal # to specify how to aggregate temporally. Could be "total", "yearly", "hourly", or one of the columns of the hours table.
properties # An OrderedDict of properties to compute.
end
We talked about how to deal with capex for existing generators. In current E4ST, CAP_COST and capex for existing gens is set to 0 and only new gens see a CAP_COST (which gets added into the offer price in the capacity expansion). This works for the optimization because existing capacity is constrained to only retire and not grow. The decision to retire a generator doesn't involve capex because that is theoretically still being paid and so only FOM matter for the change in capacity. In the optimization this means that it doesn't need to see capex when deciding to change capacity for existing gens.
For our results processing, we need to decide how we want to handle capex. To me it seems important to have the capex of the start year documented somewhere although I'm not sure how it would be used in the results. One idea is to have another column in the gen table called :capex_obj which is the capex used in the opimization/objective function. For newgens this would be the capex, and for existing/already built gens this would be capex of the sim year (which is also their start year).
We talked about how to deal with capex for existing generators. In current E4ST, CAP_COST and capex for existing gens is set to 0 and only new gens see a CAP_COST (which gets added into the offer price in the capacity expansion). This works for the optimization because existing capacity is constrained to only retire and not grow. The decision to retire a generator doesn't involve capex because that is theoretically still being paid and so only FOM matter for the change in capacity. In the optimization this means that it doesn't need to see capex when deciding to change capacity for existing gens.
For our results processing, we need to decide how we want to handle capex. To me it seems important to have the capex of the start year documented somewhere although I'm not sure how it would be used in the results. One idea is to have another column in the gen table called :capex_obj which is the capex used in the opimization/objective function. For newgens this would be the capex, and for existing/already built gens this would be capex of the sim year (which is also their start year).
This would be in welfare/accounting, not necessarily the standard outputs, so I don't think we need it figured out for milestone 1. Another option is to:
This is pretty much finished, a bulky issue, not very helpful to keep around.
Determine what is relevant data to output, and output to a helpful data structure.