OSeMOSYS / OSeMOSYS_PuLP

OSeMOSYS_PuLP: A Stochastic Modelling Framework for Long-Term Energy Systems Modeling

Suggestions to simplify code, enhance performance, improve interoperability, etc. #5

Open willu47 opened 4 years ago

willu47 commented 4 years ago

I've been looking into using OSeMOSYS_PuLP to facilitate a project which needs some Monte Carlo runs. I've hit a roadblock when trying to load data from another model programmatically. The internal data structure has been quite difficult to understand, and even once I had figured it out, it proved difficult to modify cleanly.

Recommend using a dict of pandas DataFrames for parameters and sets

Currently, the data is read into one giant ragged pandas DataFrame that contains many empty columns, with the different parameters stacked as rows. This is not memory efficient: 13 columns of memory are reserved for every row, regardless of how many indices the parameter on that row actually needs.

I would recommend using a dictionary of DataFrames, where the key is the parameter or set name and the value is the pandas DataFrame. Keep the long format you are currently using, so that for AccumulatedAnnualDemand, for example, you would have four columns: REGION, FUEL, YEAR and VALUE.
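
For illustration, a minimal sketch of what that structure might look like (the set members and values here are invented):

```python
import pandas as pd

# Hypothetical data: one DataFrame per set or parameter, keyed by name
data = {
    "REGION": pd.DataFrame({"VALUE": ["SIMPLICITY"]}),
    "YEAR": pd.DataFrame({"VALUE": list(range(2015, 2031))}),
    "AccumulatedAnnualDemand": pd.DataFrame({
        "REGION": ["SIMPLICITY", "SIMPLICITY"],
        "FUEL": ["ELC", "ELC"],
        "YEAR": [2015, 2016],
        "VALUE": [10.0, 11.5],
    }),
}

# Each parameter is then a direct lookup by name, rather than a filter
# over a giant ragged frame
demand = data["AccumulatedAnnualDemand"]
```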

This would also match the format used by otoole, which provides a number of handy functions for importing data from GNU MathProg models, and would make it very easy to begin using OSeMOSYS_PuLP with data from other implementations.

Code structure

The createParameter, createTuple and createVariable helper functions exist primarily to deal with the storage of all the model elements in this giant DataFrame. These functions also obscure the otherwise straightforward syntax offered by PuLP.

The SETS and PARAMETERS AND DATA sections are essentially more data munging: getting the data into shape before passing it into the model. This could all be placed in a load_data function, as sketched below.
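
As a sketch of what I mean, assuming (hypothetically) one long-format CSV file per set or parameter:

```python
from pathlib import Path

import pandas as pd

def load_data(path: str) -> dict:
    """Read every set and parameter into a dict of DataFrames,
    keyed by set/parameter name (taken here from the file stem)."""
    return {csv.stem: pd.read_csv(csv) for csv in Path(path).glob("*.csv")}
```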

I'm a bit suspicious of PERMUTATION OF SETS because it looks like it creates the dense cross-product of all the sets, which means that a dense matrix of constraints is generated later on. If so, this is going to be extremely memory inefficient, and many constraints will be created that are never actually used. The solver then needs to remove these before it can solve the model, hurting performance and limiting PuLP's ability to scale to larger models.
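
To give a sense of the scale, with hypothetical but modest set sizes:

```python
# Hypothetical set sizes for a modest model
regions, techs, fuels, timeslices, years = 2, 30, 15, 8, 30

# Dense cross-product: one candidate index tuple per combination
print(regions * techs * fuels * timeslices * years)  # 216000
# ...while the actual data typically covers only a small fraction of these
```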

Model construction

In this section, you create a big dict containing the specification of every variable, and then create the variables with the createVariable function. Why not just call pulp.LpVariable directly? And do you really need to pass the cross-product of the sets into variable creation?
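
For example, pulp.LpVariable.dicts builds a nested dict of variables directly from the index sets (the set members here are invented):

```python
import pulp

REGION = ["SIMPLICITY"]
TECHNOLOGY = ["GAS_PLANT", "WIND"]
YEAR = list(range(2015, 2031))

# One nested dict of LpVariables, indexed as NewCapacity[r][t][y]
NewCapacity = pulp.LpVariable.dicts(
    "NewCapacity", (REGION, TECHNOLOGY, YEAR), lowBound=0, cat="Continuous"
)
```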

For the constraint generation, is it necessary to iterate over the cross-product of the indexes? In many cases, you only need to generate constraints for the indices for which you have data.
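
A sketch, assuming the dict-of-DataFrames layout above and a hypothetical `production` variable dict and `model` LpProblem (the real OSeMOSYS demand constraint is more involved):

```python
# Only build constraints for (region, fuel, year) tuples that appear
# in the data, rather than for the whole cross-product of the sets
for row in data["AccumulatedAnnualDemand"].itertuples():
    model += (
        production[row.REGION][row.FUEL][row.YEAR] >= row.VALUE,
        f"AccumulatedDemand_{row.REGION}_{row.FUEL}_{row.YEAR}",
    )
```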

Also, it might simplify things to create a function for each constraint, returning a pulp.LpConstraint. That would also give you a natural place to document what each constraint does.
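
Something along these lines, where the constraint body is a simplified placeholder:

```python
import pulp

def accumulated_annual_demand(production, region, fuel, year, demand) -> pulp.LpConstraint:
    """Production of `fuel` in `region` and `year` must meet the
    accumulated annual demand. (Simplified: the real OSeMOSYS rule
    also aggregates over time slices and modes of operation.)"""
    return production[region][fuel][year] >= demand
```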

I also recommend wrapping the entire model in a create_model function which takes a package of data (e.g. the dict of DataFrames) as an argument and returns the model. This would allow the model to be placed in a separate file, and would make a clearer distinction between data management, running the model, and the model formulation itself.
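
That is, roughly (using the hypothetical load_data helper sketched above):

```python
import pulp

def create_model(data: dict) -> pulp.LpProblem:
    """Build the model from a dict of set/parameter DataFrames.
    Variable and constraint creation are elided in this sketch."""
    model = pulp.LpProblem("osemosys", pulp.LpMinimize)
    # ... create variables from the sets in `data` ...
    # ... set the objective (total discounted cost) ...
    # ... add constraints, each built by its own documented function ...
    return model

# Data handling, model formulation and solving stay cleanly separated:
data = load_data("path/to/csvs")
model = create_model(data)
model.solve()
```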