etsap-TIMES / xl2times

Open source tool to convert TIMES models specified in Excel

https://xl2times.readthedocs.io/

MIT License

[RFC] Order-dependent transformations and internal data structures #41

Open siddharth-krishna opened 1 year ago

siddharth-krishna commented 1 year ago

There are some transformations (such as update tables and insertions) that are order-dependent (in that the output depends on the order of processing of the Excel files). Moreover, Excel files can be categorised into "types" or sets, e.g. the base VT files. Some computations should only look at tables in other files from the same set.

We probably need to change the type of our transforms from List[EmbeddedXlTable] -> List[EmbeddedXlTable] to something with a bit more structure. Perhaps we start off with a list of sets of Excel files, then eventually move to a list of tables? Or we have a new class/datatype called e.g. TimesModelData which has all the appropriate fields.
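To make the TimesModelData idea a little more concrete, here is a minimal sketch. Everything in it is an assumption for discussion: the `FileSet` labels, the simplified `Table` stand-in for EmbeddedXlTable, and the field and method names are all hypothetical, not the current xl2times API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class FileSet(Enum):
    """Hypothetical labels for the "types" of Excel files."""

    BASE = "base"
    OTHER = "other"


@dataclass
class Table:
    """Simplified stand-in for EmbeddedXlTable (assumed fields only)."""

    filename: str
    tag: str
    rows: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class TimesModelData:
    """Tables grouped by file set, with the file processing order kept explicit."""

    file_order: List[str] = field(default_factory=list)
    tables_by_set: Dict[FileSet, List[Table]] = field(default_factory=dict)

    def add(self, file_set: FileSet, table: Table) -> None:
        # Record each file the first time we see it, preserving read order.
        if table.filename not in self.file_order:
            self.file_order.append(table.filename)
        self.tables_by_set.setdefault(file_set, []).append(table)

    def tables_in(self, file_set: FileSet) -> List[Table]:
        # Return a copy so a transform only sees tables from the requested set.
        return list(self.tables_by_set.get(file_set, []))
```

A transform could then take and return a TimesModelData and ask for `tables_in(FileSet.BASE)` when it should only look at tables from the same set.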

We also need a command line input (or input from a file) that specifies for each run of our tool, the subset of Excel files from the input directory that should be read, and in what order. We can start off by piggybacking on the existing Veda JSON file, but it would be good to specify this input ourselves precisely.
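As a starting point for the command line input, something like the following might work. It is only a sketch using the standard argparse module; the `--files` flag name and the `parse_args` helper are hypothetical and not existing options of the tool.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Sketch: choose which Excel files to read, and in what order"
    )
    parser.add_argument("input_dir", help="directory containing the model Excel files")
    # Hypothetical flag: the files are processed in exactly the order given here.
    parser.add_argument(
        "--files",
        nargs="+",
        default=[],
        help="Excel files (relative to input_dir) to process, in order",
    )
    return parser.parse_args(argv)
```

For example, `parse_args(["model", "--files", "SysSettings.xlsx", "VT_A.xlsx"])` yields `files == ["SysSettings.xlsx", "VT_A.xlsx"]`, with the order preserved as typed.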

Thoughts, @olejandro and @samwebster ?

olejandro commented 1 year ago

Thanks @siddharth-krishna! For the sets of Excel files we could use the categories from #39:

BASE: BY (./VT_*.xlsx) & BY_TRANS (./BY_Trans.xlsx)
SysSettings: (./SysSettings.xlsx)
SR_*: SubRES (./SubRESTMPL/SubRES*.xlsx) & SR_Trans (./SubRESTMPL/SubRES*_Trans.xlsx)
RegScen: (./SuppXLS/Scen_*.xlsx)
ParScen: (./SuppXLS/Scen_Par-*.xlsx)
TradeScen: (./SuppXLS/Trades/ScenTrade_*.xlsx)
SetRules: (./SetRules.xlsx)

When processing, we probably should skip ParScen (intended for generating multiple runs by varying certain data) and SetRules (intended mostly for creating sets for reporting results).
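The categories above could be assigned with simple glob matching on paths relative to the model folder. This is a rough sketch: the `PATTERNS` table, `classify` and `files_to_process` are hypothetical names, and the set label "SR" stands in for the per-SubRES SR_* sets. Note that ParScen has to be checked before RegScen, because Scen_Par-*.xlsx also matches Scen_*.xlsx.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# (set name, glob pattern) pairs, checked in order; more specific patterns first.
PATTERNS = [
    ("BASE", "VT_*.xlsx"),
    ("BASE", "BY_Trans.xlsx"),
    ("SysSettings", "SysSettings.xlsx"),
    ("SR", "SubRESTMPL/SubRES*.xlsx"),  # also covers SubRES*_Trans.xlsx
    ("ParScen", "SuppXLS/Scen_Par-*.xlsx"),  # must precede RegScen
    ("RegScen", "SuppXLS/Scen_*.xlsx"),
    ("TradeScen", "SuppXLS/Trades/ScenTrade_*.xlsx"),
    ("SetRules", "SetRules.xlsx"),
]

# Sets we would skip when processing, as suggested above.
SKIPPED = {"ParScen", "SetRules"}


def classify(path: str) -> Optional[str]:
    """Return the set name for a forward-slash relative path, or None."""
    for set_name, pattern in PATTERNS:
        if fnmatchcase(path, pattern):
            return set_name
    return None


def files_to_process(paths: List[str]) -> List[str]:
    """Keep only recognised files whose set is not skipped, preserving order."""
    kept = []
    for path in paths:
        set_name = classify(path)
        if set_name is not None and set_name not in SKIPPED:
            kept.append(path)
    return kept
```

One caveat: in fnmatch-style patterns `*` also matches `/`, so a stricter version might want to compare directory and file name separately.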

olejandro commented 1 year ago

For the order, we could just have a list of file names that should be processed, keeping the right order. In Veda, BASE is taken as all the VT files & BY_Trans in the root of the model folder. We could either do the same or make it explicit what's included.

olejandro commented 1 year ago

We should also include a list containing the names of regions that should be processed.

siddharth-krishna commented 1 year ago

I'm happy to piggyback on any existing files that specify the input files and their order. Where does this RUN file live? I don't see it in any of the Demos (e.g. in the demos-xlsx repository) or in the Ireland model.

Does the following first step make sense: add an input to our tool which is a list of Excel files to be processed in order. Use this order in all the existing transforms.
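A minimal sketch of what that first step could look like, assuming each table remembers which file it came from. The `Table` tuple and `order_tables` function are hypothetical and only illustrate the idea; they are not existing xl2times code.

```python
from typing import List, NamedTuple


class Table(NamedTuple):
    """Simplified stand-in for EmbeddedXlTable (assumed fields only)."""

    filename: str
    tag: str


def order_tables(tables: List[Table], file_order: List[str]) -> List[Table]:
    """Keep tables from the listed files only, sorted by the given file order."""
    rank = {name: position for position, name in enumerate(file_order)}
    selected = [t for t in tables if t.filename in rank]
    # sorted() is stable, so tables from the same file keep their relative order.
    return sorted(selected, key=lambda t: rank[t.filename])
```

Running this once, right after reading the Excel files, would mean every existing transform sees the tables in the user-specified order without further changes.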

For the second step, what transforms need to be modified to only look at Excel files in a particular order / from a certain set?

olejandro commented 1 year ago

The contents of this file are similar to those of a RUN file used by Veda. They are also consistent with the info in Groups.json for a particular scenario, e.g. https://github.com/MaREI-EPMG/times-ireland-model/blob/main/AppData/Groups.json. Actually this order is only important for GAMS, because some info could be overwritten by files that are read later. What's important for us from there at this stage is just the list of files used in a specific scenario. As far as I understand, the way we assess whether our tool processes the model files correctly doesn't suffer from the inability to account for the order in the Groups.json (or RUN) file, because we only eliminate exact duplicates when the DD files are read for comparison purposes (@siddharth-krishna please confirm).

For processing purposes, we do need to take care of the alphabetical order of the files though, because e.g. a TFM_UPD table in a scenario file should generate output that is based on BASE, SubRES and scenario files that come before it. Actually, we may run into an inconsistency with Veda here, because Veda synchronises all the files selected by a user (can be different from what's in Groups.json) and then generates DD files based on a subset of them. However this may include output that is generated, based on files that are not part of a specific scenario. We won't have this problem if we don't include any scenario management capabilities in the tool (i.e. if we generate DD files based on a subset of model xlsx files every time). (@Antti-L please correct me if I am wrong in anything)
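To illustrate the ordering point, here is a toy sketch in which an update only sees values from files that come before it. It is heavily simplified and rests on assumptions: records are keyed by (attribute, process) only, an "upd" operation just multiplies the latest value seen so far, and `process_in_order` is a hypothetical helper, not how TFM_UPD is actually implemented in the tool or in Veda.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (attribute, process): a simplified record key
Op = Tuple[str, str, str, float]  # (kind, attribute, process, number)


def process_in_order(
    files: Dict[str, List[Op]], base_files: List[str]
) -> List[Tuple[str, Key, float]]:
    """Process base files first, then the remaining files alphabetically."""
    seen: Dict[Key, float] = {}
    emitted: List[Tuple[str, Key, float]] = []
    scenario_files = sorted(name for name in files if name not in base_files)
    for name in list(base_files) + scenario_files:
        for kind, attribute, process, number in files[name]:
            key = (attribute, process)
            if kind == "set":
                value = number
            elif kind == "upd" and key in seen:
                # The update is seeded by the latest value from earlier files only.
                value = seen[key] * number
            else:
                continue  # nothing to update yet, so the operation has no effect
            seen[key] = value
            emitted.append((name, key, value))
    return emitted
```

With a base value of 10 in a VT file, a value of 20 set in Scen_A and a value of 5 set in Scen_C, an update by a factor of 2 in Scen_B would produce 40: it is seeded by Scen_A, and Scen_C is not visible to it.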

olejandro commented 1 year ago

Actually, on the last point, from https://veda-documentation.readthedocs.io/en/latest/pages/Migration.html: "If qualifying values exist in multiple scenarios, only ones from the “last scenario”, like seed values for UPD/MIG tables, will be returned."

Antti-L commented 6 months ago

@Olejandro: I only now noticed that you wanted me to comment, if relevant. :smile:

> we do need to take care of the alphabetical order of the files though, because e.g. a TFM_UPD table in a scenario file should generate output that is based on BASE, SubRES and scenario files that come before it.

Yes, but note that TFM_UPDs may also be in Base and SubRES transformation files, where they update only Base/SubRES data.

> Actually, we may run into an inconsistency with Veda here, because Veda synchronises all the files selected by a user (can be different from what's in Groups.json) and then generates DD files based on a subset of them. However this may include output that is generated, based on files that are not part of a specific scenario. We won't have this problem if we don't include any scenario management capabilities in the tool (i.e. if we generate DD files based on a subset of model xlsx files every time).

Well, I think you should process all model Excel templates, and generate DD files based on a subset of scenarios. That subset may depend on data in files that are not themselves included in the run, and so those other files would need to be processed anyway, would you not agree? I am not sure what you mean by "we won't have this problem if we don't include any scenario management capabilities in the tool". Does that mean that the xl2times tool would not support models where the data for a run case would depend on data in other model templates?

Antti-L commented 6 months ago

@olejandro

> Actually this order is only important for GAMS, because some info could be overwritten by files that are read later. What's important for us from there at this stage is just the list of files used in a specific scenario. As far as I understand, the way we assess whether our tool processes the model files correctly doesn't suffer from inability to account for order in Groups.json (or RUN) file, because we only eliminate exact duplicates when the dd files are read for comparison purposes

I am a bit confused by what you say here. If the tool is supposed to write the data out for GAMS correctly, the order would be essential. If you eliminate duplicates across scenarios, they should be eliminated taking that order into account. Otherwise, you should write the data for each scenario separately (in DD files, like VEDA) or in separate GAMS data blocks. However, my understanding is that, at least currently, only writing all data to a single file is supported. Is that correct? If so, are duplicates across scenarios currently eliminated, and in which order?
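For reference, order-aware elimination could be sketched as a "last scenario wins" merge, mirroring how data read later overwrites earlier data. This is only an illustration under assumptions: the record layout (attribute, index tuple, value) and the `merge_last_wins` name are hypothetical, and it says nothing about what the tool currently does.

```python
from typing import Dict, List, Tuple

Index = Tuple[str, ...]
Record = Tuple[str, Index, float]  # (attribute, index tuple, value)


def merge_last_wins(
    scenarios: List[Tuple[str, List[Record]]]
) -> Dict[Tuple[str, Index], Tuple[str, float]]:
    """Merge records in scenario order; a later scenario overwrites an earlier one."""
    merged: Dict[Tuple[str, Index], Tuple[str, float]] = {}
    for scenario_name, records in scenarios:
        for attribute, index, value in records:
            # Same attribute and indexes: the later scenario's value replaces the old one.
            merged[(attribute, index)] = (scenario_name, value)
    return merged
```

Keeping the scenario name next to each value makes it possible to check afterwards which file a surviving value came from. By contrast, dropping only exact duplicates would leave both conflicting values in the output.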

olejandro commented 6 months ago

Thanks for your comments @Antti-L. I'll soon be picking up the development of this functionality, so they'll come in handy. I'll reference this issue in any related PR to make it easy to track where we are with this.