From Sam:
For the parameters table, after the reference case with fixed values, all of the case values [0-23] are listed again, now with triangular distributions provided in most cases. However, the changes from case to case (e.g., Power Electronics 1, Power Electronics 2, Soft Costs 1, Soft Costs 2, etc.) are only made in specific factors for that case (e.g., inverter capital, lifetime, and efficiency advancing to mid and high performance, as highlighted below). How difficult would it be to limit the XLSX spreadsheet changes for the cases to only those values that change from case to case, using the Offset to point to the specific corresponding variable? This would greatly reduce the length of the spreadsheet and make it much easier to identify the changes.
Investigation notes
Set up branch issue-151 for exploratory coding and debugging.
Structure
Flow for this functionality would be:
User creates input datasets: The user would either be required to include a base case (a designated Tranche name would likely be mandatory) or to completely specify every design, including values that are identical to the base case (i.e., how the datasets currently work).
Potential for human error: A user could neglect to add values to the non-base-case designs when those values actually differ, resulting in values that should differ from the base case remaining identical to it. This is less likely when values are derived from expert elicitation, and potentially more likely when values are point estimates pulled from literature or similar.
A check is run during data validation to determine whether value duplication is needed. If yes, the duplication feature is executed and a "filled in" version of the input data file is created and saved to the same location as the original.
Possible synergy with #152: We don't want to force users to acknowledge or confirm the filled-in data, but with the input data viz functionality, users would be able to verify that the non-base-case designs are specified correctly.
Once the non-base-case designs have been completed, they're instantiated within Tyche as well as saved to XLSX. At this point analysis can begin and proceed as currently implemented.
How?
Idea 1 - this is the better idea:
Split the input dataset into separate DataFrames, one per design, keyed by Tranche
An error should still be thrown if there isn't at least one row associated with each Tranche in designs and parameters
If there's no data at all for a design associated with a Tranche, then that Tranche by definition is the base case (we'd be saying implicitly that the result of that Tranche is no change in the technology - in which case why would we evaluate it?)
Could draw on the vectorize_xyz methods throughout IO.py to do some of the matching/identification of missing data
DataFrame join between the base case and every other design, using the indexes from the base case (left join)
The join creates rows with NaN values under the non-base-case design, as a new column. Backfill these with values from the base case.
Melt the joined data structure so there's one column with all the design Values
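The split/join/backfill steps in Idea 1 could be sketched roughly as follows in pandas. This assumes a simplified, hypothetical schema with only Tranche, Variable, and Value columns (the real designs and parameters tables carry more index columns, e.g. Offset), and a literal "Base Case" Tranche name:

```python
import pandas as pd

# Toy designs table: the base case specifies every Variable; the
# non-base-case design ("PE 1") only specifies the value that changes.
designs = pd.DataFrame({
    "Tranche":  ["Base Case", "Base Case", "Base Case",  "PE 1"],
    "Variable": ["Capital",   "Lifetime",  "Efficiency", "Capital"],
    "Value":    [100.0,       25.0,        0.95,         80.0],
})

base = designs[designs["Tranche"] == "Base Case"].set_index("Variable")

filled_parts = [designs[designs["Tranche"] == "Base Case"]]
for tranche, group in designs[designs["Tranche"] != "Base Case"].groupby("Tranche"):
    # Left join on the base case's index: every base-case Variable appears,
    # with NaN in the new column wherever the design omitted a value.
    joined = base[["Value"]].join(
        group.set_index("Variable")[["Value"]], rsuffix="_design"
    )
    # Backfill the NaN gaps with the base-case values.
    merged = joined["Value_design"].fillna(joined["Value"]).rename("Value")
    part = merged.reset_index()
    part["Tranche"] = tranche
    filled_parts.append(part)

filled = pd.concat(filled_parts, ignore_index=True)
print(filled)
```

Because this keeps the data in long format (one Value column), the explicit melt step mostly disappears; with the real multi-index tables, a melt after the join would serve the same purpose.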
Idea 2 (more manual and probably slower version of Idea 1):
Add rows for non-base-case designs until every design has every component in it (Variables, Names, Indexes, Offsets, etc.)
Then backfill using values from base case
What?
(which datasets need this functionality)
designs
parameters
Where?
(where should the code for this functionality be located)
IO.py
Can integrate with the check_tables method - if certain checks fail, that can trigger duplication. Other checks failing could still mean invalid data (e.g., if the base case design doesn't have mandatory Variables)
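The detection logic that would gate the duplication feature might look like the sketch below. The function name needs_duplication, the "Base Case" Tranche name, and the two-column schema are all assumptions for illustration, not the actual IO.py API:

```python
import pandas as pd

def needs_duplication(designs: pd.DataFrame, base_name: str = "Base Case") -> bool:
    """Hypothetical check: True only when a base-case Tranche exists AND
    at least one other Tranche is missing Variables the base case defines."""
    if base_name not in designs["Tranche"].values:
        # No base case: run the existing validations unchanged and
        # never trigger duplication.
        return False
    base_vars = set(designs.loc[designs["Tranche"] == base_name, "Variable"])
    for _, group in designs[designs["Tranche"] != base_name].groupby("Tranche"):
        # Any base-case Variable absent from this design means it
        # needs to be filled in from the base case.
        if base_vars - set(group["Variable"]):
            return True
    return False

# Toy example: "PE 1" omits Lifetime, so duplication is needed.
designs = pd.DataFrame({
    "Tranche":  ["Base Case", "Base Case", "PE 1"],
    "Variable": ["Capital",   "Lifetime",  "Capital"],
})
```

A check like this could sit alongside check_tables: a failing "design is incomplete" check triggers duplication instead of raising, while other failures still surface as invalid data.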
Steps
[ ] Review the input data validations and flag any that need to be altered, removed, or replaced to work with this functionality (consider running the full set of validations even on datasets that have been run through duplication, as a check on the duplication during dev)
[ ] Add a check during input data validation: is there a "Base Case" Tranche? If not, the rest of the validations are run as they currently stand and the duplication functionality is not triggered.
[ ] Add in detection logic to trigger the duplication functionality (only run duplication if there is a base case and if non-base-case designs are missing information that is in the base case)
[ ] Create a method to house the duplication functionality
[ ] Write split/join/backfill code: split on Technology-Tranche, join Base Case with all other Tranche designs, backfill missing data with the corresponding info from the Base Case, then concat back to the full designs and parameters dataset
Time estimate
5-8 hours of dev work, including writing the code, creating a test dataset, and initial debugging
Another 1-2 hours of additional testing and debugging