OSeMOSYS / otoole

OSeMOSYS Tools for Energy
https://otoole.readthedocs.io
MIT License
23 stars 17 forks source link

Extra column in CSV file, but not in config fails to raise error #156

Closed willu47 closed 1 year ago

willu47 commented 1 year ago

Set up

config.yaml:

Test:
    indices: [REGION]
    type: param
    dtype: float
    default: 0.05

data/Test.csv file:

REGION,TECHNOLOGY,VALUE
SIMPLICITY,BLA,0.05

Expected behaviour

After running

otoole convert csv datafile data test.txt config.yaml

I would expect an error to be raised because the csv file does not match the config.

Actual behaviour

The command runs successfully. The results datafile contains only the region column. Note that if there are multiple rows, then the region column in this case would contain duplicate values.

trevorb1 commented 1 year ago

Hi @willu47! I was looking at this issue today, and have an implementation question I am hoping we can discuss!

My first thought was to expand the base ReadStrategy method, _check_index( .. ), to check both the datatypes and column headers of the input data. An (untested) thought on how to do this is shown in the code snippet below.

https://github.com/trevorb1/otoole/blob/f9b02733db62aa0da983c34168c54c63110c4657/src/otoole/input.py#L367-L395

The advantages of this implementation is that all further read strategies can use this method. The disadvantage is that all the data is first read in, then checked; this might not be ideal for very large data files.

Moreover, for some read strategies (specifically, anything under the ReadTabular class), the _check_parameter( .. ) method does some of this logic. It will identify if input data has an extra header not in the config file and automatically drops it. This logic only goes one way though; if a header is identified in the config file but is not in the input data, a KeyError is raised.

My question is: do you think proceeding with expanding the _check_index( .. ) method under the ReadStrategy class is the best option here? Or do you think adding logic in each specific read strategy is better (ie. make the _check_parameter( .. ) method more robust, and adding a similar one for ReadDataFile).

I think expanding the _check_index( .. ) method, and removing the checks from _check_parameter( .. ), is better, even if we take a slight efficiency hit. But I just want to get a second opinion before putting more work into this :) Thanks so much!