Extra column in CSV file, but not in config fails to raise error

Hi @willu47! I was looking at this issue today, and have an implementation question I am hoping we can discuss!

My first thought was to expand the base ReadStrategy method, _check_index( .. ), to check both the datatypes and column headers of the input data. An (untested) thought on how to do this is shown in the code snippet below.

https://github.com/trevorb1/otoole/blob/f9b02733db62aa0da983c34168c54c63110c4657/src/otoole/input.py#L367-L395

The advantages of this implementation is that all further read strategies can use this method. The disadvantage is that all the data is first read in, then checked; this might not be ideal for very large data files.

Moreover, for some read strategies (specifically, anything under the ReadTabular class), the _check_parameter( .. ) method does some of this logic. It will identify if input data has an extra header not in the config file and automatically drops it. This logic only goes one way though; if a header is identified in the config file but is not in the input data, a KeyError is raised.

My question is: do you think proceeding with expanding the _check_index( .. ) method under the ReadStrategy class is the best option here? Or do you think adding logic in each specific read strategy is better (ie. make the _check_parameter( .. ) method more robust, and adding a similar one for ReadDataFile).

I think expanding the _check_index( .. ) method, and removing the checks from _check_parameter( .. ), is better, even if we take a slight efficiency hit. But I just want to get a second opinion before putting more work into this :) Thanks so much!

OSeMOSYS / otoole

Extra column in CSV file, but not in config fails to raise error #156

Set up

Expected behaviour

Actual behaviour