Closed willu47 closed 1 year ago
Hi @willu47! I was looking at this issue today, and have an implementation question I am hoping we can discuss!
My first thought was to expand the base `ReadStrategy` method, `_check_index( .. )`, to check both the datatypes and column headers of the input data. An (untested) thought on how to do this is shown in the code snippet below.
The advantage of this implementation is that every read strategy can use this method. The disadvantage is that all the data is read in first and only then checked, which might not be ideal for very large data files.
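To make the trade-off concrete, here is a rough sketch (not the snippet referred to above, and not otoole's actual API) of a header check that reads only the first row of a CSV, so a mismatch can be reported before a large file is fully loaded. The function name `check_header` and its arguments are hypothetical; datatype checks would still need the data itself.

```python
import csv
import io


def check_header(handle, expected):
    """Hypothetical helper: compare a CSV header row with the configured columns.

    `handle` is any open text file object; `expected` is the list of column
    names taken from the config. Only the first row is consumed.
    """
    header = next(csv.reader(handle), [])
    missing = [name for name in expected if name not in header]
    extra = [name for name in header if name not in expected]
    if missing or extra:
        raise ValueError(f"Header mismatch: missing={missing}, extra={extra}")
    return header


# Usage with an in-memory file standing in for a data file
sample = io.StringIO("REGION,VALUE\nSIMPLICITY,1\n")
columns = check_header(sample, ["REGION", "VALUE"])
```

Raising on both missing and extra columns is a design choice in this sketch; a shared base-class check could equally warn on extras and only fail on missing ones.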
Moreover, for some read strategies (specifically, anything under the `ReadTabular` class), the `_check_parameter( .. )` method already does some of this. It identifies when the input data has an extra header that is not in the config file and automatically drops it. The logic only goes one way, though: if a header is defined in the config file but missing from the input data, a `KeyError` is raised.
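The asymmetry can be illustrated with a tiny stand-in. This is not the real `_check_parameter( .. )` code, just an assumed equivalent that selects the configured columns from a row; the name `select_configured_columns` is hypothetical.

```python
def select_configured_columns(row, configured):
    """Hypothetical stand-in: keep only the columns named in the config.

    Extra keys in `row` are silently dropped; a configured name that is
    absent from `row` raises KeyError.
    """
    return {name: row[name] for name in configured}


# An extra column in the data is dropped without any error
kept = select_configured_columns(
    {"REGION": "SIMPLICITY", "VALUE": 1, "EXTRA": 99},
    ["REGION", "VALUE"],
)

# A column missing from the data raises KeyError
try:
    select_configured_columns({"REGION": "SIMPLICITY"}, ["REGION", "VALUE"])
    missing_raised = False
except KeyError:
    missing_raised = True
```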
My question is: do you think expanding the `_check_index( .. )` method under the `ReadStrategy` class is the best option here? Or would it be better to add logic to each specific read strategy (i.e. make the `_check_parameter( .. )` method more robust, and add a similar one for `ReadDataFile`)?
I think expanding the `_check_index( .. )` method, and removing the checks from `_check_parameter( .. )`, is better, even if we take a slight efficiency hit. But I just want to get a second opinion before putting more work into this :) Thanks so much!
Set up
`config.yaml`:
`data/Test.csv` file:
Expected behaviour
After running
I would expect an error to be raised because the csv file does not match the config.
Actual behaviour
The command runs successfully. The results datafile contains only the region column. Note that if there are multiple rows, then the region column in this case would contain duplicate values.