Closed brenmous closed 4 years ago
There are also no integration tests or system tests, so none of the test suites cover parsing the config and providing its parameters to the workflow steps. This will have to change so that restructuring doesn't break UncoverML.
Given the size of the Config class's init, and the fairly clear distinction between [covariate loading and scaling], [target loading and algorithm selection], and [prediction output and formatting], we can split the parsing into these three chunks to make it easier to read. It may also be possible to parse only the portions needed for the command being run.
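A minimal sketch of that three-chunk split (section names, method names and attributes here are illustrative, not UncoverML's actual schema):

```python
# Hypothetical sketch: parse the YAML config in three focused chunks,
# and only the chunks the current command actually needs.
class Config:
    def __init__(self, raw, command=None):
        # `raw` is the dict produced by loading the YAML config file.
        if command in (None, "learn", "predict"):
            self._parse_covariates(raw)
        if command in (None, "learn"):
            self._parse_targets_and_algorithm(raw)
        if command in (None, "predict"):
            self._parse_output(raw)

    def _parse_covariates(self, raw):
        # Covariate loading and scaling.
        self.features = raw.get("features", [])

    def _parse_targets_and_algorithm(self, raw):
        # Target loading and algorithm selection.
        self.targets = raw.get("targets", {})
        self.algorithm = raw.get("learning", {}).get("algorithm")

    def _parse_output(self, raw):
        # Prediction output and formatting.
        self.output = raw.get("output", {})
```

With this shape, running `learn` never touches the output-formatting section, so a malformed output block can't break a training run.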
Changes to make:
I've gone through the config module in an attempt to document and understand it. There are some issues that lead me to want to restructure aspects of the YAML file and the parsing.
Pickle settings:
Outbands:
- In the code, outbands is taken as the minimum of the provided value and the actual number of band tags in the data. This isn't made clear to the user. It also leads to an arbitrarily high number (e.g. 10) being used to select all bands, which is vague: if someone provides 10, are there really 10 bands in the data, or is it a shortcut for getting all bands? Hard to tell from looking at a config file.
- The outbands parameter is used as the stop for index slicing, e.g. `bands[0:outbands]`. Is this intentional? It means if you want the band at position [4] you can't have it without also getting bands 0, 1, 2 and 3. I understand band [0] is the prediction value, so it makes sense to always include it, but what about the others?
- The example in the cookbook gives a range of 1:5 for outbands, which corresponds to the regression outputs [mean, variance etc.], but what if you're using a classifier with more than 6 classes?
- The method for selecting outbands needs to be looked at and made more explicit, or at the very least clearly documented. See #83.
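To make the slicing behaviour concrete, here is the current semantics as described above (variable names are hypothetical, not UncoverML's actual code):

```python
# Illustration of the outbands behaviour: the value is clamped to the real
# band count, then used as a slice stop, so individual bands can't be picked.
band_tags = ["Prediction", "Variance", "Lower quantile", "Upper quantile"]
outbands = 10  # an arbitrarily high value a user might supply to mean "all"

stop = min(outbands, len(band_tags))  # clamped to 4 here
selected = band_tags[0:stop]
# `selected` is all four bands; there is no way to get only band [3]
# without also taking bands 0, 1 and 2.
```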
Lack of guards for accessing parameters:
There are many places in the code that access configuration parameters without checking whether they've been set.
E.g. optimisation requires the 'optimisation_output' parameter, but there's no check that it's actually been provided until it's accessed at the very end of the gridsearch script.
So I want to implement checks for parameters when the config file is parsed to prevent these runtime errors caused by users forgetting parameters.
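A sketch of what such a parse-time check could look like, using the optimisation example above (function and key names are hypothetical):

```python
# Hypothetical parse-time guard: fail fast when 'optimisation' is configured
# without 'optimisation_output', instead of crashing at the end of gridsearch.
def validate_optimisation(raw):
    """`raw` is the dict loaded from the YAML config file."""
    if "optimisation" in raw and "optimisation_output" not in raw:
        raise ValueError(
            "'optimisation' block requires 'optimisation_output' to be set"
        )
```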
'Resample' not implemented:
There's a 'resample' block in the YAML for performing target resampling, but it doesn't appear to be implemented and its parameters aren't used in the code. See #71.
Config object attributes being written outside of config module:
There's a 'cluster' attribute that gets set on the config object outside the config module: it happens when running 'predict' and is based on whether the provided model filename ends with '.cluster'. Writing to the global configuration outside the initial parsing is bad form and adds complexity. This instance (and any others like it) should be remedied.
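One way to remedy it, assuming the model path is available at parse time (it may not be in the real code; names here are hypothetical):

```python
from types import SimpleNamespace

def parse_model_settings(config, model_file):
    # Decide once, during config parsing, whether this is a clustering run,
    # rather than mutating the config object from the 'predict' command.
    config.cluster = model_file.endswith(".cluster")
    return config
```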
Randomforestregressor and cubist models use a temporary directory called 'results' for intermediate products (cross-validation models etc.). This should be changed to an actual system temp directory, stored on the config object as self.tmp_dir. See #66.
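A minimal sketch of that change using the standard library (the Config shape here is illustrative):

```python
import tempfile

# Hypothetical sketch: allocate a real system temp directory at parse time
# instead of hardcoding a 'results' folder for intermediate products.
class Config:
    def __init__(self):
        # mkdtemp creates a fresh directory under the system temp location.
        self.tmp_dir = tempfile.mkdtemp(prefix="uncoverml_")
```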
'plot_covariates' - the plotting only occurs if 'rawcovariates' is also set (and pickled targets and features aren't loaded). This isn't explained, so some users may set 'plot_covariates' without setting 'rawcovariates' and wonder where their plots are.
'learning' and 'optimisation' blocks both have an 'algorithm' parameter. These can be combined, as it doesn't make sense to optimise for one algorithm but then train a model using another.
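Until the blocks are merged, a parse-time consistency check could catch the mismatch (a hedged sketch; key names follow the YAML blocks above):

```python
# Hypothetical check: reject configs where 'learning' and 'optimisation'
# name different algorithms, since optimising for one and training another
# makes no sense.
def check_algorithm_consistency(raw):
    learn_alg = raw.get("learning", {}).get("algorithm")
    opt_alg = raw.get("optimisation", {}).get("algorithm")
    if learn_alg and opt_alg and learn_alg != opt_alg:
        raise ValueError(
            f"'learning' uses {learn_alg!r} but 'optimisation' uses "
            f"{opt_alg!r}; these must match"
        )
```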