Closed brenmous closed 4 years ago
There are also no integration tests or system tests, so none of the test suites cover parsing the config and providing its parameters to the workflow steps. This will have to change so that restructuring doesn't break UncoverML.
Given the size of the Config class's init, and the fairly clear distinction between [covariate loading and scaling], [target loading and algorithm selection], and [prediction output and formatting], we can split the parsing into these three chunks to make it easier to read. It may also be possible to parse only the portions needed for the command being run.
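A minimal sketch of that three-chunk split (section names, method names and attributes here are illustrative, not UncoverML's actual schema):

```python
# Hypothetical sketch: parse the YAML config in three focused chunks,
# and only the chunks the current command actually needs.
class Config:
    def __init__(self, raw, command=None):
        # `raw` is the dict produced by loading the YAML config file.
        if command in (None, "learn", "predict"):
            self._parse_covariates(raw)
        if command in (None, "learn"):
            self._parse_targets_and_algorithm(raw)
        if command in (None, "predict"):
            self._parse_output(raw)

    def _parse_covariates(self, raw):
        # Covariate loading and scaling.
        self.features = raw.get("features", [])

    def _parse_targets_and_algorithm(self, raw):
        # Target loading and algorithm selection.
        self.targets = raw.get("targets", {})
        self.algorithm = raw.get("learning", {}).get("algorithm")

    def _parse_output(self, raw):
        # Prediction output and formatting.
        self.output = raw.get("output", {})
```

With this shape, running `learn` never touches the output-formatting section, so a malformed output block can't break a training run.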
Changes to make:
I've gone through the config module in an attempt to document and understand it. There are some issues that lead me to want to restructure aspects of the YAML file and the parsing.
Pickle settings:
Outbands:
- In the code, outbands is taken as the minimum of the provided value and the actual number of band tags in the data. This isn't made clear to the user. It also leads to an arbitrarily high number (e.g. 10) being used to select all bands, which is vague: if someone provides 10, are there really 10 bands in the data, or is it a shortcut for getting all bands? Hard to tell from looking at a config file.
- The outbands parameter is used as the stop for index slicing, e.g. `bands[0:outbands]`. Is this intentional? It means if you want the band at position [4] you can't have it without also getting bands 0, 1, 2 and 3. I understand band [0] is the prediction value, so it makes sense to always include it, but what about the others?
- The example in the cookbook gives a range of 1:5 for outbands, which corresponds to the regression outputs [mean, variance etc.], but what if you're using a classifier with more than 6 classes?
- The method for selecting outbands needs to be looked at and made more explicit, or at the very least clearly documented. See #83.
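To make the slicing behaviour concrete, here is the current semantics as described above (variable names are hypothetical, not UncoverML's actual code):

```python
# Illustration of the outbands behaviour: the value is clamped to the real
# band count, then used as a slice stop, so individual bands can't be picked.
band_tags = ["Prediction", "Variance", "Lower quantile", "Upper quantile"]
outbands = 10  # an arbitrarily high value a user might supply to mean "all"

stop = min(outbands, len(band_tags))  # clamped to 4 here
selected = band_tags[0:stop]
# `selected` is all four bands; there is no way to get only band [3]
# without also taking bands 0, 1 and 2.
```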
Lack of guards for accessing parameters:
There are many places in the code that access configuration parameters without checking whether they've been set.
E.g. optimisation requires the 'optimisation_output' parameter, but there's no check that it's actually been provided until it's accessed at the very end of the gridsearch script.
So I want to implement checks for parameters when the config file is parsed to prevent these runtime errors caused by users forgetting parameters.
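A sketch of what such a parse-time check could look like, using the optimisation example above (function and key names are hypothetical):

```python
# Hypothetical parse-time guard: fail fast when 'optimisation' is configured
# without 'optimisation_output', instead of crashing at the end of gridsearch.
def validate_optimisation(raw):
    """`raw` is the dict loaded from the YAML config file."""
    if "optimisation" in raw and "optimisation_output" not in raw:
        raise ValueError(
            "'optimisation' block requires 'optimisation_output' to be set"
        )
```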
'Resample' not implemented:
There's a 'resample' block in the YAML for performing target resampling, but it doesn't appear to be implemented and its parameters aren't used in the code. See #71.
Config object attributes being written outside of config module:
There's a 'cluster' attribute that gets set on the config object outside the config module: it happens when running 'predict' and is based on whether the provided model filename ends with '.cluster'. Writing to the global configuration outside the initial parsing is bad form and adds complexity. This instance (and any others like it) should be remedied.
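One way to remedy it, assuming the model path is available at parse time (it may not be in the real code; names here are hypothetical):

```python
from types import SimpleNamespace

def parse_model_settings(config, model_file):
    # Decide once, during config parsing, whether this is a clustering run,
    # rather than mutating the config object from the 'predict' command.
    config.cluster = model_file.endswith(".cluster")
    return config
```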
Randomforestregressor and cubist models use a temporary directory called 'results' for intermediate products (cross-validation models etc.). This should be changed to an actual system temp directory, stored on the config object as self.tmp_dir. See #66.
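A minimal sketch of that change using the standard library (the Config shape here is illustrative):

```python
import tempfile

# Hypothetical sketch: allocate a real system temp directory at parse time
# instead of hardcoding a 'results' folder for intermediate products.
class Config:
    def __init__(self):
        # mkdtemp creates a fresh directory under the system temp location.
        self.tmp_dir = tempfile.mkdtemp(prefix="uncoverml_")
```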
'plot_covariates' - the plotting only occurs if 'rawcovariates' is also set (and pickled targets and features aren't loaded). This isn't explained, so some users may set 'plot_covariates' without setting 'rawcovariates' and wonder where their plots are.
'learning' and 'optimisation' blocks both have an 'algorithm' parameter. These can be combined, as it doesn't make sense to optimise for one algorithm but then train a model using another.
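Until the blocks are merged, a parse-time consistency check could catch the mismatch (a hedged sketch; key names follow the YAML blocks above):

```python
# Hypothetical check: reject configs where 'learning' and 'optimisation'
# name different algorithms, since optimising for one and training another
# makes no sense.
def check_algorithm_consistency(raw):
    learn_alg = raw.get("learning", {}).get("algorithm")
    opt_alg = raw.get("optimisation", {}).get("algorithm")
    if learn_alg and opt_alg and learn_alg != opt_alg:
        raise ValueError(
            f"'learning' uses {learn_alg!r} but 'optimisation' uses "
            f"{opt_alg!r}; these must match"
        )
```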