GeoscienceAustralia / uncover-ml

Machine Learning system for Geoscience Australia uncover project
Apache License 2.0

Restructure config file and parsing #63

Closed brenmous closed 4 years ago

brenmous commented 4 years ago

I've gone through the config module in an attempt to document and understand it. There are some issues that lead me to want to restructure some aspects of the YAML file and the parsing.


Pickle settings:


Outbands:

The method for selecting outbands needs to be looked at and made more explicit, or at the very least clearly documented.

See #83


Lack of guards for accessing parameters:

There are many attempts to access configuration parameters throughout the code without checking whether they've been set.

e.g. optimisation requires the 'optimisation_output' parameter, but there's no check that it has actually been provided until there's an attempt to access it at the very end of the gridsearch script.

So I want to implement checks for parameters when the config file is parsed, to prevent these runtime errors caused by users forgetting parameters.
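
A minimal sketch of what a parse-time check could look like. The names ConfigError, require and parse_optimisation are hypothetical, and so is the assumption that 'optimisation_output' sits inside an 'optimisation' block next to 'algorithm'; the input is a plain dict of the kind a YAML loader returns.

```python
class ConfigError(Exception):
    """Raised when a required configuration parameter is missing."""


def require(block, key, block_name):
    """Return block[key], or raise ConfigError naming the missing parameter."""
    if not isinstance(block, dict) or block.get(key) is None:
        raise ConfigError(
            f"'{block_name}' block requires the '{key}' parameter"
        )
    return block[key]


def parse_optimisation(raw):
    """Validate the optimisation block when the config is parsed.

    The block layout here is an assumption made for illustration.
    """
    block = raw.get("optimisation")
    if block is None:
        return None  # optimisation was not requested
    return {
        "algorithm": require(block, "algorithm", "optimisation"),
        "optimisation_output": require(
            block, "optimisation_output", "optimisation"
        ),
    }
```

With this in place, a missing 'optimisation_output' fails as soon as the config is loaded rather than at the end of a long gridsearch run.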


'Resample' not implemented:

There's a 'resample' block in the YAML for performing target resampling but it doesn't appear to be implemented and the parameters aren't used in the code.

See #71


Config object attributes being written outside of config module:

There's a 'cluster' attribute that gets set on the config object outside the config module. This occurs when running 'predict' and is based on whether the provided model ends with '.cluster'. Writing to the global configuration outside of the initial parsing is bad form and adds complexity. This instance (and others if they occur) should be remedied.
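
One possible remedy, sketched under assumptions: the flag is derived once when the config is constructed and exposed as a read-only property, so nothing outside the config module can assign to it. The class name ParsedConfig and the model_file argument are hypothetical and not the existing Config signature.

```python
class ParsedConfig:
    """Hypothetical config object that owns the 'cluster' flag."""

    def __init__(self, raw, model_file=None):
        self.raw = raw
        # Derived once at parse time from the model filename.
        self._cluster = bool(model_file) and model_file.endswith(".cluster")

    @property
    def cluster(self):
        # No setter is defined, so assignment from outside raises
        # AttributeError.
        return self._cluster
```

The 'predict' command would then pass the model path in when building the config instead of setting the attribute afterwards.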


Randomforestregressor and cubist models use a temporary directory called 'results' for intermediate products (cross-val models etc). This should be changed to the actual system temp directory and stored on the config object as self.tmp_dir.

See #66
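
A minimal sketch of the tmp_dir idea using the standard library's tempfile module. The class name TmpDirConfig, the 'uncoverml_' prefix, the tmp_path helper and the cleanup at interpreter exit are all assumptions made for illustration.

```python
import atexit
import os
import shutil
import tempfile


class TmpDirConfig:
    """Hypothetical config fragment that owns a system temp directory."""

    def __init__(self):
        # Created under the platform's temp location instead of a
        # 'results' directory in the working directory.
        self.tmp_dir = tempfile.mkdtemp(prefix="uncoverml_")
        # Remove the directory at interpreter exit (assumed behaviour).
        atexit.register(shutil.rmtree, self.tmp_dir, ignore_errors=True)

    def tmp_path(self, name):
        """Return a path for an intermediate product inside tmp_dir."""
        return os.path.join(self.tmp_dir, name)
```

Models would then write cross-val intermediates to something like config.tmp_path(...) rather than building their own paths.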


'plot_covariates' - the plotting only occurs if 'rawcovariates' is also set (and pickled targets and features aren't loaded). This isn't explained, so some users may be setting 'plot_covariates' without setting 'rawcovariates' and wondering where their plots are.
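
Besides documenting the dependency, the parser could warn about it. A sketch, assuming both parameters live in the same block of the parsed YAML and that the caller knows whether pickled data was loaded; will_plot_covariates is a hypothetical helper.

```python
import warnings


def will_plot_covariates(block, pickles_loaded=False):
    """Return True only if covariate plots will actually be produced.

    Warns when 'plot_covariates' is set but the conditions described
    above are not met. The block layout is an assumption.
    """
    plot = bool(block.get("plot_covariates"))
    raw = bool(block.get("rawcovariates"))
    if plot and not raw:
        warnings.warn(
            "'plot_covariates' is set but 'rawcovariates' is not; "
            "no plots will be produced",
            UserWarning,
        )
    elif plot and pickles_loaded:
        warnings.warn(
            "'plot_covariates' is set but pickled targets/features were "
            "loaded; no plots will be produced",
            UserWarning,
        )
    return plot and raw and not pickles_loaded
```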


The learning and optimisation blocks both have an algorithm parameter. These can be combined, as it doesn't make sense to optimise for one algorithm and then train a model using another.

brenmous commented 4 years ago

There are also no integration tests or system tests, so none of the test suites cover parsing the config and providing the parameters to the workflow steps. This will have to change so that any restructuring won't break UncoverML.

brenmous commented 4 years ago

Given the size of the init for the Config class, and the fairly clear distinction between [covariate loading and scaling], [target loading and algorithm selection], [prediction output and formatting], we can split the parsing into these three chunks to make it easier to read. It may also be possible to only parse the portions we need depending on the command being run.
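
A rough sketch of how the three-way split with per-command parsing could look. The function names, the top-level keys ('features', 'targets', 'learning', 'prediction') and the 'learn' command name are assumptions for illustration; only 'predict' is named above.

```python
def parse_covariates(raw):
    """Covariate loading and scaling."""
    return {"features": raw.get("features", [])}


def parse_learning(raw):
    """Target loading and algorithm selection."""
    return {"targets": raw.get("targets"), "learning": raw.get("learning")}


def parse_prediction(raw):
    """Prediction output and formatting."""
    return {"prediction": raw.get("prediction")}


# Which chunks each command needs (hypothetical mapping).
PARSERS_BY_COMMAND = {
    "learn": (parse_covariates, parse_learning),
    "predict": (parse_covariates, parse_prediction),
}


def parse_config(raw, command):
    """Parse only the chunks of the raw YAML dict that 'command' uses."""
    parsed = {}
    for parser in PARSERS_BY_COMMAND[command]:
        parsed.update(parser(raw))
    return parsed
```

Each chunk can then be read and tested on its own, and a command only fails on parameters it actually uses.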

brenmous commented 4 years ago

Changes to make: