fused_data_set.py
-----------------
- CLEAN_UP: The 'save' and 'save_pickle' functions are not implemented and should be removed until they are.
- UPGRADE: The set_data method takes full vectors covering all the jobs. One should have the option of editing data in a specific job or in multiple jobs. Furthermore, one should be able to specify a name or a range/set of names when setting a bunch of data for a given job id (a sketch of this flexible selection follows at the end of this list).
- UPGRADE: At line 114, verifying that the data is the same length as job_cnt implicitly assumes that a given variable is a scalar. Some additional meta-data may be needed to allow the data to be arrays.
- CLEAN_UP: In a couple of functions, 'set_data' for example, the 'name' argument defaults to None and an exception is thrown immediately if it is None. A better approach is to not give 'name' a default argument at all.
- MODIFICATION: The status data needs low-level manipulation methods as well. What if I only want to set job id 1, for example, and leave the rest as 'unset'? Furthermore, the status flag can be stored as an int, but it is better that the interface exposes descriptive set/get methods like has_data_been_set(name, job_id) (see the status sketch at the end of this list).
- MODIFICATION: add_empty_data should be renamed to 'declare_variable'
- MODIFICATION: get_output and set_output shouldn't exist, because we want to keep the input/output distinction based on how the data-set is connected to a larger work-flow. These functions basically do what the get/set data methods do, only for data in a single job-id. It is better to upgrade get/set data to be flexible enough to cover that case instead.
- MODIFICATION: In 'add_output', group the objects in tuples instead of lists, since the assumption is that there will only be 3 pieces of data.
- UPGRADE: In 'get_job_list', the job_range argument could be made more flexible so that it accepts a list, an upper/lower pair, or a single job (see the flexible-selection sketch at the end of this list).
- UPGRADE: In 'get_job_list', a second optional argument should be given to control how the job-ids are selected. Currently it searches ALL variables and selects a job only if there is one variable that doesn't have a 1. The filtering should be controlled as follows: 1) One should be able to consider only a sub-set of variable names; for example, some variable names could come from a CFD work-flow while others come from a HAWC work-flow. 2) There are also different ways of filtering based on status. Maybe you want to know all the jobs that succeeded, or failed, or have not been run, and maybe you want ALL or ANY of the variables to achieve a given condition. I would consider using bit-wise AND, OR and EQUAL masks (see the get_job_list sketch at the end of this list).
- UPGRADE: The status flag should use the smallest integer needed. Only 2 bits per entry are required: one for set or not, another for fail or not. The rest of the bits are superfluous and waste memory. Another option to consider is a vector of bools (see the status sketch at the end of this list).
- MODIFICATION: write_output should maybe be renamed to execute or something similar...
- CLEAN_UP: The 'type' method clashes with the built-in name, so we should not define our own version. Furthermore, this method does not seem to be used, so it should be deleted.
- UPGRADE: In get_numpy_array, there should be an argument that can be used to select a range of job ids (see the flexible-selection sketch at the end of this list).
- CLEAN_UP: In push_input there is a default_input_used variable that doesn't seem to be used.
- CLEAN_UP: In 'pull_output' there is a bunch of code to get the rank. This is only needed when the MPI library is loaded, so it should be removed.
- FUTURE UPGRADE: The data_set_job object should eventually be removed, so that the case-running can work directly with a data-set object...
- FUTURE UPGRADE: A generic way of connecting work-flows should be implemented in the data-set object
- CLEAN_UP: In general, more comments and doc-strings explaining the class are always a good thing...
- CLEAN_UP: Remove any irrelevant comments.
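
A minimal sketch of the flexible job/name selection mentioned in the set_data, get_job_list and get_numpy_array items above, assuming the data is stored per variable name as a list of per-job values. The class name, signatures and the _normalize_jobs helper are illustrations, not the current API, and the same normalization idea would extend to lists of variable names:

    import numpy as np

    class FlexibleDataSet(object):
        """Toy stand-in for the data-set object; the layout is an assumption."""

        def __init__(self, job_cnt):
            self.job_cnt = job_cnt
            self.data = {}                      # variable name -> list of per-job values

        def declare_variable(self, name):
            self.data[name] = [None] * self.job_cnt

        @staticmethod
        def _normalize_jobs(jobs, job_cnt):
            # Accept a single id, an inclusive (lower, upper) tuple, a list of ids,
            # or None meaning every job.
            if jobs is None:
                return list(range(job_cnt))
            if isinstance(jobs, int):
                return [jobs]
            if isinstance(jobs, tuple) and len(jobs) == 2:
                return list(range(jobs[0], jobs[1] + 1))
            return list(jobs)

        def set_data(self, name, values, jobs=None):
            # 'name' has no default, so forgetting it fails loudly.
            job_ids = self._normalize_jobs(jobs, self.job_cnt)
            for value, job_id in zip(values, job_ids):
                self.data[name][job_id] = value

        def get_numpy_array(self, names, jobs=None):
            # Return a (jobs x variables) array restricted to the selected job ids.
            job_ids = self._normalize_jobs(jobs, self.job_cnt)
            return np.array([[self.data[n][j] for n in names] for j in job_ids])

    # Only jobs 1 and 2 are touched; the remaining entries stay unset.
    ds = FlexibleDataSet(5)
    ds.declare_variable('wind_speed')
    ds.set_data('wind_speed', [8.0, 9.0], jobs=(1, 2))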
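
A minimal sketch, for the two status items above, of keeping the status in the smallest integer type (only 2 bits of each entry are used) behind descriptive accessors such as has_data_been_set(name, job_id). The bit layout, class name and method names are assumptions:

    import numpy as np

    SET_BIT = 0b01      # data has been written for this (variable, job) pair
    FAIL_BIT = 0b10     # the job producing this value failed

    class StatusTable(object):
        """Per-variable, per-job status stored in uint8 arrays."""

        def __init__(self, names, job_cnt):
            self._status = {name: np.zeros(job_cnt, dtype=np.uint8) for name in names}

        def mark_set(self, name, job_id):
            self._status[name][job_id] |= SET_BIT

        def mark_failed(self, name, job_id):
            self._status[name][job_id] |= FAIL_BIT

        def has_data_been_set(self, name, job_id):
            return bool(self._status[name][job_id] & SET_BIT)

        def has_job_failed(self, name, job_id):
            return bool(self._status[name][job_id] & FAIL_BIT)

    # Set only job id 1 and leave the rest as 'unset'.
    table = StatusTable(['wind_speed'], job_cnt=4)
    table.mark_set('wind_speed', 1)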
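
A minimal sketch of the get_job_list filtering described above, restricting the test to a sub-set of variable names and combining per-variable bit-wise mask/value tests with ALL or ANY. The argument names and the condition encoding (bit 0 = set, bit 1 = failed, as in the status sketch) are assumptions:

    def get_job_list(status, names=None, mask=0b01, value=0b01, combine='all'):
        # 'status' is a dict of per-job uint8 arrays, one entry per variable name.
        # A job is selected when (status & mask) == value for ALL or ANY of the
        # selected variable names.
        names = list(status) if names is None else names
        job_cnt = len(next(iter(status.values())))
        reducer = all if combine == 'all' else any
        return [job_id for job_id in range(job_cnt)
                if reducer((status[n][job_id] & mask) == value for n in names)]

    # Jobs where every selected variable has been set:
    #     get_job_list(status, names=['cfd_load'], mask=0b01, value=0b01, combine='all')
    # Jobs where any variable failed:
    #     get_job_list(status, mask=0b10, value=0b10, combine='any')
    # Jobs that have not been run yet:
    #     get_job_list(status, mask=0b01, value=0b00, combine='all')
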
fused_mpi_cases.py
------------------
- MODIFICATION: The pre/post Exec functionality should be implemented in the parent class.
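
A minimal sketch of moving the pre/post exec hooks into the parent class; the class and method names below are assumptions:

    class CaseRunner(object):
        """Parent class owning the pre/post execution hooks."""

        def pre_exec(self, case):
            pass                          # optional hook, overridden by child classes

        def post_exec(self, case):
            pass                          # optional hook, overridden by child classes

        def run_case(self, case):
            self.pre_exec(case)
            self.execute_case(case)       # the actual work stays in the child class
            self.post_exec(case)

        def execute_case(self, case):
            raise NotImplementedError
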
fused_np_array.py
-----------------
- FUTURE UPGRADE: In the future, the data-set should be able to take a full numpy array to construct itself. In the long run I don't want to support 2 ways of running a DOE, one through a numpy array and another through the data-set (a sketch of such a constructor follows at the end of this list)...
- MODIFICATION: Independent_Variable_np_Array is redundant because the parent class already does what this does
- MODIFICATION: In read_np_array, a default argument is given that will cause an exception.
- CLEAN_UP: In np_Array_Reader, add comments explaining what all the variables mean.
- CLEAN_UP: In np_Array_Reader and other parts, use uppercase 'NP' for np in the names of classes.
- BUG: At line 53, var_size must be set to 1, not 11.
- MODIFICATION: Basically, there need to be more comments in general about the intended use of the different classes, functions and variables.
- MODIFICATION: np_Array_Work_Flow and np_Array_Job assume that the results are returned to the case runner and that it is the responsibility of the case runner to collect and manage the results. I don't want this; each object should do one thing and one thing only.
- MODIFICATION: Overall, this source file should eventually be made obsolete.
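
A minimal sketch of constructing a data-set directly from a full numpy array, as mentioned in the first item of this list; the function name and the column-to-name mapping are assumptions:

    import numpy as np

    def data_set_from_array(array, names):
        # Build a 'variable name -> list of per-job values' mapping from a
        # (jobs x variables) array, one column per variable.
        array = np.asarray(array)
        if array.shape[1] != len(names):
            raise ValueError('need exactly one variable name per column')
        return {name: list(array[:, col]) for col, name in enumerate(names)}

    # A 3-job DOE over two inputs:
    doe = np.array([[ 8.0, 0.10],
                    [ 9.0, 0.20],
                    [10.0, 0.30]])
    data = data_set_from_array(doe, ['wind_speed', 'turbulence_intensity'])
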
fused_surrogate.py
------------------
- MODIFICATION: This file is a wrapper around some sklearn objects, so it should be called FUSED_sklearn.py; there are other surrogate packages that we may support in the future. Since it appears a lot of this is sklearn-specific, let's just separate it from the generic surrogate stuff.
- SUGGESTION: It seems there may be a bit too much over-wrapping... I am not sure if Linear_Model and Kriging_Model are really needed. However, I am not sure, so I am not suggesting anything.