Closed ronaldvdv closed 1 year ago
@ronaldvdv nice idea! At the moment we rely on naming schemes for multiple files to create labels like this (for example, the plot example in the readme uses the 'Log' column, which is constructed from the log file name, to label runs in the way you are suggesting) but something that builds this string without relying on log file naming could be handy.
I think this is best done in post-processing though, rather than by trying to inject the logic earlier. It's actually a bit harder to do nicely before the defaults are filled, because a numeric pandas column must have float dtype if it contains missing values (a restriction of pandas' implementation). And if the user wants to intervene in the logic, it would also be simpler in post-processing than through callback logic. I had a quick attempt at this here that could be applied to an existing summary dataframe - let me know if this does the trick? Maybe we could collect helpers like this in a submodule.
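A minimal sketch of what such a post-processing helper might look like (the function name, column layout, and dataframe shape are assumptions for illustration, not the actual implementation linked above):

```python
import pandas as pd

def parameter_labels(summary: pd.DataFrame, parameter_columns: list) -> pd.Series:
    """Build a run label like 'Method1-MIPFocus3' per row, skipping NaN (unset) values."""
    def label_row(row):
        parts = []
        for col in parameter_columns:
            value = row[col]
            if pd.notna(value):
                # Render whole floats as ints so we get 'Method1', not 'Method1.0'.
                if isinstance(value, float) and value.is_integer():
                    value = int(value)
                parts.append(f"{col}{value}")
        return "-".join(parts)
    return summary.apply(label_row, axis=1)

# Toy summary dataframe: NaN means the parameter was not set explicitly.
summary = pd.DataFrame({"Method": [1.0, float("nan")], "MIPFocus": [3.0, 2.0]})
labels = parameter_labels(summary, ["Method", "MIPFocus"])
# labels: ['Method1-MIPFocus3', 'MIPFocus2']
```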
A couple of other things to think about:
@simonbowly I believe the naming scheme for the log filenames is based on our internal tools that kick off the runs in the first place right? In that case we should avoid relying on those names in our publicly released tool, or document the expected filename structure.
I'm also thinking that filling in the default values is itself a form of postprocessing. Your suggestion is then basically an "undo" postprocessor. Shouldn't we instead make filling in the default values an optional step?
Should some parameters such as TimeLimit be eliminated from these checks for explicit parameters?
Yes, at least for my usage: When tuning, I almost always set this parameter and it's the same for all runs. Shall we add an argument to the postprocessing function that specifies which parameters to ignore? We can set some useful defaults including TimeLimit (perhaps also MIPGap).
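As a quick sketch of what that argument could look like (the names and the set of default ignored parameters are assumptions, not settled API):

```python
# Hypothetical `ignored` argument for the post-processing helper, so that
# parameters set identically on every run (e.g. TimeLimit) don't clutter labels.
DEFAULT_IGNORED = frozenset({"TimeLimit", "MIPGap"})  # assumed defaults

def select_parameter_columns(columns, ignored=DEFAULT_IGNORED):
    """Keep only the parameter columns that should appear in run labels."""
    return [c for c in columns if c not in ignored]

cols = ["Method", "MIPFocus", "TimeLimit", "MIPGap"]
print(select_parameter_columns(cols))  # ['Method', 'MIPFocus']
```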
If a parameter is set explicitly to its default, should this be captured? (Currently this does not appear in the logs as an explicit parameter change, so it still shows as NaN before the defaults are filled.)
Ah, I didn't know that! That changes the story a bit. I was wondering if we want to see the difference between "default because unspecified" and "default because explicitly set". If the log file works as you say and we don't want to rely on filenames, then we can't really see the difference right? And then I fully agree that postprocessing would be sufficient.
> I believe the naming scheme for the log filenames is based on our internal tools that kick off the runs in the first place right? In that case we should avoid relying on those names in our publicly released tool, or document the expected filename structure.
Yes, and good point!
> I'm also thinking that filling in the default values is itself a form of postprocessing. Your suggestion is then basically an "undo" postprocessor. Shouldn't we instead make filling in the default values an optional step?
I don't think so. It is kind of a post-processing step, but it's a fairly important one: without it, plotly silently drops entries corresponding to null values of a parameter column when that parameter is used as a colour or coordinate in a box plot. So I see this as more "cleanup" than "post-process". It also means the numeric types of parameters are correct; without that step, the same approach to producing naming strings comes out with e.g. `Method1.0-MIPFocus3.0` instead of the cleaner `Method1-MIPFocus3`. There's a lot more fiddling required on the user side to get this right after the fact. Does that seem reasonable?
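A quick demonstration of the dtype point: pandas forces a column containing missing values to float, which is why unfilled labels render as `Method1.0`. (The fill value `-1` here is just for illustration, not necessarily the parameter's real default.)

```python
import pandas as pd

# A parameter column with a missing value cannot keep a plain integer dtype.
col = pd.Series([1, None, 2])
print(col.dtype)  # float64 - NaN cannot be stored in a plain int column

# Once the defaults are filled in (assume -1 is the default, for illustration),
# the column can be cast back to integer, giving clean labels like 'Method1'.
filled = col.fillna(-1).astype(int)
print(filled.tolist())  # [1, -1, 2]
```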
> Yes, at least for my usage: When tuning, I almost always set this parameter and it's the same for all runs. Shall we add an argument to the postprocessing function that specifies which parameters to ignore? We can set some useful defaults including TimeLimit (perhaps also MIPGap).
Sure, this sounds good, I can add that.
When the tool extracts parameter settings into individual columns, each run that did not have an explicitly defined value for a certain parameter will have the default parameter value in that column. This means that from the parameter columns, one cannot easily see which parameters were defined.
The default values are added in `fill_default_parameters`.
The most important use case for me would be to show a simple summary of the parameter combination in tables and plots. For example, if Heuristics was not set explicitly for a particular run but NodeMethod was set to 1, then we could summarize the log file as "NodeMethod=1".
I would propose we change the `fill_default_parameters` function. We first find the columns that relate to parameter settings (using `re_parameter_column.match(column)` like here). Then for each log file, we collect the non-NaN values of these columns, pass them through one or more callbacks (together with the parameter names) and store the result in a new column. One example of a (default) callback would be the string formatter above. Another example would be to count the number of non-default values, which is often relevant for tuning to prefer smaller combinations of parameters.
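The proposed callback mechanism could be sketched roughly like this (all names here are hypothetical, not the existing API): each callback receives the explicitly-set parameters of one run as a dict and produces one value for a new summary column.

```python
import pandas as pd

def format_label(params: dict) -> str:
    """Example default callback: 'Method1-MIPFocus3' from explicit parameters."""
    return "-".join(f"{name}{value:g}" for name, value in params.items())

def count_nondefault(params: dict) -> int:
    """Example callback: number of explicitly-set parameters, useful for tuning."""
    return len(params)

def apply_parameter_callbacks(summary, parameter_columns, callbacks):
    """For each run, pass the non-NaN parameter values through each callback."""
    for column_name, callback in callbacks.items():
        summary[column_name] = summary.apply(
            lambda row: callback(
                {c: row[c] for c in parameter_columns if pd.notna(row[c])}
            ),
            axis=1,
        )
    return summary

summary = pd.DataFrame({"Method": [1.0, float("nan")], "MIPFocus": [3.0, float("nan")]})
apply_parameter_callbacks(
    summary, ["Method", "MIPFocus"],
    {"Label": format_label, "NumParams": count_nondefault},
)
# summary["Label"]:     ['Method1-MIPFocus3', '']
# summary["NumParams"]: [2, 0]
```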