Currently, if the testing and training ({train} U {validate}) sets are drawn from the same source shot list, the ratio conf['model']['train_frac'] is used to randomly divide the source shots without regard to shot class. The same applies when splitting the train and validate sets with conf['model']['validation_frac'].
So, while the division of the overall shot counts will match the desired fractions to within 1/N (where N is the total number of shots), the division of the disruptive and non-disruptive shots among the sets may not be nearly as close to that fraction. This is only a problem when the number of disruptive (or non-disruptive) samples is low and/or the training and testing sets are drawn from different raw lists. As the number of samples -> infinity, of course N_{validate, disrupt}/N_{train, disrupt} -> conf['model']['validation_frac'], etc.
There is no real reason not to explicitly divide the disruptive and non-disruptive classes when splitting the shot sets (i.e. stratified splitting), so I think we should at least add it as an option, if not make it the default behavior.
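A minimal sketch of what stratified splitting could look like. The function name, the shot-list representation, and the `is_disruptive` mapping are hypothetical, not the project's actual API; the point is only that each class is shuffled and divided by the fraction independently, so the class balance is preserved in both output sets:

```python
import random


def stratified_split(shots, is_disruptive, frac, seed=0):
    """Split a shot list so each class is divided by `frac` independently.

    shots          -- list of shot IDs (hypothetical representation)
    is_disruptive  -- mapping from shot ID to bool (hypothetical)
    frac           -- fraction of each class assigned to the first set,
                      e.g. conf['model']['train_frac']
    Returns (first, second), where `first` holds ~frac of each class.
    """
    rng = random.Random(seed)
    first, second = [], []
    for cls in (True, False):
        # Gather and shuffle one class at a time, then cut at the fraction.
        group = [s for s in shots if is_disruptive[s] == cls]
        rng.shuffle(group)
        k = round(frac * len(group))
        first.extend(group[:k])
        second.extend(group[k:])
    return first, second
```

The same helper could then be applied twice: once with train_frac to separate training from testing, and again with validation_frac to carve the validate set out of the training set.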
[ ] Consider renaming train_frac to test_frac (value = 1.0 - train_frac) or another name to make it clear that the "training fraction" is further divided between the training and hold-out validation sets.