autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

How to use f1 measure for "best" model? #56

Closed bml1g12 closed 6 years ago

bml1g12 commented 6 years ago

When I import fbeta_score with from talos.metrics.keras_metrics import fbeta_score, compile the model with this metric, and then run Talos with the parameter reduction_metric="fbeta_score", the output CSV seems to list the val_acc of the best epoch for val_acc, but only the first epoch's value for fbeta_score. Something seems to be going wrong here; if anything it should be producing the corresponding fbeta_score for that same epoch, I would have thought.

I am not interested in accuracy due to class imbalance in my system, and the accuracy saturates after a few epochs, so I need Talos to store, for each parameter combination, either:

a) the result of the last epoch, or
b) ideally, the result of the epoch with the best fbeta_score.

Given that fbeta_score has been implemented, I assume this must be possible but I don't see how.

I am using the latest dev branch v0.2 (as I have augmented data, I needed the functionality to supply x_val and y_val as parameters). In order to run this code without errors, I needed to change line 17 of talos/metrics/score_model.py from y_pred = self.keras_model.predict_classes(self.x_val) to y_pred = self.keras_model.predict(self.x_val).

Which might be related to my problem.

matthewcarbone commented 6 years ago

Might be related to #3

@bml1g12 I understand your problem. I do a lot of work with class imbalanced data sets as well. I can look into this for you soon, but @mikkokotila knows the code much better and might have quicker insight.

Also as an aside, glad to know that supplying x_val and y_val independently worked for you! πŸ‘

matthewcarbone commented 6 years ago

Sorry @bml1g12, just to clarify: are you saying the feature in which you input x_val and y_val does not work properly, and you had to make that change to y_pred?

bml1g12 commented 6 years ago

Thank you. Yes, having the x_val, y_val functionality is a great addition; without it I would not be able to use Talos. Indeed, the _val values printed by Keras seem to be working fine. (The issue is with the data that Talos chooses to save at the end of each parameter combination.)

Sorry yes I was not very clear, I will clarify:

With the source code unedited, when I ran:

h = ta.Scan(X_train, Y_train, x_val=X_dev, y_val=Y_dev, params=p, dataset_name="debug", experiment_no="1",
            model=keras_nn_model_talos, grid_downsample=0.002, talos_log_name="talos.log",
            reduction_method="spear", reduction_metric="fbeta_score")

I obtained the following stack trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-b4cbea7ca6f1> in <module>()
      8      'second_GRU_layer':[True, False]}
      9 h = ta.Scan(X_train, Y_train, x_val=X_dev, y_val=Y_dev, params=p, dataset_name="debug", experiment_no="1", 
---> 10             model=keras_nn_model_talos, grid_downsample=0.002, talos_log_name="talos.log", reduction_method="spear", reduction_metric="fbeta_score")
     11 
     12 ## I had to edit a line of ~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/metrics/score_model.py

~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, dataset_name, experiment_no, model, x_val, y_val, val_split, shuffle, search_method, reduction_method, reduction_interval, reduction_window, grid_downsample, reduction_threshold, reduction_metric, round_limit, talos_log_name, debug, seed, clear_tf_session, disable_progress_bar)
    140         # input parameters section ends
    141 
--> 142         self._null = self.runtime()
    143 
    144     def runtime(self):

~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/scan/Scan.py in runtime(self)
    145 
    146         self = scan_prepare(self)
--> 147         self = scan_run(self)

~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     27                      disable=self.disable_progress_bar)
     28     while len(self.param_log) != 0:
---> 29         self = rounds_run(self)
     30         self.pbar.update(1)
     31     self.pbar.close()

~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/scan/scan_run.py in rounds_run(self)
     59 
     60     _hr_out = run_round_results(self, _hr_out)
---> 61     self._val_score = get_score(self)
     62     write_log(self)
     63     self.result.append(_hr_out)

~/anaconda3/envs/tfgpu-keras/lib/python3.6/site-packages/talos/metrics/score_model.py in get_score(self)
     15 
     16     try:
---> 17         y_pred = self.keras_model.predict_classes(self.x_val)
     18        # y_pred = self.keras_model.predict(self.x_val)
     19         return Performance(y_pred, self.y_val, self.shape, self.y_max).result

AttributeError: 'Model' object has no attribute 'predict_classes'

This issue is unrelated to x_val, y_val, as the following code still produces it:

h = ta.Scan(X_train, Y_train, params=p, dataset_name="debug", experiment_no="1", 
            model=keras_nn_model_talos, grid_downsample=0.002, talos_log_name="talos.log")

I only mentioned that because it is the reason I am using the development branch: that feature is not yet in the main branch. Sorry for the confusion. I fixed that code by changing predict_classes to predict, based on a forum post I read somewhere.
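For reference, predict_classes only exists on Sequential models in Keras 2, so a functional-API Model has to derive class labels from predict by hand. A rough sketch of what that thresholding step could look like (the 0.5 cut-off is an assumption for a sigmoid output like mine):

# predict() returns probabilities; predict_classes() is Sequential-only
probs = model.predict(x_val)

# binary / sigmoid output: threshold at 0.5 (assumed)
y_pred = (probs > 0.5).astype(int)

# a softmax multi-class output would instead use:
# y_pred = probs.argmax(axis=-1)

So simply swapping in predict may feed raw probabilities into the scoring step, which could be why my results look off.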

My model is as follows:

# imports assumed here (standalone Keras 2, which Talos used at the time)
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                          Dropout, GRU, TimeDistributed, Dense)
from keras.optimizers import Adam
from talos.metrics.keras_metrics import fbeta_score


def keras_nn_model_talos(x_train, y_train, x_val, y_val, params):
    X_input = Input(shape=x_train.shape[1:])

    # Step 1: CONV layer
    X = Conv1D(filters=int(params["num_filters"]), kernel_size=15, strides=4)(X_input)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Dropout(rate=params["dropout_rate"])(X)

    if params["second_GRU_layer"]:
        # Step 2: optional first GRU layer (returns the full sequence)
        X = GRU(units=int(params["gru_hidden_units"]), return_sequences=True)(X)
        X = Dropout(rate=params["dropout_rate"])(X)
        X = BatchNormalization()(X)

    # Step 3: second GRU layer
    X = GRU(units=int(params["gru_hidden_units"]), return_sequences=True)(X)
    X = Dropout(rate=params["dropout_rate"])(X)
    X = BatchNormalization()(X)
    X = Dropout(rate=params["dropout_rate"])(X)

    # Step 4: time-distributed dense layer (sigmoid output per timestep)
    X = TimeDistributed(Dense(1, activation="sigmoid"))(X)

    model = Model(inputs=X_input, outputs=X)

    opt = Adam(lr=params["adam_learning_rate"], beta_1=0.9, beta_2=0.999, decay=0.01)
    model.compile(loss='binary_crossentropy', optimizer=opt,
                  metrics=[fbeta_score])  # could also use "acc", my_recall, my_precision, f1

    history = model.fit(x_train, y_train,
                        batch_size=int(params["batch_size"]),
                        validation_data=(x_val, y_val),
                        epochs=int(params["epochs"]))

    return history, model

With regard to issue #3, I also find it odd that Keras implemented only a batch-wise F1 score, and that their solution to implementing it at a per-epoch level was to throw it in the garbage can entirely. I would have thought about 50% of Keras users need an F1 score at some point.

Would I be right in saying that, in theory at least, setting reduction_metric="fbeta_score" should produce a final .csv file with each row showing a parameter combination and the respective scores for the epoch with the highest fbeta_score?

mikkokotila commented 6 years ago

@bml1g12 To answer the question about reduction_metric first: to get this, you have to pass fbeta_score as a metric in your model.compile (which I can see you are doing). reduction_metric is there for the purpose of the optimization algorithm (anything other than random). This should yield what you are looking for.

That said, are you reporting that you have validated that this is actually not the case, and that in your experiment .csv you instead get the fbeta_score of the first epoch of each permutation? If so, it should be a very simple fix.

ps. I'm also baffled by the decision to just give up on F1 score as opposed to dealing with it.

bml1g12 commented 6 years ago

I see, so at least I have provided the correct argument (albeit reduction_metric not being strictly necessary).

Exactly: I instead get what seems to be only the first epoch of each permutation (and definitely not the highest fbeta_score). A long shot, but if Talos is set to report the minimum fbeta_score, then maybe that is the reason for this bug.

If I understand correctly, the intended behavior is that Talos saves to a CSV the "best" value of the metric across all epochs within the permutation, usually accuracy. But if you supply several metrics, how does Talos decide which metric it should be using for this purpose?

bml1g12 commented 6 years ago

I would actually like to select based on val_fbeta_score (validation result). Here is an example output for:

p = {'adam_learning_rate': [0.01, 0.001, 0.0001],
     'num_filters': [12, 32, 64, 196],
     'gru_hidden_units':[32, 64, 128, 196],
     'dropout_rate':[0.2,0.5,0.8],
     'batch_size': [64, 128, 256],
     'epochs': [10],
     'second_GRU_layer':[True, False]}
h = ta.Scan(X_train, Y_train, params=p, dataset_name="debug", experiment_no="1", 
            model=keras_nn_model_talos, grid_downsample=0.002, talos_log_name="talos.log", reduction_method="spear", reduction_metric="val_fbeta_score")

Train on 1260 samples, validate on 540 samples
Epoch 1/10
1260/1260 [==============================] - 3s 2ms/step - loss: 0.8298 - fbeta_score: 0.2810 - val_loss: 0.8771 - val_fbeta_score: 0.3295
Epoch 2/10
1260/1260 [==============================] - 1s 644us/step - loss: 0.7233 - fbeta_score: 0.3322 - val_loss: 0.7126 - val_fbeta_score: 0.3814
Epoch 3/10
1260/1260 [==============================] - 1s 646us/step - loss: 0.6889 - fbeta_score: 0.3599 - val_loss: 0.7488 - val_fbeta_score: 0.3750
Epoch 4/10
1260/1260 [==============================] - 1s 646us/step - loss: 0.6602 - fbeta_score: 0.3924 - val_loss: 0.8280 - val_fbeta_score: 0.3586
Epoch 5/10
1260/1260 [==============================] - 1s 673us/step - loss: 0.6361 - fbeta_score: 0.4247 - val_loss: 0.7142 - val_fbeta_score: 0.4258
Epoch 6/10
1260/1260 [==============================] - 1s 672us/step - loss: 0.6119 - fbeta_score: 0.4499 - val_loss: 0.6764 - val_fbeta_score: 0.4617
Epoch 7/10
1260/1260 [==============================] - 1s 703us/step - loss: 0.5913 - fbeta_score: 0.4703 - val_loss: 0.6412 - val_fbeta_score: 0.4842
Epoch 8/10
1260/1260 [==============================] - 1s 654us/step - loss: 0.5705 - fbeta_score: 0.4892 - val_loss: 0.5530 - val_fbeta_score: 0.5617
Epoch 9/10
1260/1260 [==============================] - 1s 635us/step - loss: 0.5515 - fbeta_score: 0.5144 - val_loss: 0.4873 - val_fbeta_score: 0.6037
Epoch 10/10
1260/1260 [==============================] - 1s 638us/step - loss: 0.5339 - fbeta_score: 0.5344 - val_loss: 0.5228 - val_fbeta_score: 0.5852

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:13<00:00, 13.56s/it]
Scan Finished!
round_epochs val_loss val_fbeta_score loss fbeta_score adam_learning_rate num_filters gru_hidden_units dropout_rate batch_size epochs second_GRU_layer
10 0.5227934398033 0.329458241992527 0.533916110462613 0.281042905270107 0.001 196 32 0.2 256 10 1

matthewcarbone commented 6 years ago

Ah ok I understand now. Glad to know the manual input of validation sets was working properly.

Regarding your last question, this is related to #54 (I think) and I am going to post an easy fix for it. I presume you just want to essentially sort the output dataframe by the val_fbeta_score, correct?

Posted the answer to what I think your question was in #54. Let me know if that was helpful!

bml1g12 commented 6 years ago

I presume you just want to essentially sort the output dataframe by the val_fbeta_score, correct?

Yes, that's right.

I think your explanation in #54 relates to how to sort the resulting table by a chosen metric, but as I understand it, which values even end up in a row of the table currently depends on what metric is selected as "the best" for that parameter combination. I may be misunderstanding: I am currently assuming the object produced by ta.Reporting does not save every epoch's metric result, but instead stores the "best" across all epochs?

i.e. if we do a 2-epoch run where Keras showed this output:

Parameter 1:
    epoch 1: accuracy: 0.2, val_fbeta_score: 0.2
    epoch 2: accuracy: 0.3, val_fbeta_score: 0.1

If I understand correctly, it will currently output in the final CSV either a row like:

A) Parameter 1, accuracy: 0.3, val_fbeta_score: 0.1
or
B) Parameter 1, accuracy: 0.2, val_fbeta_score: 0.2

So my question is, how do I tell it to produce B) and not A)? Regardless of how the output itself is column-sorted.

matthewcarbone commented 6 years ago

@bml1g12 Ah I understand. I believe the output you see is the final epoch's result, so you will not see the result of the first epoch, only the second. Currently to my knowledge we do not save every epoch's results. That said, you can implement early stopping criteria if you want, but I don't recommend it (neither does Andrew Ng) since you have no way of knowing how that will impact your overall end (testing set) result.

In principle, we should report a statistical average over many executions evaluated on the validation set. This is a planned feature (discussed in #40, #18) that adds an extra order of computational complexity to the calculation. It is an expensive, albeit necessary, part of the plan for Talos, I would say.
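For illustration only (this is not Talos code, and build_and_fit is a hypothetical callable that trains one instance of a given permutation and returns a Keras History), the idea boils down to something like:

def averaged_val_metric(build_and_fit, n_runs=5, metric="val_fbeta_score"):
    """Average the last-epoch validation metric over several independent fits."""
    scores = []
    for _ in range(n_runs):
        history = build_and_fit()
        scores.append(history.history[metric][-1])
    return sum(scores) / len(scores)

Every permutation would then cost n_runs times as much to evaluate, which is the extra order of complexity mentioned above.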

So are you looking for the history so to speak? That might be new-issue-worthy.

bml1g12 commented 6 years ago

I see, thank you, I misunderstood - I see now that Talos currently selects what to store based on the last epoch (i.e. chronology), not "best loss" / "best accuracy" etc.

So this issue (#56) then relates to the fact that, in my test, it does not produce the last epoch's val_fbeta_score (shown in my post above with the pasted output). Do you know of any solution for this?

(With regard to the statistical average, that seems like a great feature in case you get an anomalous change in model performance in the last epoch. I agree with Andrew Ng that early stopping should generally be avoided; equally, a history would allow users to explore the effect of the number of epochs on model performance without needing to run Talos with extra permutations for epochs.)

matthewcarbone commented 6 years ago

Ah. Yeah that's a problem. I will try to look into this when I can. If you have any suggestions as well feel free to comment!

matthewcarbone commented 6 years ago

The problem is here I think. This is precisely what I think @mikkokotila was talking about in #3. The problem it seems is that the F1 score is not implemented at the epoch level. I still need to dig some more to figure out why this is returning the first value for the F1 score and not any other one...
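For reference, the usual way around the batch-averaging limitation is to compute the score over the full validation set from a callback at the end of each epoch; a rough sketch (assuming a sigmoid output, a 0.5 threshold, and sklearn for the scoring -- none of this is Talos code):

from keras.callbacks import Callback
from sklearn.metrics import f1_score

class EpochF1(Callback):
    """Collects an F1 score over the whole validation set after every epoch."""
    def __init__(self, x_val, y_val):
        super(EpochF1, self).__init__()
        self.x_val, self.y_val = x_val, y_val
        self.scores = []

    def on_epoch_end(self, epoch, logs=None):
        y_pred = (self.model.predict(self.x_val) > 0.5).astype(int)
        self.scores.append(f1_score(self.y_val.ravel(), y_pred.ravel()))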

bml1g12 commented 6 years ago

Thank you.

I'm not sure it is a problem with the F1 score itself, as it produces sensible results in the per-epoch output from Keras; it seems to me that Talos just isn't saving the right value. It would be painful, but I could work around this issue by printing all Keras output to a file, using grep to obtain the last epoch's result for each permutation, and then stitching it back together with the parameters reported by Talos.

But I am unfamiliar with the source code so maybe it is something more fundamental.
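In case it helps, a sketch of a slightly less brittle variant of that workaround (the helper name and CSV path are made up, not Talos API): have the model function itself append every epoch's metrics to a CSV, so nothing depends on what Talos later decides to store.

import csv
import os

def dump_history(history, params, path="epoch_history.csv"):
    """Append one row per epoch, tagged with the hyperparameters, to a CSV."""
    metric_keys = sorted(history.history.keys())
    param_keys = sorted(params.keys())
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["epoch"] + metric_keys + param_keys)
        for epoch in range(len(history.history[metric_keys[0]])):
            writer.writerow([epoch + 1]
                            + [history.history[k][epoch] for k in metric_keys]
                            + [params[k] for k in param_keys])

Calling dump_history(history, params) just before return history, model inside keras_nn_model_talos would keep the full per-epoch record alongside whatever Talos writes.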

matthewcarbone commented 6 years ago

Ah I'm sorry. You're totally right. The lack of sleep is getting to me! I'll get on this sometime soon when I have more energy.

I'm not sure if Talos is actually using the Keras metric to generate the F1 score you see in the pandas output or if it's using its own. This needs to be made consistent in the future.

Anyway, I will look into this at some point. I appreciate you bringing this to our attention since it is a very real problem that needs fixing.

bml1g12 commented 6 years ago

Thank you!

mikkokotila commented 6 years ago

@bml1g12 regarding "A long-shot, but if Talos is set to report the minimum fbeta_score then maybe that is the reason for this bug."

You got it. That's it. There is one hard-coded remnant (I hope the last one) from the beginning of the package, where it takes the minimum value unless the word 'acc' is in the metric's name in the history object. I will fix that ASAP as it's kind of silly. Creating a new issue for it.

mikkokotila commented 6 years ago

This is now resolved in daily-dev with more info on the closed #62.

Sorry for causing doubt with the bad decision I had made previously. Unfortunately the resolved situation is not much more intuitive, i.e. we have to include the string 'acc' in any custom metrics that are added to Talos, and the user has to do the same for their own custom metrics. This should be ok though, as it is in accord with the Keras convention of using the _acc postfix (at least in Keras 2 this seems to be the case).

I will leave this open for a bit in case I missed something.

bml1g12 commented 6 years ago

OK, I'll give it a try. So I should append "_acc" to any custom metric name, got it. To clarify, what would be the consequence of not appending _acc? Simply that Talos takes the lowest value, I guess?

As long as the documentation explains this, it isn't too unintuitive at least.

mikkokotila commented 6 years ago

@bml1g12 Great :) You are right, the consequence is that it will be treated as something to be minimized, i.e. the lowest value will be given instead. I also considered the possibility of showing min, peak, and max, but that would mean 3 times more columns, out of which in most cases 1 or 2 would be noise.
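So for a custom metric the change can be as small as a rename, since (as far as I recall) Keras records the metric under the function's name in the history object (a sketch, not something shipped with Talos):

from talos.metrics.keras_metrics import fbeta_score

def fbeta_acc(y_true, y_pred):
    # same computation; the 'acc' in the name makes Talos treat it as something to maximize
    return fbeta_score(y_true, y_pred)

# then in the model function:
# model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[fbeta_acc])

The experiment log should then show fbeta_acc and val_fbeta_acc columns.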

bml1g12 commented 6 years ago

OK I understand.

I think #62 may still be unresolved, because x94carbone said above "I believe the output you see is the final epoch's result", which I took to mean that the intended behavior was to select the last epoch's value.

I just tested it and it now seems to produce the highest value of each _acc metric across all epochs, as opposed to the last epoch's value.

Selecting the best value across all epochs can be a little confusing if you have more than one metric, because if one column of the table comes from a different epoch than another column, then you are essentially not comparing apples with apples. It is also not clear how many epochs it took to obtain the value displayed, as that is different for each column.

So I need to ask, what is the intended behavior?

a) For each metric, store the result of the last epoch.
b) For each metric, store its best value across all epochs. <--- seemingly the current behavior
c) Given a metric of interest, find the epoch that performed best on that metric, and store all metric values for that particular epoch (along with the epoch number).

I think (b) is useful if you are doing early stopping, but given that the epoch which produced the result can't currently be obtained, it would need that information for each metric. (a) is at least simple to interpret, and is what x94carbone seemed to think it should be. And (c) is what I would personally find most useful, I think.

In an ideal world, the user could select between methods (a), (b) or (c), but I can appreciate it might not be worth coding that. If the current behavior is the intended behavior, then I think it is crucial that a history is kept, so the user can figure out which epoch produced the result listed (and thereby reproduce it).

mikkokotila commented 6 years ago

What happens is that, for each metric, the best value across all epochs is stored in the experiment log. I like the idea of allowing the user to choose what they want to store, as some might want the last epoch for some reason. Also, regarding your point about the peak approach being confusing: do you think it would be enough to allow the user to set this in the Scan() parameters, or should we show both peak and last?

I don't understand 'c' though, could you clarify?

And I apologize for the confusion; it has to do with the rather cryptic way the related part of Talos is handled. Why it's cryptic is pretty important though: it allows us to avoid hardcoding hyperparameter names anywhere, and instead lets the user add any they like to use.

bml1g12 commented 6 years ago

I see, thank you. So the current behavior is method (b), and it is the intended behavior. I think the issue can be closed once the documentation explains the current behavior.

I think an option to show "peak across epochs" or "last epoch" in Scan() would be great. Whatever the case, it would be good if the documentation made it crystal clear which values are stored by default, as I think most users would assume each row of the table corresponds to a single "model"; i.e. most users would currently, erroneously, assume that a row of the output .csv shown by Talos could be generated by a specific epoch of a Keras model, when in reality each row potentially shows a mixture of results from different epochs.

To explain (c) by example, imagine you are interested in obtaining the optimal F1 validation score, with the sub-criterion that the precision must be >0.1. Currently this task is not possible using (a) or (b), because neither will show the precision from the same epoch alongside the optimal F1 score. Method (c) involves Talos first identifying the epoch with the optimal F1 score, then storing the corresponding metrics for that particular epoch.
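In code terms, for a single permutation's Keras History, (c) would amount to something like this (a sketch with illustrative names, not a proposal for the actual implementation):

import numpy as np

def best_epoch_row(history, select_on="val_fbeta_score"):
    """All logged metrics at the epoch where `select_on` peaks, plus the epoch number."""
    best = int(np.argmax(history.history[select_on]))
    row = {key: values[best] for key, values in history.history.items()}
    row["best_epoch"] = best + 1
    return row

That way every value in the row comes from the same epoch, and the epoch number itself is recorded so the result can be reproduced.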

matthewcarbone commented 6 years ago

I like the idea about allowing the user to choose what they want to store, as some might want the last for some reason.

@mikkokotila This is definitely a critical option since the last value is most representative of the final trained state of the model and therefore the best indicator of how the model will generalize to the reported evaluation on the testing set.

most users would currently erroneously assume that a row of output .csv shown by Talos could be generated by a specific epoch of a Keras model

Thank you, @bml1g12, for pointing this out. This is not something I would have considered before you mentioned it! I didn't realize this wasn't clear. We should fix this in the documentation.

PTerrier commented 6 years ago

The issue is in utils/results.py line 31. The returned result is the smallest value among epochs, not the largest one, as it should be. I corrected the conditions for my application and it works as intended.

for key in out.history.keys():
    t_t = array(out.history[key])
    if (key == 'val_acc') or (key == 'acc') or (key == 'fmeasure') or (key == 'val_fmeasure'):
        peak_epoch = argpartition(t_t, len(t_t) - 1)[-1]
    else:
        peak_epoch = argpartition(t_t, len(t_t) - 1)[0]
    peak = array(out.history[key])[peak_epoch]
    _rr_out.append(peak)
    p_epochs.append(peak_epoch)

It does the job, but it's not very elegant.

mikkokotila commented 6 years ago

This is now fixed in 458f973 and is available in dev. Note that as it stands we get the max epoch, so the last round option will have to be implemented separately. Creating a new issue for that.

Big thanks for everyone here. Closing now.

mikkokotila commented 6 years ago

This is handled in PR #80 and also takes care of #73. Thanks again! :)