VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

CATs fails to converge with fixed context and simple convex reward function #2891

Closed maxpagels closed 3 years ago

maxpagels commented 3 years ago

Describe the bug

Not sure if this is a bug, so feel free to close if not. I am running a simple test of CATs, with min_value 0 and max_value 1, where the cost is defined as follows:

The context is fixed for each round, so I would assume this to be a pretty simple learning task. I've made a Jupyter notebook that grid-searches a bunch of options (actions, bandwidth, epsilon) and can't find a combination that satisfactorily predicts close to 0.5 when exploiting. Some parameter combinations result in wildly different average costs. Changing GD to FTRL doesn't help. Even if I ensure good exploration by changing epsilon, I see no reduction in accrued average cost per round over time.
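The exact cost definition from the notebook isn't reproduced above; as a stand-in, a convex cost minimized at action = 0.5 (the optimum described elsewhere in this issue) might look like the following. This is my own assumption, not the notebook's actual function:

```python
# Hypothetical stand-in for the notebook's cost function: a convex cost
# with its minimum (zero cost) at action = 0.5.
def get_cost(action: float) -> float:
    return abs(action - 0.5)
```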

So my question is, is this behaviour a feature of CATs, some bug, or something I am just missing in my code?

To Reproduce

Install a fresh pyvw wheel from master (PyPI doesn't yet have CATs support) and run the attached notebook: CATS-test.ipynb.zip

Expected behavior

I would expect predictions for such a simple use case to converge to the optimal prediction of 0.5 (0 cost).

Observed Behavior

The cost plots in the notebook do not trend down over time, regardless of the grid-search combination.

Environment

pyvw, OS X, latest built wheel from the master branch

olgavrou commented 3 years ago

@maxpagels from your notebook the example that is being used to predict/learn has no features, i.e. the context is empty. Is this intentional?
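For reference, the difference between an empty context and one carrying a constant feature in VW text format looks like this (illustrative strings; the feature name is made up):

```python
# VW text-format input lines: an empty context vs. one with a single
# constant feature in the default namespace (feature name is illustrative).
empty_example = "|"
featured_example = "| const:1"

print(empty_example)
print(featured_example)
```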

olgavrou commented 3 years ago

There is a CATs tutorial notebook that has not yet been reviewed, but it is available at https://github.com/VowpalWabbit/jupyter-notebooks/pull/6 if you think it might be of any help.

maxpagels commented 3 years ago

@olgavrou no features is intentional; the same non-convergence happens if you add a constant feature.

olgavrou commented 3 years ago

@maxpagels it actually looks like 0.2 is already too big a bandwidth for this use case. I ran the following (building off of your notebook):

# Imports assumed from the original notebook:
from vowpalwabbit import pyvw
import itertools
import pandas as pd
import matplotlib.pyplot as plt

# get_cost(action) is defined in the original notebook
# (a convex cost minimized at action = 0.5).

param_grid = {
    'cats': [8, 16, 32, 256, 1024],
    'epsilon': [0.1],
    'bw': [0.05, 0.1, 0.2, 0.5]
}

all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
for params in all_params:
    metrics = {
        'costs': [],
        'predictions': []
    }
    options = f"--cats {params['cats']} --min_value 0.0 --bandwidth {params['bw']} --max_value 1.0 --coin --epsilon {params['epsilon']}"
    print('options: ', options)
    vw = pyvw.vw(options, enable_logging=True)
    for i in range(100000):
        ex = "|"
        action, prob = vw.predict(ex)
        cost = get_cost(action)
        learn_ex = vw.parse(f"ca {action}:{cost}:{prob} {ex}", pyvw.vw.lContinuous)
        vw.learn(learn_ex)
        vw.finish_example(learn_ex)
        metrics['predictions'].append(action)
        metrics['costs'].append(cost)
    df = pd.DataFrame(metrics)
    print('mean pred:', df['predictions'].mean())
    df['predictions'].hist()
    plt.show()
    vw.finish()

and it looks like a bandwidth of 0.1 with --cats around 16 or 32 does well. A bandwidth of 0.05 could also work, although 0.1 does better across a wider range of --cats values. It also looks like --cats 8 and anything above --cats 256 isn't suitable for this test case.
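One way to read this result (a sketch, under the assumption that CATs discretizes [min_value, max_value] into --cats uniformly spaced actions and smooths over a radius of --bandwidth in action space): the bandwidth should be on the order of the spacing between adjacent actions, or the smoothing washes out the signal around the optimum.

```python
# Compare the spacing between discretized actions to the candidate
# bandwidths for the [0, 1] range used in this issue.
min_value, max_value = 0.0, 1.0

for n_actions in [8, 16, 32, 256, 1024]:
    spacing = (max_value - min_value) / n_actions
    print(f"--cats {n_actions:4d}: action spacing = {spacing:.5f}")

# With --cats 16 the spacing is 0.0625, so a bandwidth of 0.05-0.1 covers
# roughly one or two neighbouring actions, while 0.2 or 0.5 smooths over
# many of them.
```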

Let me know what you think.

A note on the Python side of the code: I switched predict to take the example string directly; otherwise, after every vw.parse() we would need to call vw.finish_example().

maxpagels commented 3 years ago

I'm attaching a heavily modified version of your notebook, with the same setup as before but with the range [0, 100] and 50 as the point of zero cost, so the best possible reward is 0. The grid search again gets nowhere near optimal results in any configuration; some even start off OK and then get much worse. I guess that by adding even more combinations I'd get closer, but I would have expected better learning even with some initial parameter set. If you can confirm this is working as intended, we can close this ticket. Unfortunately, for online learning without a simulator to start with, I can't use the algorithm for my use case: at the start, grid searching is not an option, and without exactly the right values learning seems to go pretty wrong, even for this simple case. I also wouldn't know what ranges to grid-search over even if I could grid-search against a simulator.

cats-example.ipynb.zip
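If the bandwidth-versus-action-spacing intuition above holds, one hypothetical heuristic for picking a starting grid without a simulator is to express bandwidth as a fraction of the action range rather than as an absolute number, so the same relative smoothing applies to [0, 1] and [0, 100]. This is an assumption on my part, not documented VW guidance:

```python
# Hypothetical heuristic: derive bandwidth candidates from the action range,
# so the same relative smoothing is used regardless of scale.
def bandwidth_candidates(min_value, max_value, fractions=(0.05, 0.1, 0.2)):
    span = max_value - min_value
    return [f * span for f in fractions]

print(bandwidth_candidates(0.0, 1.0))    # [0.05, 0.1, 0.2]
print(bandwidth_candidates(0.0, 100.0))  # [5.0, 10.0, 20.0]
```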

olgavrou commented 3 years ago

@maxpagels it looks like there might be a bug in the tree. I have created a draft PR #2900, with which I tested an adaptation of your latest notebook (attached). Let me know if you try the branch out. I still need to test it more. cats_predict_50.ipynb.zip

maxpagels commented 3 years ago

I used the wheel built from your PR branch. It generally seems much better, but if I remove --coin, which should then default to GD, I get weird spikes. This happens regardless of whether I set the exploration rate a bit higher. Screenshots attached.

(Screenshots attached: 2021-03-20 00:00:04, 2021-03-19 23:59:51)
olgavrou commented 3 years ago

@maxpagels CATs was mainly developed and tested using coin, but GD should also work by adjusting the various learning-rate options (i.e. --learning_rate, --power_t, ...). That being said, it might make sense to make coin the default for CATs, with the ability to switch to GD or something else if needed.
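As a sketch of what adjusting those rates looks like, the option strings below swap --coin for GD with explicit learning-rate settings. The flag names are standard VW options; the specific values are illustrative starting points to tune, not recommendations:

```python
# Illustrative option strings for running CATs with coin vs. plain GD.
# The learning-rate values here are placeholders to grid-search over.
base = "--cats 32 --min_value 0.0 --max_value 100.0 --bandwidth 1.0 --epsilon 0.1"

coin_opts = f"{base} --coin"
gd_opts = f"{base} --learning_rate 0.5 --power_t 0.5"

print(coin_opts)
print(gd_opts)
```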

maxpagels commented 3 years ago

Making it the default makes sense to me at least.

olgavrou commented 3 years ago

Re-opening, as we are checking some more things with Akshay.