SforAiDl / genrl

A PyTorch reinforcement learning library for generalizable and reproducible algorithm implementations with an aim to improve accessibility in RL
https://genrl.readthedocs.io
MIT License

Adding a new Data Bandit using the Titanic Data #301

Closed TMorville closed 4 years ago

TMorville commented 4 years ago

I am getting an error that might be related to #300, or possibly to my custom implementation.

To reproduce:

1) Create a Kaggle account and download the data from https://www.kaggle.com/c/titanic
2) Run this code to create a genrl-compatible data set:

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def _format_titatic():

    gender_submission = pd.read_csv('gender_submission.csv')
    test = pd.read_csv('test.csv')
    train = pd.read_csv('train.csv')

    # note: only the object (string) columns are NaN-filled here;
    # float columns such as Age keep their NaNs
    train_str = train.select_dtypes(include='object').fillna('0')
    train_float = train.select_dtypes(include='float64')
    train_int = train.select_dtypes(include='int64')

    le = LabelEncoder()

    train_str_enc = train_str.apply(le.fit_transform)

    train_enc = pd.concat([train_str_enc, train_float, train_int], axis=1)

    # column 0 holds the label (Survived shifted to 1/2); the remaining columns are the context
    _df = pd.DataFrame()

    _df[0] = train_enc.Survived + 1

    for i, c in enumerate(list(train_enc.drop('Survived', axis=1))):
        _df[i + 1] = train_enc[c]

    return _df
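
As a quick sanity check on the formatted frame (891 rows, one label column plus 11 context features):

    _df = _format_titatic()
    print(_df.shape)        # should be (891, 12)
    print(_df[0].unique())  # the two labels, Survived shifted to 1 and 2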

Now run the custom bandit code with the small change from #300 (it is still unconfirmed whether that is actually an error):

import torch

from typing import Tuple
from genrl.utils.data_bandits.base import DataBasedBandit

class TitanicDataBandit(DataBasedBandit):

    def __init__(self, **kwargs):
        super(TitanicDataBandit, self).__init__(**kwargs)

        self._df = _format_titatic()
        self.n_actions = len(self._df[0].unique())
        self.context_dim = self._df.shape[1] - 1
        self.len = len(self._df)

        print(self.n_actions, self.context_dim, self.len)

    def reset(self) -> torch.Tensor:
        self._reset()
        self.df = self._df.sample(frac=1).reset_index(drop=True)
        return self._get_context()

    def _compute_reward(self, action: int) -> Tuple[int, int]:
        label = self._df.iloc[self.idx, 0]
        r = int(label == (action + 1))
        return r, 1

    def _get_context(self) -> torch.Tensor:
        return torch.tensor(
            self._df.iloc[self.idx, 1:].values,
            device=self.device,
            dtype=torch.float,
        )

bandit = TitanicDataBandit()
context = bandit.reset()

from genrl.agents import NeuralLinearPosteriorAgent

agent = NeuralLinearPosteriorAgent(bandit)
context = bandit.reset()

action = agent.select_action(context)
new_context, reward = bandit.step(action)

from genrl.trainers import DCBTrainer

trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)

yields a shape error:

Started at 31-08-20 15:40:43
Training NeuralLinearPosteriorAgent on TitanicDataBandit for 5000 timesteps
timestep                  regret/regret             reward/reward             regret/cumulative_regret  reward/cumulative_reward  regret/regret_moving_avg  reward/reward_moving_avg  
100                       0                         1                         45                        55                        0.45                      0.55                      
200                       0                         1                         89                        111                       0.445                     0.555                     
300                       1                         0                         136                       164                       0.452                     0.548                     
400                       1                         0                         178                       222                       0.444                     0.556                     
500                       1                         0                         226                       274                       0.464                     0.536                     

Encounterred exception during training!
size mismatch, [2 x 12], [51] at ../aten/src/TH/generic/THTensorMath.cpp:292

Training completed in 1 seconds
Final Regret Moving Average: 0.46 | Final Reward Moving Average: 0.54
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/rl/lib/python3.6/site-packages/genrl-0.0.1-py3.6.egg/genrl/trainers/bandit.py", line 185, in train
    action = self.agent.select_action(context)
  File "/usr/local/anaconda3/envs/rl/lib/python3.6/site-packages/genrl-0.0.1-py3.6.egg/genrl/agents/bandits/contextual/neural_linpos.py", line 153, in select_action
    values = torch.mv(beta, torch.cat([latent_context.squeeze(0), torch.ones(1)]))
RuntimeError: size mismatch, [2 x 12], [51] at ../aten/src/TH/generic/THTensorMath.cpp:292
{'regrets': [0,

I can't seem to figure out why this fails after 500 time steps.

EDIT:

It seems that the shape of beta changes from [2, 51] to [2, 12] at timestep 500, so it no longer matches the 51-dimensional vector passed to torch.mv. For some reason this part of NeuralLinearPosteriorAgent.select_action fails:

            beta = (
                torch.tensor(
                    np.stack(
                        [
                            np.random.multivariate_normal(
                                self.mu[i], var[i] * self.cov[i]
                            )
                            for i in range(self.n_actions)
                        ]
                    )
                )
                .to(self.device)
                .to(torch.float)
            )

and the resulting exception triggers the fallback calculation of beta:

        except np.linalg.LinAlgError as e:  # noqa F841

            print("Linalg error.")

            beta = (
                (
                    torch.stack(
                        [
                            torch.distributions.MultivariateNormal(
                                torch.zeros(self.context_dim + 1),
                                torch.eye(self.context_dim + 1),
                            ).sample()
                            for i in range(self.n_actions)
                        ]
                    )
                )
                .to(self.device)
                .to(torch.float)
            )
TMorville commented 4 years ago

There was an error in the except block.

Replacing self.context_dim with self.latent_dim yields the correct dimensions and allows training.
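
For reference, the except block would then look something like this (only the dimensions change, so that beta lines up with the 51-dimensional [latent_context, 1] vector from the traceback):

        except np.linalg.LinAlgError as e:  # noqa F841

            print("Linalg error.")

            # sample beta with latent_dim + 1 columns instead of context_dim + 1
            beta = (
                (
                    torch.stack(
                        [
                            torch.distributions.MultivariateNormal(
                                torch.zeros(self.latent_dim + 1),
                                torch.eye(self.latent_dim + 1),
                            ).sample()
                            for i in range(self.n_actions)
                        ]
                    )
                )
                .to(self.device)
                .to(torch.float)
            )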

Can the authors confirm that this is correct? If yes, I can open a PR.

threewisemonkeys-as commented 4 years ago

I was able to get the following error after following the reproduction steps:

2 11 891

Started at 31-08-20 14:44:10
Training NeuralLinearPosteriorAgent on TitanicDataBandit for 5000 timesteps
timestep                  regret/regret             reward/reward             regret/cumulative_regret  reward/cumulative_reward  regret/regret_moving_avg  reward/reward_moving_avg  
100                       1                         0                         51                        49                        0.51                      0.49                      
200                       1                         0                         94                        106                       0.47                      0.53                      
300                       0                         1                         140                       160                       0.448                     0.552                     
400                       1                         0                         195                       205                       0.492                     0.508                     
500                       0                         1                         229                       271                       0.44                      0.56                      

Encounterred exception during training!
array must not contain infs or NaNs

Training completed in 1 seconds
Final Regret Moving Average: 0.444 | Final Reward Moving Average: 0.556
Traceback (most recent call last):
  File "/content/genrl/genrl/trainers/bandit.py", line 185, in train
    action = self.agent.select_action(context)
  File "/content/genrl/genrl/agents/bandits/contextual/neural_linpos.py", line 128, in select_action
    for i in range(self.n_actions)
  File "/content/genrl/genrl/agents/bandits/contextual/neural_linpos.py", line 128, in <listcomp>
    for i in range(self.n_actions)
  File "mtrand.pyx", line 4082, in numpy.random.mtrand.RandomState.multivariate_normal
  File "/usr/local/lib/python3.6/dist-packages/scipy/linalg/decomp_svd.py", line 109, in svd
    a1 = _asarray_validated(a, check_finite=check_finite)
  File "/usr/local/lib/python3.6/dist-packages/scipy/_lib/_util.py", line 246, in _asarray_validated
    a = toarray(a)
  File "/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py", line 499, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

This was resolved by removing NaNs from the dataframe with self._df = self._df.fillna(0).
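
In the bandit's __init__ above, that amounts to something like this (only the fillna line is new; the NaNs most likely come from the float columns such as Age, which _format_titatic leaves unfilled):

    def __init__(self, **kwargs):
        super(TitanicDataBandit, self).__init__(**kwargs)

        self._df = _format_titatic()
        # replace remaining NaNs (e.g. in Age) before the agent ever sees them
        self._df = self._df.fillna(0)
        self.n_actions = len(self._df[0].unique())
        self.context_dim = self._df.shape[1] - 1
        self.len = len(self._df)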

The reason the error shows up after 500 steps is that by default, the DCBTrainer waits 500 steps before starting to update the agent's parameters.

threewisemonkeys-as commented 4 years ago

> There was an error in the except block.
>
> Replacing self.context_dim with self.latent_dim yields the correct dimensions and allows training.
>
> Can the authors confirm that this is correct? If yes, I can open a PR.

This might also be a legitimate issue. Looking into it

threewisemonkeys-as commented 4 years ago

@TMorville could you check if removing NaN values from the dataframe works for you?

TMorville commented 4 years ago

I don't get the NaN error when running the code 🤔

1) If I keep NaN values in the data and use latent_dim, it also works and leads to a cumulative reward of 1843.

2) If I remove NaN values from the data and use context_dim, it also works and leads to a cumulative reward of 1539.

threewisemonkeys-as commented 4 years ago

Yep, you are right, it looks like it should be latent_dim instead of context_dim.

Not sure why the except block was being executed in your case since it wasn't in mine. Maybe different numpy versions classify different things as LinAlgError. 🤔

Either way, thanks for raising the issue! Feel free to open a PR.

threewisemonkeys-as commented 4 years ago

> 1) If I keep NaN values in the data and use latent_dim, it also works and leads to a cumulative reward of 1843.

Is this after running for 5000 steps? I am consistently getting >3000 cumulative reward.

TMorville commented 4 years ago

I played around with it for a bit and found that:

My genrl is installed from source with version 0.0.1.