awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai

Possible Bug - Integration with Optuna - RuntimeWarning invalid value encountered in cast #3025

Open Serendipity31 opened 1 year ago

Serendipity31 commented 1 year ago

Description

I'm trying to tune a set of deepAR hyperparameters using Optuna. I'm following the instructions/example given here.

Optuna will complete trials, but en route to completing any given trial, numerous instances of a particular RuntimeWarning are generated. Here's what happens:

  1. I create the study object exactly as per the tutorial linked to above (see the sketch just after this list)

  2. I execute study.optimize() - for example:

    study.optimize(
      func = DeepARTuningObjective(dataset, prediction_length=1), 
      gc_after_trial = True,
      n_trials = 3,
      timeout = 600
    )
  3. At some point, it will get 'stuck' on one particular epoch. Every time it gets stuck, the same message is printed on the console. For example, if it gets stuck on epoch 5, it will look like this:

    Epoch 5: |          | 27/? [00:04<00:00,  6.25it/s, v_num=0, train_loss=1.400]C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast
    value = np.asarray(data[self.field], dtype=self.dtype)

    After first showing the above, the RuntimeWarning gets repeated in excess of 30 times. The messages printed to the console look like this:

    C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)
    C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)
    C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)
    C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)
    C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)

Then another instance of the following gets printed:

Epoch 5: |          | 27/? [00:04<00:00,  6.25it/s, v_num=0, train_loss=1.400]C:\Users\usr\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast
  value = np.asarray(data[self.field], dtype=self.dtype)
  4. This cycle continues for a while until suddenly everything seems to get 'sorted': the RuntimeWarnings stop, and execution proceeds as one would expect. From this point forwards, the output is per-epoch and reports the training loss.

Which epoch triggers this behaviour is never consistent. In all my experiments it happens in every trial, but it only ever gets stuck once per trial, regardless of how many epochs are in a trial or how many trials are requested in a single call to study.optimize().
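For completeness, the study object itself is created as in the tutorial, i.e. roughly:

```python
import optuna

# direction="minimize" as per the tutorial: the objective returns an error metric
study = optuna.create_study(direction="minimize")
```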

Error message or code output

C:\Users\usr~1\DOCUME~1\VIRTUA~1\DEEPAR~1\Lib\site-packages\gluonts\transform\convert.py:137: RuntimeWarning: invalid value encountered in cast value = np.asarray(data[self.field], dtype=self.dtype)

Environment

Data

lostella commented 1 year ago

@Serendipity31 it seems to me like the warning is not related to Optuna: you can check this by just training a standalone DeepAREstimator on the same data.

The warning appears to be originating from this line.

Things to do to identify the issue:

  1. Reproduce the issue with a standalone DeepAREstimator
  2. Reproduce the issue with a minimal dataset: for example, training on a single entry (maybe containing missing values?)

If you could share such an entry from step 2, I can also take a look into it.
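For 2., something along these lines should work as a starting point (an untested sketch: a single monthly series with a NaN in the target):

```python
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.torch import DeepAREstimator

# One monthly series; the target contains a missing value
dataset = ListDataset(
    [{"start": "2020-01", "target": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0,
                                     7.0, 8.0, 9.0, 10.0, 11.0, 12.0]}],
    freq="M",
)

estimator = DeepAREstimator(
    freq="M", prediction_length=1, trainer_kwargs={"max_epochs": 1}
)
predictor = estimator.train(dataset)  # check whether the RuntimeWarning shows up here
```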

Serendipity31 commented 1 year ago

Hi @lostella - Thanks for looking into this! I've spent some time trying to narrow it down (explained below). If you are able to shed any light on what's happening (and whether it functionally matters or not), that would be great. It's unclear to me whether this is something I need to resolve before training and testing a model 'for real' with my data.

What I've Tried

  1. I set up a standalone DeepAREstimator, and you're right that it has nothing to do with Optuna; that was a misunderstanding on my part. The same RuntimeWarning appears with a standalone DeepAREstimator.

  2. I have explored various permutations of minimal dataset (with and without features of various kinds; with and without missing data in the target). What I've deduced is that the warning message appears to be related to the presence of missing values in the target array.

Other observations

  1. The warning message is unaffected by the presence or absence of dynamic features or static categorical features.

  2. It seems like this particular warning is a floating-point warning coming from numpy. However, I have not managed to trigger it in very simple examples outside of actually training and forecasting with a DeepAREstimator. For instance, the following does not produce the warning (though see the snippet after this list for a cast that does):

    
```python
import numpy as np

a = [1.75, 2.34, 3.01, float('nan'), 5.35, 0.23]
np.asarray(a)  # does not produce the warning
np.ndarray(shape=(6,), dtype='float64', buffer=np.array(a))  # also does not produce the warning
```


3. Setting the imputation method to something other than the default `DummyValueImputation` in the `DeepAREstimator` does not stop the warning message from being triggered. 

4. The warning message doesn't print once per series in the dataset. One of my minimal examples includes 4 series (each of which has missing data in the target array), and another example has just 1 series (with missing data). In both cases `estimator.train()` produces the warning a single time (though I know when using my whole dataset it gets printed to the console many, many times). It's unclear to me what governs its printing.

5. The warning message is also produced when executing `forecasts = list(forecast_it)`

6. Models trained while the warning is produced don't _seem_ to yield dramatically different predictions. I created a version of the third example dataset provided below, turned more than 10% of the target data into NA, and then ran that altered version through the workflow outlined below. I generated ~10 forecasts from this version of the data and compared them to ~10 forecasts generated from the actual data. They weren't identical, but they were broadly similar. I know 10 data points isn't enough to say anything definitively, but anecdotally it is somewhat reassuring.
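Following up on observation 2: the one way I have found to trigger the identical warning text in isolation is casting a float array that contains NaN to an integer dtype (numpy >= 1.24 assumed). Whether this is what actually happens inside convert.py I can't say:

```python
import numpy as np

a = np.array([1.75, 2.34, np.nan])
np.asarray(a, dtype="int32")  # RuntimeWarning: invalid value encountered in cast
```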

#### Sample Datasets
I have three minimal datasets that I can share:

- Example 1 contains four series. They all have months where the target is NA because it doesn't exist
  [example01_fourts_gaps.csv](https://github.com/awslabs/gluonts/files/13180872/example01_fourts_gaps.csv)

- Example 2 is a single series. It has months where the target is NA
  [example02_onets_gaps.csv](https://github.com/awslabs/gluonts/files/13180874/example02_onets_gaps.csv)

- Example 3 is a single series. It has no missing data in any month.
  [example03_onets_nogaps.csv](https://github.com/awslabs/gluonts/files/13180877/example03_onets_nogaps.csv)

#### Workflow used with Sample Datasets
I created the objects in R and then used the [Reticulate package](https://raw.githubusercontent.com/rstudio/cheatsheets/main/reticulate.pdf) to access them in Python. 
```{python}
# Suppose the relevant R data frame is called 'data', the reticulate package has
# been loaded, and a Python virtual environment has been activated. To access
# 'data' in that environment, execute the line below; the R data frame comes
# across as a pandas DataFrame.
df = r.data
```

From here, the workflow proceeds in Python. I use the following to create a dataset in the gluonTS PandasDataset format:

```{python}
from gluonts.dataset.pandas import PandasDataset

ds = PandasDataset.from_long_dataframe(
    dataframe=df,
    target="target",
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
)
```

I define the dataentry_to_dataframe() function as per here, and then use the code below to make the training and validation objects.

```{python}
from gluonts.dataset.split import split

train, test_template = split(ds, offset=-1)
validation = test_template.generate_instances(prediction_length=1)
validation_input = [entry[0] for entry in validation]
validation_label = [dataentry_to_dataframe(entry[1]) for entry in validation]
```
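For reference, my dataentry_to_dataframe() is essentially the helper from that tutorial; from memory it looks roughly like this (treat as a sketch):

```{python}
import pandas as pd

def dataentry_to_dataframe(entry):
    # Rebuild a single-column DataFrame (indexed by period) from a gluonTS data entry
    return pd.DataFrame(
        entry["target"],
        columns=[entry.get("item_id")],
        index=pd.period_range(
            start=entry["start"],
            periods=len(entry["target"]),
            freq=entry["start"].freq,
        ),
    )
```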

The exact way I set up and ran the rest of the tests from today is shown below:

```{python}
from gluonts.torch import DeepAREstimator
from gluonts.torch.distributions import GammaOutput

estimator = DeepAREstimator(
    freq="M",
    prediction_length=1,
    context_length=1,
    num_layers=2,
    hidden_size=97,
    lr=0.0004845605502408138,
    weight_decay=6.726280769742332e-06,
    dropout_rate=0.1677960921560242,
    batch_size=60,
    distr_output=GammaOutput(),  # though the warning is also triggered with StudentTOutput()
    # imputation_method=LastValueImputation(),  # I have not tried other imputation options...
    trainer_kwargs={
        "max_epochs": 5,
        "deterministic": True,
    },
)

predictor = estimator.train(train, cache_data=True)  # triggers the warning ...
forecast_it = predictor.predict(validation_input)
forecasts = list(forecast_it)  # also triggers the warning ...
```