USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems

`batch_size` error when lstm is `stateful` #107

Open jsadler2 opened 3 years ago

jsadler2 commented 3 years ago

I'm getting this error when trying to use LSTMModel:

ValueError: in user code:

    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/river_dl/rnns.py:40 call  *
        self.rnn_layer.reset_states(states=[h_init, c_init])
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py:914 reset_states  **
        raise ValueError('If a RNN is stateful, it needs to know '

    ValueError: If a RNN is stateful, it needs to know its batch size. Specify the batch size of your input tensors: 
    - If using a Sequential model, specify the batch size by passing a `batch_input_shape` argument to your first layer.
    - If using the functional API, specify the batch size by passing a `batch_shape` argument to your Input layer.
jzwart commented 3 years ago

Hmm, is this during prediction or training? Would this help:

model.rnn_layer.build(input_shape=x_data.shape)

I think it would go right after this line.
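
For reference, a rough sketch of what I mean (placeholder shapes; LSTMModel imported from river_dl.rnns as in the traceback above):

import numpy as np
from river_dl.rnns import LSTMModel

# placeholder training array with shape (batch/segments, time steps, features)
x_data = np.random.randn(42, 365, 2).astype("float32")

model = LSTMModel(hidden_size=20)

# build the stateful LSTM layer with a known input shape so reset_states()
# can infer the batch size before training starts
model.rnn_layer.build(input_shape=x_data.shape)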

jsadler2 commented 3 years ago

This was during training.

I'll give that a try

jsadler2 commented 3 years ago

So I added

self.rnn_layer.build((42, 365, 2))

right above this line https://github.com/USGS-R/river-dl/blob/ec2d9b97f0e333cb81cb579a8318fc2d69aaad92/river_dl/rnns.py#L30

And I got a new error:

RuleException:
TypeError in line 79 of /mnt/d/onedrive/OneDrive - DOI/research/drb/river-dl/Snakefile:
in user code:

    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /mnt/d/onedrive/OneDrive - DOI/research/drb/river-dl/river_dl/rnns.py:41 call  *
        self.rnn_layer.reset_states(states=[h_init, c_init])
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py:961 reset_states  **
        K.batch_set_value(set_value_tuples)
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/tensorflow/python/keras/backend.py:3706 batch_set_value
        x.assign(np.asarray(value, dtype=dtype(x)))
    /home/jsadler/miniconda3/envs/rgcn1/lib/python3.6/site-packages/numpy/core/_asarray.py:83 asarray
        return array(a, dtype, copy=False, order=order)

    TypeError: __array__() takes 1 positional argument but 2 were given
jdiaz4302 commented 2 years ago

Hey all, I'm working on integrating river-dl more closely with the reservoir forecasting repos, and I was curious whether any more progress had been made on this; I'm experiencing the initial error myself.

jdiaz4302 commented 2 years ago

Most of the references I've found on this solve the problem by simply passing a batch_input_shape argument to the first layer, as the error message suggests, but I haven't found a way to implement that within this approach/syntax. I'm fairly new to TensorFlow (but very familiar with ML/DL and implementation via PyTorch), and it appears that there are three general ways to write this model code; our way (model subclassing) isn't as well documented (e.g., the error message's suggestions deal with the other two).

When I find a place to specify batch_input_shape or batch_shape (e.g., at layers.LSTM(), model.build(), or model.fit()), the argument either seems to go unrecognized or explicitly raises an unknown-keyword error 😧
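
For context, here's a minimal sketch (not river-dl code) of where the error message's suggestions live in the other two APIs; I haven't found the equivalent hook for the subclassed model:

import tensorflow as tf
from tensorflow.keras import layers

batch_size, time_steps, n_features, hidden_size = 2, 10, 4, 5

# Sequential API: fix the batch size with batch_input_shape on the first layer
seq_model = tf.keras.Sequential([
    layers.LSTM(hidden_size, stateful=True, return_sequences=True,
                batch_input_shape=(batch_size, time_steps, n_features)),
    layers.Dense(1),
])

# Functional API: fix the batch size on the Input layer (shape + batch_size here,
# equivalent to the batch_shape=(batch_size, time_steps, n_features) the error suggests)
inp = tf.keras.Input(shape=(time_steps, n_features), batch_size=batch_size)
lstm_out = layers.LSTM(hidden_size, stateful=True, return_sequences=True)(inp)
func_model = tf.keras.Model(inp, layers.Dense(1)(lstm_out))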

jsadler2 commented 2 years ago

Thanks for looking into this, @jdiaz4302. I don't think we ever did find a solution to this. I've found a similar difficulty in finding solutions for other issues because we have used the "model subclassing" approach. From my understanding, I don't think we can use the other two approaches (sequential api and functional api) because they allow for much less customization.

For this issue in particular, I think it has something to do with specifying the batch size in a build statement, like you mentioned. I think I got a little farther with this here. Are you able to reproduce that error?

That's awesome that you have a lot of experience with PyTorch. We have casually wondered whether using PyTorch instead of TF would be better, given that it seems to be more popular in the research community. Because we have invested pretty heavily in this TF code and didn't have any PyTorch experience among us, we hadn't thought about it very seriously. But now that you've joined, I wonder if it'd be worth thinking about it again.

jdiaz4302 commented 2 years ago

Yeah, that's one of the (temporary?) dead ends I've found so far; I don't think I've gotten to the bottom of what it means, though.

jzwart commented 2 years ago

Have you tried adding

model(x_trn_pre) 

right above this line and

model(x_trn_obs) 

right above this line. You will also need to move these lines above the model.compile() call.
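
Roughly, the ordering I have in mind (placeholder data and names; x_trn_pre stands in for the actual pretraining array, and the loss is just a stand-in):

import numpy as np
import tensorflow as tf
from river_dl.rnns import LSTMModel

# placeholder pretraining inputs and targets: (segments, time steps, features/tasks)
x_trn_pre = np.random.randn(2, 10, 4).astype("float32")
y_trn_pre = np.random.randn(2, 10, 1).astype("float32")

model = LSTMModel(hidden_size=5)
model(x_trn_pre)                                            # call once so the layers get built
model.compile(optimizer=tf.optimizers.Adam(), loss="mse")   # placeholder loss
model.fit(x=x_trn_pre, y=y_trn_pre, epochs=1, batch_size=2)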

jdiaz4302 commented 2 years ago

To clarify, I've been working at the function level in an ipynb that simplifies the usage (based on the one you stored in run-pgdl-da, but reduced and modified with the latest river-dl code); that is, I'm just defining the model and trying to use it on some randomly generated data in a bare-bones setting.

But when I add a model(inputs) call prior to model.compile(), it raises the original error message at the model(inputs) line.

jsadler2 commented 2 years ago

Can you share the stripped-down code that you are working with?

jdiaz4302 commented 2 years ago

This is my modification of Jacob's experimentation notebook, which worked for some earlier version or modification of the code (linked earlier via "run-pgdl-da").

Block 1:

from __future__ import print_function, division
import tensorflow as tf
from tensorflow.keras import layers

class LSTMModel(tf.keras.Model):
    def __init__(
        self, hidden_size, num_tasks=1, recurrent_dropout=0, dropout=0,
    ):
        """
        :param hidden_size: [int] the number of hidden units
        :param num_tasks: [int] number of tasks (variables_to_log to be predicted)
        :param recurrent_dropout: [float] value between 0 and 1 for the
        probability of a recurrent element to be zero
        :param dropout: [float] value between 0 and 1 for the probability of an
        input element to be zero
        """
        super().__init__()
        self.hidden_size = hidden_size
        self.num_tasks = num_tasks
        self.rnn_layer = layers.LSTM(
            hidden_size,
            return_sequences=True,
            stateful=True,
            return_state=True,
            recurrent_dropout=recurrent_dropout,
            dropout=dropout,
        )
        self.dense_main = layers.Dense(1, name="dense_main")
        if self.num_tasks == 2:
            self.dense_aux = layers.Dense(1, name="dense_aux")
        self.states = None

    @tf.function
    def call(self, inputs, **kwargs):
        batch_size = tf.shape(inputs)[0]
        h_init = kwargs.get("h_init", tf.zeros([batch_size, self.hidden_size]))
        c_init = kwargs.get("c_init", tf.zeros([batch_size, self.hidden_size]))
        self.rnn_layer.reset_states(states=[h_init, c_init])
        x, h, c = self.rnn_layer(inputs)
        self.states = h, c
        if self.num_tasks == 1:
            main_prediction = self.dense_main(x)
            return main_prediction
        elif self.num_tasks == 2:
            main_prediction = self.dense_main(x)
            aux_prediction = self.dense_aux(x)
            return tf.concat([main_prediction, aux_prediction], axis=2)
        else:
            raise ValueError(
                f"This model only supports 1 or 2 tasks (not {self.num_tasks})"
            )

Block 2

import numpy as np

Block 3

tasks = 1
epochs = 20
batch_size = 2 # is equivalent to number of segments 
time_steps = 10
n_features = 4
hidden_size = 5
return_state = True
lamb = .5 
# create some fake data based on dimensions specified above 
inputs = np.random.randn(batch_size, time_steps, n_features)
y_obs = np.random.randn(batch_size, time_steps, tasks)
weights = np.random.randn(batch_size, time_steps, tasks)
adj_matrix = np.random.randn(batch_size, batch_size)

# commented out from previous run-pgdl-da ex
model_lstm = LSTMModel(hidden_size=hidden_size, 
                      #gradient_correction=False, 
                      #tasks=tasks, 
                      #lamb=1,
                      dropout=0
                      #grad_log_file=None,
                      #return_state=return_state
                      )

Block 4 (raises If a RNN is stateful... error, but you can continue to Block 5 afterwards)

model_lstm(inputs)

Block 5

model_lstm.compile(optimizer=tf.optimizers.Adam(learning_rate=0.3))

Block 6

model_lstm.rnn_layer.build(inputs.shape)

Block 7 (raises the __array__() takes 1 positional argument but 2 were given error)

model_lstm.fit(x = inputs, 
               y = np.concatenate([y_obs, weights], axis=2), 
               epochs = epochs, 
               batch_size = batch_size)
jzwart commented 2 years ago

This might help? https://github.com/tensorflow/tensorflow/issues/46840#issuecomment-872777398 Our PIL version for the container is 8.4, and it looks like downgrading to 8.2 might help:

import PIL 
print(PIL.__version__)
# 8.4.0
jdiaz4302 commented 2 years ago

Their error actually doesn't replicate in my notebook (I'm using the singularity container 2.0 for run-pgdl-da, which has PIL version 8.4.0); the code runs.

[screenshot: attempting to replicate the linked PIL error; the code runs without error]

BUT, I do think looking at more generic issues in other libraries (e.g., numpy) may be promising for this __array__() takes 1 positional argument but 2 were given error.

jzwart commented 2 years ago

Hmm. Yeah, I can't reproduce that error either, but I can reproduce the error at the top of the issue thread.

jdiaz4302 commented 2 years ago

I think I've made some progress/findings that could lead to a fix:

Regarding using reset_states within def call

If I simply comment out the reset_states line from the model code, no errors are raised. That does, however, mean that the model is using the previously calculated states (i.e., from step i-1) rather than starting with zeros (which seems to be your default preference).

Noticing this made me want to compare how Jake's working stateful LSTM uses that method. Rather than calling reset_states in the def call part of the model code, the reservoir project calls reset_states repeatedly in-workflow as needed (for data assimilation).

If I move reset_states into the def call part of the model code for Jake's stateful LSTM, I now get the same If a RNN is stateful, it needs to know its batch size... error. Using model.rnn_layer.build(input_shape = x.shape) resolves the issue for Jake's stateful LSTM, but (as we know) river-dl's stateful LSTM still fails with the __array__() takes 1 positional argument but 2 were given error after using model.rnn_layer.build(input_shape = x.shape). I haven't fixed or fully understood this yet after tinkering with the code and various forums/Google searches, but what does clearly work is removing the following code...

        batch_size = tf.shape(inputs)[0]
        h_init = kwargs.get("h_init", tf.zeros([batch_size, self.hidden_size]))
        c_init = kwargs.get("c_init", tf.zeros([batch_size, self.hidden_size]))
        self.rnn_layer.reset_states(states=[h_init, c_init])

...from the model class and instead performing state resets before using the model to generate predictions. Example:

from __future__ import print_function, division
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class LSTMModel(tf.keras.Model):
    def __init__(
        self, hidden_size, num_tasks=1, recurrent_dropout=0, dropout=0,
    ):
        """
        :param hidden_size: [int] the number of hidden units
        :param num_tasks: [int] number of tasks (variables_to_log to be predicted)
        :param recurrent_dropout: [float] value between 0 and 1 for the
        probability of a recurrent element to be zero
        :param dropout: [float] value between 0 and 1 for the probability of an
        input element to be zero
        """
        super().__init__()
        self.hidden_size = hidden_size
        self.num_tasks = num_tasks
        self.rnn_layer = layers.LSTM(
            hidden_size,
            return_sequences=True,
            stateful=True,
            return_state=True,
            recurrent_dropout=recurrent_dropout,
            dropout=dropout,
        )
        self.dense_main = layers.Dense(1, name="dense_main")
        if self.num_tasks == 2:
            self.dense_aux = layers.Dense(1, name="dense_aux")
        self.states = None

    @tf.function
    def call(self, inputs, **kwargs):
        #batch_size = tf.shape(inputs)[0]
        #h_init = kwargs.get("h_init", tf.zeros([batch_size, self.hidden_size]))
        #c_init = kwargs.get("c_init", tf.zeros([batch_size, self.hidden_size]))
        #self.rnn_layer.reset_states(states=[h_init, c_init])
        x, h, c = self.rnn_layer(inputs)
        self.states = h, c
        if self.num_tasks == 1:
            main_prediction = self.dense_main(x)
            return main_prediction
        elif self.num_tasks == 2:
            main_prediction = self.dense_main(x)
            aux_prediction = self.dense_aux(x)
            return tf.concat([main_prediction, aux_prediction], axis=2)
        else:
            raise ValueError(
                f"This model only supports 1 or 2 tasks (not {self.num_tasks})"
            )

x = np.random.normal(size = (20, 10, 5))
LSTM = LSTMModel(5)
LSTM.rnn_layer.build(input_shape = x.shape)
LSTM.rnn_layer.reset_states(states = [tf.zeros([20, 5]), tf.zeros([20, 5])])
LSTM(x)

To approximate a data assimilation (DA) situation, the above reset_states call also works with tf.random.normal in place of tf.zeros.

Making this kind of change would make the codebase slightly more verbose, requiring reset_states() calls under varying conditions (reset to zeros, carry over the i-1 states, or insert DA-adjusted states), but it would allow this project (and other projects using it as a library) to use stateful LSTMs.
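
For example, a sketch (assumed usage, not current river-dl code) of what the three in-workflow reset conditions could look like with the reset_states-free model above:

import numpy as np
import tensorflow as tf

x = np.random.normal(size=(20, 10, 5))
model = LSTMModel(5)                         # the version above, without reset_states in call()
model.rnn_layer.build(input_shape=x.shape)

# 1) cold start: reset hidden and cell states to zeros before predicting
model.rnn_layer.reset_states(states=[tf.zeros([20, 5]), tf.zeros([20, 5])])
preds = model(x)

# 2) warm start: do nothing; the stateful layer carries the i-1 states forward
preds = model(x)

# 3) data assimilation: overwrite the states with externally adjusted values
h_adj = tf.random.normal([20, 5])            # stand-in for a DA-updated hidden state
c_adj = tf.random.normal([20, 5])            # stand-in for a DA-updated cell state
model.rnn_layer.reset_states(states=[h_adj, c_adj])
preds = model(x)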

jzwart commented 2 years ago

Cool! Confirmed that it works for me when resetting the states using tf.random.normal or tf.zeros, but not when resetting using previous states, which I can't figure out.

x = np.random.normal(size = (20, 10, 5))
LSTM = LSTMModel(5)
LSTM.rnn_layer.build(input_shape = x.shape)
LSTM.rnn_layer.reset_states(states = [tf.zeros([20, 5]), tf.zeros([20, 5])])
LSTM(x)

h, c = LSTM.rnn_layer.states
h.shape
TensorShape([20, 5])
c.shape
TensorShape([20, 5]) 

So far so good, with the h and c states as Tensors of shape [20, 5]. But I get an error when resetting the states to these h and c states:

LSTM.rnn_layer.reset_states(states = [h, c])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_14/3444381920.py in <module>
----> 1 LSTM.rnn_layer.reset_states(states = [h, c])

/opt/venv/reticulate/lib/python3.8/site-packages/tensorflow/python/keras/layers/recurrent.py in reset_states(self, states)
    969                   (batch_size, state)) + ', found shape=' + str(value.shape))
    970         set_value_tuples.append((state, value))
--> 971       backend.batch_set_value(set_value_tuples)
    972 
    973   def get_config(self):

/opt/venv/reticulate/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

/opt/venv/reticulate/lib/python3.8/site-packages/tensorflow/python/keras/backend.py in batch_set_value(tuples)
   3802   if ops.executing_eagerly_outside_functions():
   3803     for x, value in tuples:
-> 3804       x.assign(np.asarray(value, dtype=dtype_numpy(x)))
   3805   else:
   3806     with get_graph().as_default():

/opt/venv/reticulate/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

TypeError: __array__() takes 1 positional argument but 2 were given
jdiaz4302 commented 2 years ago

@jzwart Good idea to test that. If I print all of those that worked and didn't work (i.e., tf.zeros([20, 5]), tf.random.normal([20, 5]), h, and c), I notice that the ones that didn't work (h and c) are tf.Variable objects rather than tf.Tensor objects. Using .value() on the tf.Variables seems to fix this:

x = np.random.normal(size = (20, 10, 5))
LSTM = LSTMModel(5)
LSTM.rnn_layer.build(input_shape = x.shape)
LSTM.rnn_layer.reset_states(states = [tf.zeros([20, 5]), tf.zeros([20, 5])])
LSTM(x)
h, c = LSTM.rnn_layer.states
LSTM.rnn_layer.reset_states(states = [h.value(), c.value()])

Unfortunately this fix doesn't apply to our original issue with .reset_states() in the def call part of the model code, because h_init and c_init are already created via tf.zeros and are already tf.Tensor objects (confirmed).
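
A quick type check (continuing the snippet above) shows the difference:

h, c = LSTM.rnn_layer.states
print(type(h))             # a tf.Variable (ResourceVariable), which reset_states chokes on
h_init = tf.zeros([20, 5])
print(type(h_init))        # an EagerTensor, which reset_states accepts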