Svdvoort / PrognosAIs_glioma

Predicting genetics and providing segmentation of glioma
Apache License 2.0

Gradients do not exist for variables during training #7

Closed fahadahmedkhokhar closed 2 weeks ago

fahadahmedkhokhar commented 2 weeks ago

I am using your custom_definition.py as the model for training, and I get this error when training starts.

WARNING:tensorflow:Gradients do not exist for variables ['sync_batch_normalization_8/gamma:0', 'sync_batch_normalization_8/beta:0', 'conv3d_21/kernel:0', 'conv3d_21/bias:0'] when minimizing the loss.
2024-08-26 14:41:21 prognosais WARNING Gradients do not exist for variables ['sync_batch_normalization_8/gamma:0', 'sync_batch_normalization_8/beta:0', 'conv3d_21/kernel:0', 'conv3d_21/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['sync_batch_normalization_8/gamma:0', 'sync_batch_normalization_8/beta:0', 'conv3d_21/kernel:0', 'conv3d_21/bias:0'] when minimizing the loss.
2024-08-26 14:41:23 prognosais WARNING Gradients do not exist for variables ['sync_batch_normalization_8/gamma:0', 'sync_batch_normalization_8/beta:0', 'conv3d_21/kernel:0', 'conv3d_21/bias:0'] when minimizing the loss.
2024-08-26 14:41:28.976749: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 14224387968 exceeds 10% of free system memory.
Killed

Moreover, when TensorFlow is compiled with GPU support, there is also a memory leak. Could you please send me the requirements.txt file you used when training your model?

Svdvoort commented 2 weeks ago

First, I would like to ask you not to open new issues when you already have an open issue about the same problem. I will try to help you to some extent, even though, as already mentioned here as well, PrognosAIs is no longer supported. You are probably better off using another package; otherwise, please indicate why specifically you want to run this code (for example, to reproduce our exact experiments) so I can try to help with that.

Another note as there seem to be spammers that make use of this issue: do not download anything from any links.

Some questions:

  1. In #6 I asked you what you were trying to do, but the information you provided was not very clear. Can you please share the script/code that you are trying to run?
  2. Please provide more information about the hardware you are using to train the model. Is it running on CPU or on GPU? And what kind of CPU or GPU? From your error log I suspect this is not really a memory leak, but simply that your GPU does not have enough memory. I suspect you need at least 12GB of GPU memory, with appropriate CUDA capabilities. Anything less than an RTX2080Ti will not work.
  3. If you are having issues with the requirements, you can try the Docker image: the requirements are pinned there and known to work.
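
To put the allocation warning from the log in perspective, here is a minimal sketch that extracts the requested size from that line and converts it to GiB. The helper name `allocation_gib` is hypothetical; the log line is copied from the first comment. Note that the warning comes from `cpu_allocator_impl.cc`, so this particular allocation concerns host (system) memory, and a bare `Killed` afterwards usually, though not necessarily, means the operating system stopped the process for running out of RAM.

```python
import re

# Warning line copied verbatim from the log in the first comment.
LOG_LINE = (
    "2024-08-26 14:41:28.976749: W tensorflow/core/framework/"
    "cpu_allocator_impl.cc:80] Allocation of 14224387968 exceeds 10% "
    "of free system memory."
)


def allocation_gib(line: str) -> float:
    """Return the allocation size named in a TensorFlow warning, in GiB."""
    match = re.search(r"Allocation of (\d+) exceeds", line)
    if match is None:
        raise ValueError("no allocation size found in log line")
    return int(match.group(1)) / 2**30


if __name__ == "__main__":
    print(f"Requested allocation: {allocation_gib(LOG_LINE):.2f} GiB")
```

A single request of this size is already larger than the 8GB to 12GB of GPU memory discussed in this thread, which is consistent with a hardware limit rather than a leak.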

fahadahmedkhokhar commented 2 weeks ago

I have some clinical data and I am trying to train the model using my customized dataset. I am using this code for training.

from PrognosAIs.Model.Architectures.UNet import Unet
from tensorflow.keras.layers import Concatenate, BatchNormalization, Conv3D, GlobalAveragePooling3D, Dense, Activation, ReLU, Conv3DTranspose, GlobalMaxPool3D
from tensorflow.keras import Model, Input
import tensorflow as tf
import tensorflow_addons as tfa
import numpy as np
import logging

class PSNET_3D(Unet):
    dims = 3

    def make_inputs(
        self, input_shapes: dict, input_dtype: str, squeeze_inputs: bool = True
    ):
        inputs = {}
        for i_input_name, i_input_shape in input_shapes.items():
            inputs[i_input_name] = Input(shape=i_input_shape, name=i_input_name, dtype="float16")

        if squeeze_inputs and len(inputs) == 1:
            inputs = list(inputs.values())[0]

        return inputs

    def get_init_filter_size(self):
        if self.model_config is not None and "filter_size" in self.model_config:
            return self.model_config["filter_size"]
        else:
            return 7

    def get_init_stride_size(self):
        if self.model_config is not None and "stride_size" in self.model_config:
            return self.model_config["stride_size"]
        else:
            return 3

    def make_norm_layer(self, layer):
        if self.model_config is not None and "norm_layer" in self.model_config:
            norm_setting = self.model_config["norm_layer"]
            if norm_setting == "batch":
                return BatchNormalization()(layer)
            elif norm_setting == "batch_sync":
                return tf.keras.layers.experimental.SyncBatchNormalization()(layer)
            elif norm_setting == "instance":
                return tfa.layers.InstanceNormalization()(layer)
            else:
                return layer
        else:
            return layer

    def get_stride_activations(self):
        if self.model_config is not None and "stride_activation" in self.model_config:
            return self.model_config["stride_activation"]
        else:
            return "linear"

    def get_output_type(self):
        if self.model_config is not None and "output_type" in self.model_config:
            return self.model_config["output_type"]
        else:
            return "softmax"

    def make_global_pool_layer(self, layer):
        if self.model_config is not None and "global_pool" in self.model_config:
            layer_setting = self.model_config["global_pool"]
            if layer_setting == "average":
                layer = GlobalAveragePooling3D()(layer)
            elif layer_setting == "max":
                layer = GlobalMaxPool3D()(layer)
            else:
                layer = GlobalAveragePooling3D()(layer)
        else:
            layer = GlobalAveragePooling3D()(layer)

        return layer

    def get_gap_after_dropout(self):
        if self.model_config is not None and "gap_after_dropout" in self.model_config:
            return self.model_config["gap_after_dropout"]
        else:
            return False

    def get_final_dense_units(self):
        if self.model_config is not None and "dense_units" in self.model_config:
            return self.model_config["dense_units"]
        else:
            return 512

    def get_kernel_regularizer(self):
        if self.model_config is not None and "l2_norm" in self.model_config:
            return tf.keras.regularizers.l2(l=self.model_config["l2_norm"])
        else:
            return None

    def get_use_additional_convs(self):
        if self.model_config is not None and "convs" in self.model_config:
            return self.model_config["convs"]
        else:
            return False

    def get_use_upsample_genetic_features(self):
        if self.model_config is not None and "upsample_features" in self.model_config:
            return self.model_config["upsample_features"]
        else:
            return True

    def get_final_conv_layers(self):
        if self.model_config is not None and "final_conv_layers" in self.model_config:
            return self.model_config["final_conv_layers"]
        else:
            return 256

    def create_model(self):
        self.init_dimensionality(self.dims)
        self.inputs = self.make_inputs(self.input_shapes, self.input_data_type)

        self.N_filters = self.get_number_of_filters()
        self.depth = self.get_depth()
        filter_size = self.get_init_filter_size()
        stride_size = self.get_init_stride_size()
        activations = self.get_stride_activations()
        output_type = self.get_output_type()
        gap_after_dropout = self.get_gap_after_dropout()
        final_dense_unit = self.get_final_dense_units()
        kernel_regularizer = self.get_kernel_regularizer()
        additional_convs = self.get_use_additional_convs()
        upsample_features = self.get_use_upsample_genetic_features()
        final_conv_layers = self.get_final_conv_layers()

        head = self.inputs
        skip_layers = []
        gap_layers = []

        for i_depth in range(self.depth - 1):
            head = self.get_conv_block(head, self.N_filters * (2 ** i_depth), activation=activations, kernel_regularizer=kernel_regularizer)

            if i_depth == 0:
                if not gap_after_dropout:
                    gap_layers.append(self.make_global_pool_layer(head))
                # head = self.make_norm_layer(head)
                head = self.make_dropout_layer(head)
                if gap_after_dropout:
                    gap_layers.append(self.make_global_pool_layer(head))
                skip_layers.append(head)
                head = Conv3D(self.N_filters * (2 ** i_depth), filter_size, strides=stride_size, padding="same", activation=activations)(head)
            else:
                head = self.get_conv_block(head, self.N_filters * (2 ** i_depth), activation=activations, kernel_regularizer=kernel_regularizer)
                if not gap_after_dropout:
                    gap_layers.append(self.make_global_pool_layer(head))
                # head = self.make_norm_layer(head)
                head = self.make_dropout_layer(head)
                if gap_after_dropout:
                    gap_layers.append(self.make_global_pool_layer(head))
                skip_layers.append(head)
                head = self.get_padding_block(head)
                head = self.get_pool_block(head)
            head = self.make_norm_layer(head)

        head = self.get_conv_block(head, self.N_filters * (2 ** (self.depth - 1)), activation=activations, kernel_regularizer=kernel_regularizer)
        head = self.get_conv_block(head, self.N_filters * (2 ** (self.depth - 1)), activation=activations, kernel_regularizer=kernel_regularizer)
        # head = self.make_norm_layer(head)
        if not gap_after_dropout:
            gap_layers.append(self.make_global_pool_layer(head))
        head = self.make_dropout_layer(head)
        if gap_after_dropout:
            gap_layers.append(self.make_global_pool_layer(head))
        head = self.make_norm_layer(head)
        head_lowest = head

        for i_depth in range(self.depth - 2, -1, -1):
            if i_depth == 0:
                head = Conv3DTranspose(self.N_filters * (2 ** i_depth), filter_size, strides=stride_size, padding="same", activation=activations)(head)
            else:
                head = self.get_upsampling_block(head, self.N_filters * (2 ** i_depth), activation=activations, kernel_regularizer=kernel_regularizer)

            head = self.get_cropping_block(skip_layers[i_depth], head)
            head = Concatenate()([skip_layers[i_depth], head])
            head = self.get_conv_block(head, self.N_filters * (2 ** i_depth), activation=activations, kernel_regularizer=kernel_regularizer)
            head = self.get_conv_block(head, self.N_filters * (2 ** i_depth), activation=activations, kernel_regularizer=kernel_regularizer)
            # head = self.make_norm_layer(head)
            if not gap_after_dropout and upsample_features:
                gap_layers.append(self.make_global_pool_layer(head))
            head = self.make_dropout_layer(head)
            if gap_after_dropout and upsample_features:
                gap_layers.append(self.make_global_pool_layer(head))
            head = self.make_norm_layer(head)

        # head = self.make_dropout_layer(head)
        if output_type == "softmax":
            head = self.conv_func(
                        filters=2,
                        kernel_size=1,
                        padding="same"
            )(head)
            out_mask = Activation(activation="softmax", dtype="float32", name="MASK")(head)
        elif output_type == "sigmoid":
            head = self.conv_func(
            filters=1,
            kernel_size=1,
            padding="same"
            )(head)
            out_mask = Activation(activation="sigmoid", dtype="float32", name="MASK")(head)

        genetic_features = Concatenate()(gap_layers)
        # if not gap_after_dropout:
        genetic_features = self.make_dropout_layer(genetic_features)

        branch_IDH = Dense(final_dense_unit, activation="relu")(genetic_features)

        branch_IDH = Dense(2)(branch_IDH)
        out_IDH = Activation(activation="softmax", dtype="float32", name="IDH")(branch_IDH)

        branch_1p19q = Dense(final_dense_unit, activation="relu")(genetic_features)

        branch_1p19q = Dense(2)(branch_1p19q)
        out_1p19q = Activation(activation="softmax", dtype="float32", name="1p19q")(branch_1p19q)

        branch_grade = Dense(final_dense_unit, activation="relu")(genetic_features)

        branch_grade = Dense(3)(branch_grade)
        out_grade = Activation(activation="softmax", dtype="float32", name="Grade")(branch_grade)

        predictions = [out_IDH, out_1p19q, out_grade, out_mask]

        model = Model(inputs=self.inputs, outputs=predictions)

        return model

class AdamW(tfa.optimizers.AdamW):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

class AverageEarlyStopping(tf.keras.callbacks.Callback):

  def __init__(self,
               monitor='val_loss',
               min_delta=0,
               patience=0,
               verbose=0,
               mode='auto',
               baseline=None,
               restore_best_weights=False,
               average_fraction=5):
    super().__init__()

    self.monitor = monitor
    self.patience = patience
    self.verbose = verbose
    self.baseline = baseline
    self.min_delta = abs(min_delta)
    self.wait = 0
    self.stopped_epoch = 0
    self.restore_best_weights = restore_best_weights
    self.best_weights = None
    self.average_fraction = average_fraction
    self.metric_history = []

    if mode not in ['auto', 'min', 'max']:
      logging.warning('EarlyStopping mode %s is unknown, '
                      'fallback to auto mode.', mode)
      mode = 'auto'

    if mode == 'min':
      self.monitor_op = np.less
    elif mode == 'max':
      self.monitor_op = np.greater
    else:
      if 'acc' in self.monitor:
        self.monitor_op = np.greater
      else:
        self.monitor_op = np.less

    if self.monitor_op == np.greater:
      self.min_delta *= 1
    else:
      self.min_delta *= -1

  def on_train_begin(self, logs=None):
    # Allow instances to be re-used
    self.wait = 0
    self.stopped_epoch = 0
    if self.baseline is not None:
      self.best = self.baseline
    else:
      # np.inf is the spelling that works across NumPy versions (np.Inf was removed in NumPy 2.0)
      self.best = np.inf if self.monitor_op == np.less else -np.inf

  def on_epoch_end(self, epoch, logs=None):
    current = self.get_monitor_value(logs)
    if current is None:
        return

    self.metric_history.append(current)

    if epoch > 5:
        if len(self.metric_history) > 5:
            self.metric_history.pop(0)
        current = np.mean(self.metric_history)
        # print("Best mean: {mean}, current mean: {current}".format(mean=self.best, current=current))
        print("Validation loss changed by {change}".format(change=current - self.best))
        if self.monitor_op(current - self.min_delta, self.best):

            self.best = current
            self.wait = 0
            if self.restore_best_weights:
                self.best_weights = self.model.get_weights()
        else:
            self.wait += 1

            if self.wait >= self.patience:
                self.stopped_epoch = epoch
                self.model.stop_training = True
                if self.restore_best_weights:
                    if self.verbose > 0:
                        print('Restoring model weights from the end of the best epoch.')
                    self.model.set_weights(self.best_weights)

  def on_train_end(self, logs=None):
    if self.stopped_epoch > 0 and self.verbose > 0:
      print('Epoch %05d: early stopping' % (self.stopped_epoch + 1))

  def get_monitor_value(self, logs):
    logs = logs or {}
    monitor_value = logs.get(self.monitor)
    if monitor_value is None:
      logging.warning('Early stopping conditioned on metric `%s` '
                      'which is not available. Available metrics are: %s',
                      self.monitor, ','.join(list(logs.keys())))
    return monitor_value

  1. My hardware is as follows: 12th Gen Core i5 x 16 CPU and an RTX 4060 GPU.
  2. When I tried Docker, I also ran into issues while installing it.

Svdvoort commented 2 weeks ago
  1. Thanks for sharing this. This only contains the model definitions though, not the actual training script. Somewhere you must load your data and call .fit() on the model, I don't see that in this code. Could you share the script where you initialize and fit the model?
  2. A base RTX4060 is probably insufficient, as it only has 8GB of VRAM. An RTX4060Ti with 16GB might work. If you do have an RTX4060Ti with 16GB of VRAM: did you turn on mixed precision while fitting the model? If you have a base RTX4060, I'm afraid the model is simply too big for the GPU. In that case you can try reducing the number of convolutional layers.
  3. Do you face issues installing Docker or while running the PrognosAIs docker?

fahadahmedkhokhar commented 2 weeks ago
  1. I am using this code

    def train_model(self) -> str:
    
        with self.distribution_strategy.scope():
            train_data = self.train_data_generator.get_tf_dataset()
            if self.do_validation:
                validation_data = self.validation_data_generator.get_tf_dataset()
            else:
                validation_data = None
    
        epochs = self.config.get_N_epoch()
        callbacks = self.setup_callbacks()
        logging.info("Starting training")
        logging.debug(
            (
                "Training with following parameters:\n"
                "Train data: {train}\n"
                "Validation data: {val}\n"
                "Epochs: {epoch}\n"
                "Callbacks: {callback}\n"
                "Class weights: {weights}\n"
                "Steps per epoch: {steps}\n"
                "Validation steps: {val_steps}\n"
            ).format(
                train=train_data,
                val=validation_data,
                epoch=epochs,
                callback=callbacks,
                weights=self.class_weights,
                steps=self.steps_per_epoch,
                val_steps=self.validation_steps,
            ),
        )
    
        self.model.fit(
            train_data,
            validation_data=validation_data,
            epochs=epochs,
            callbacks=callbacks,
            shuffle=False,
            class_weight=self.class_weights,
            verbose=1,
            steps_per_epoch=self.steps_per_epoch,
            validation_steps=self.validation_steps,
        )
    
        logging.info("Finished training")
        if self.worker_index == 0:
            self.model.save(self.model_save_file)
            logging.info("Model saved to {save_file}".format(save_file=self.model_save_file))
        else:
            # We need to save the model for the other workers as well, otherwise
            # we run into errors. We delete it right away because we don't
            # actually need the other workers' models.
            model_save_file = ".".join(
                [
                    os.path.join(self.output_folder, self.save_name + "_" + str(self.worker_index)),
                    PrognosAIs.Constants.HDF5_EXTENSION,
                ],
            )
    
            self.model.save(model_save_file)
            os.remove(model_save_file)
        return self.model_save_file
  2. I am using the base RTX4060.
  3. With Docker, I run into problems when I try to make changes to the PrognosAIs library inside the container.
Svdvoort commented 2 weeks ago

Based on the error message and this information, I'm afraid the issue is that your GPU cannot handle the training: the model is quite large and the RTX4060 simply does not have enough memory. One thing you can try is to reduce the batch size to 1, if you haven't already done so, but I'm afraid even that is probably not enough to make the model fit. It's not so much the batches as the model that takes up memory.

Another option is to make sure that your GPU supports mixed precision and that training takes advantage of it. Check whether you see the message "GPU supports a mixed float16 policy" or "GPU support float16 precision policy" in the logs. Then try setting the float policy in the config to "mixed" or "float16" to force it.
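
To illustrate why the float policy matters, here is a back-of-the-envelope sketch of the memory a single activation tensor needs in float32 versus float16. The tensor shape is purely an assumption for illustration, not the actual PrognosAIs input or feature-map size, and `activation_bytes` is a hypothetical helper.

```python
from math import prod

# Size of one element per dtype, in bytes.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2}


def activation_bytes(shape, dtype):
    """Memory in bytes of one dense tensor with the given shape and dtype."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


# Assumed shape: (batch, depth, height, width, channels). Illustrative only.
shape = (1, 128, 128, 128, 32)

for dtype in ("float32", "float16"):
    mib = activation_bytes(shape, dtype) / 2**20
    print(f"{dtype}: {mib:.0f} MiB")
```

Under a mixed policy the weights are typically still stored in float32 while computations run in float16, so the saving applies mainly to the activations, which a 3D network stores for every layer during training.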

Unfortunately, that's the only advice I can give you. For evaluation you should be fine running the model, but training the model required 8 RTX2080Ti cards, each with 11GB of memory. Training on a single RTX4060 is therefore going to be a challenge if you want to train the exact same model. You can of course reduce the number of filters per layer or the number of layers, but then the model is no longer the same. In conclusion: this is not a problem with the code but a hardware limitation, so I'm closing the issue.
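
As a rough sketch of what reducing the number of filters does: in the model definition posted above, the filter count at each level is `N_filters * 2 ** i_depth`. The snippet below counts the weights of one 3D convolution per level using the standard formula (kernel volume times input channels times output channels, plus one bias per output channel). The base filter counts, depth, kernel size, and input channels are assumed values for illustration, the helper names are hypothetical, and the real model has more layers per level.

```python
def conv3d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a single 3D convolution with a cubic kernel."""
    return kernel_size**3 * in_channels * out_channels + out_channels


def encoder_params(n_filters, depth, kernel_size=3, in_channels=1):
    """Sum the parameters of one convolution per level, doubling filters per level."""
    total = 0
    channels = in_channels
    for i_depth in range(depth):
        filters = n_filters * 2**i_depth  # same scaling rule as in create_model
        total += conv3d_params(kernel_size, channels, filters)
        channels = filters
    return total


# Assumed base filter counts and depth, for comparison only.
for n_filters in (32, 16):
    print(n_filters, encoder_params(n_filters, depth=4))
```

In this simplified count, halving the base filter count cuts the number of weights to roughly a quarter, while the activation memory per level shrinks roughly in proportion to the channel count. As noted above, the resulting network is no longer the same model.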