ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege. #1959

Closed · hugo-mrc closed this 23 hours ago

hugo-mrc commented 1 year ago

Issue Type

Bug

Source

source

Tensorflow Version

tensorflow-rocm 2.2

Custom Code

Yes

OS Platform and Distribution

Ubuntu 20.04

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

ROCm v3.5

GPU model and memory

2 x RX 480 4GB

Current Behaviour?

When switching my LSTM layers from the 'relu' activation function to 'tanh', I get the following error: `Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege.`

Also, when this error doesn't occur (i.e. when the program works), I get these warnings printed at the beginning:
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
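
For what it's worth, this matches the fused-kernel selection in Keras: an LSTM layer only uses the fast cuDNN/MIOpen kernel when activation='tanh' (plus the other default criteria), while activation='relu' always falls back to the generic GPU kernel, hence the warnings above. Below is a minimal sketch (my untested simplification, not a confirmed root cause) that isolates just the activation choice; note it omits the MultiWorkerMirroredStrategy scope from the full script, which may also be needed to trigger the fault:

import numpy as np
import tensorflow as tf

window_size = 5

for act in ('tanh', 'relu'):
    # 'tanh' meets the fused cuDNN/MIOpen kernel criteria (the crashing case);
    # 'relu' forces the generic fallback kernel (the case that works, with warnings).
    model = tf.keras.models.Sequential([
        tf.keras.Input(shape=(window_size, 2)),
        tf.keras.layers.LSTM(units=32, activation=act, return_sequences=True),
        tf.keras.layers.LSTM(units=64, activation=act, return_sequences=False),
        tf.keras.layers.Dense(units=3, activation='linear'),
    ])
    model.compile(loss='mse', optimizer='adam')
    x = np.random.rand(32, window_size, 2).astype(np.float32)
    y = np.random.rand(32, 3).astype(np.float32)
    print(f"fitting with activation={act}")
    model.fit(x, y, epochs=1, verbose=0)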

Standalone code to reproduce the issue

import os
import random
import time

import numpy as np
import tensorflow as tf
from tqdm import tqdm
from collections import deque

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

print(tf.config.experimental.list_physical_devices("GPU"))
mirrored_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)

window_size = 5
episodes = 20
batch_size = 32
NAME = f"Blackstonev1-LSTM-32x64x64-{int(time.time())}"
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=os.path.join("logs", NAME))  # forward slashes on Linux; "logs\{}" creates a literal "logs\..." directory

class AIAgent:
    def __init__(self, state_size, action_space=3, model_name=NAME):  # Stay, Buy, Sell
        self.state_size = state_size
        self.action_space = action_space
        self.memory = deque(maxlen=2000)
        self.inventory = []
        self.margin_inventory = []
        self.model_name = model_name

        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_final = 0.05
        self.epsilon_decay = 0.995

        self.model = self.model_builder()

    def model_builder(self):
        with mirrored_strategy.scope():
            model = tf.keras.models.Sequential()

            model.add(tf.keras.Input(shape=(window_size, 2)))

            model.add(tf.keras.layers.LSTM(units=32, activation='relu', return_sequences=True))
            model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=True))
            model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=False))
            model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
            model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

        return model

    def trade(self, state):
        rdm = random.random()
        if rdm <= self.epsilon:
            rdm_act = random.randrange(self.action_space)
            print(f"random: {rdm_act}")
            return rdm_act

        actions = self.model.predict(state)
        argmax = np.argmax(actions[0])
        print(f'model: {argmax}')
        return argmax

    def batch_train(self, batch_size):
        batch = []
        # take the most recent `batch_size` experiences from memory
        # (the original `- batch_size + 1` start index yielded only batch_size - 1 items)
        for i in range(len(self.memory) - batch_size, len(self.memory)):
            batch.append(self.memory[i])

        for state, action, reward, next_state, done in batch:
            if not done:
                reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            target = self.model.predict(state)
            target[0][action] = reward

            self.model.fit(state, target, epochs=1, verbose=0, callbacks=[tensorboard])

        if self.epsilon > self.epsilon_final:
            self.epsilon *= self.epsilon_decay

def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1

    if starting_id >= 0:
        windowed_data = data[starting_id:timestep + 1]
    else:
        # pad on the left with copies of the first element when the window
        # extends before the start of the data
        windowed_data = -starting_id * [data[0]] + list(data[0:timestep + 1])

    state = windowed_data

    return np.array([state])

def main(batch_size, window_size, episodes):
    data = load_data(stock_name)  # placeholder: replace with your own data; each element is a length-2 feature vector
    data_samples = len(data) - 1
    agent = AIAgent(window_size)
    agent.model.summary()

    for episode in range(1, episodes + 1):
        print("Episode: {}/{}".format(episode, episodes))
        state = state_creator(data, 0, window_size)

        total_profit = 0
        agent.inventory = []

        for t in tqdm(range(data_samples)):
            action = agent.trade(state)

            next_state = state_creator(data, t + 1, window_size)
            reward = 0

            # NOTE: using `continue` here would skip the memory append below,
            # so the agent would never train; `pass` keeps the loop body running.
            if action == 1:
                pass  # buy logic elided
            elif action == 2:
                pass  # sell logic elided
            elif action == 0:
                pass  # stay logic elided

            if t == data_samples - 1:
                done = True
            else:
                done = False

            agent.memory.append((state, action, reward, next_state, done))
            state = next_state

            if len(agent.memory) > batch_size:
                agent.batch_train(batch_size)

        agent.model.save(f"{agent.model_name}_{episode}.h5")

if __name__ == "__main__":
    main(batch_size, window_size, episodes)

Relevant log output

No response

DanielWicz commented 1 year ago

I also have a similar problem when tensorflow-rocm is part of a larger package, but I can't reproduce it in isolation.

schung-amd commented 4 days ago

Hi @hugo-mrc @DanielWicz, sorry for the long delay. Are you still experiencing this issue on current ROCm/hardware?

schung-amd commented 23 hours ago

Closing for now as the software versions in the issue are quite stale. Feel free to comment if you are still experiencing this issue and we can reopen.