facebookresearch / off-belief-learning

Implementation of the Off Belief Learning algorithm.

obl1.sh script Freezing #3

Closed ravihammond closed 2 years ago

ravihammond commented 2 years ago

I've successfully trained with the provided iql.sh, belief.sh, and belief_obl0.sh scripts, but when running obl1.sh, the training freezes after a few hours. I've seen it freeze at different times (~300, ~400, ~500 epochs). I've put a bunch of prints in the selfplay.py training loop, but it appears to freeze at different, unrelated places in the code. To me, this smells like a multi-threading issue. When it freezes there is no error message in the terminal, and I can't ctrl-c to exit the program - it's completely unresponsive. Also, when it freezes, the program still holds on to all of the RAM and VRAM, which only gets cleared if I kill the terminal. Neither RAM nor VRAM has run out of memory: RAM is sitting at ~80% and VRAM at <50%.

Software inside a docker container:

Hardware:

My hardware forces me to use newer CUDA and PyTorch versions.

To account for having fewer GPUs, I've moved the belief models and the acting onto the same GPUs (maybe this could be contributing to the freezing). Here is the obl1.sh script I'm using to train:

python selfplay.py \
       --save_dir exps/obl1 \
       --num_thread 24 \
       --num_game_per_thread 24 \
       --sad 0 \
       --act_base_eps 0.1 \
       --act_eps_alpha 7 \
       --lr 6.25e-05 \
       --eps 1.5e-05 \
       --grad_clip 5 \
       --gamma 0.999 \
       --seed 2254257 \
       --batchsize 128 \
       --burn_in_frames 10000 \
       --replay_buffer_size 100000 \
       --epoch_len 1000 \
       --num_epoch 1500 \
       --num_player 2 \
       --rnn_hid_dim 512 \
       --multi_step 1 \
       --train_device cuda:0 \
       --act_device cuda:1,cuda:2 \
       --num_lstm_layer 2 \
       --boltzmann_act 0 \
       --min_t 0.01 \
       --max_t 0.1 \
       --off_belief 1 \
       --num_fict_sample 10 \
       --belief_device cuda:1,cuda:2 \
       --belief_model exps/belief_obl0/model0.pthw \
       --load_model None \
       --net publ-lstm

This is what I plan to do moving forward to attempt to fix the issue:

Hopefully it won't take too much work to identify the freeze, but if you could share any insight into what might be causing it, I'd be very grateful!
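
As a first step, one generic way to see where the hung process is stuck (a standard diagnostic, nothing specific to this codebase; <PID> is a placeholder) is to dump the stacks of its threads:

# find the PID of the frozen selfplay.py process
ps aux | grep selfplay.py

# dump the Python stack of every thread (pip install py-spy);
# a thread blocked on a lock or inside a CUDA call usually points at the deadlock
py-spy dump --pid <PID>

# alternatively, attach gdb and print the C++ thread backtraces
gdb -p <PID> -batch -ex "thread apply all bt"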

Thanks so much :)

hengyuan-hu commented 2 years ago

Frankly, I have never successfully compiled this code with PyTorch 1.7.1; it always gives me some pybind errors. I am not sure how you compiled it, but an incompatible library may cause silent deadlocks due to memory being freed illegally. The latest PyTorch I got working was 1.7.0, which then had a problem; the fix is discussed here: https://github.com/facebookresearch/hanabi_SAD/issues/20

We have been using this code for years now and it should not have any bug that causes freezing. Unfortunately, we have only tested it on our fairly old library stack. I tend to think it is a compilation/library compatibility issue, but I may be wrong.

Sorry for not being able to provide a direct solution here. Regarding the training config, it may be better to put --act_device on cuda:1 and --belief_device on cuda:2, although that should not be the cause of the freezing. --num_game_per_thread depends on how fast a single core is, so you can make it larger; it does not have to be the same as --num_thread.
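
Concretely, that device split only changes these two flags in your obl1.sh above (everything else stays the same; whether it has any effect on the freeze is untested):

       --act_device cuda:1 \
       --belief_device cuda:2 \

--num_game_per_thread could, for example, be raised to 48 or 80 if a single core keeps up.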

ravihammond commented 2 years ago

Hey Hengyuan, thanks for the swift response!

I also ran into the same errors as you; I couldn't compile it due to pybind errors. To fix these, I changed the pybind version to the latest "stable" branch, commit acae930123bcd331aff73a30e4fb7e2103fd7fca. If you're right, the pybind library may be incompatible and I might be experiencing a silent deadlock, which would mean that pulling your OBL codebase apart won't solve the issue. There are some new updates to the "stable" branch of pybind; I might try updating to that and re-building. If that doesn't work, I'll try building with progressively older versions of pybind.
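
For reference, pinning that commit looked roughly like this (assuming pybind11 is vendored as a git submodule under third_party/pybind11; adjust the path for your checkout):

cd third_party/pybind11
git fetch origin
git checkout acae930123bcd331aff73a30e4fb7e2103fd7fca
cd ../..
# then wipe the old build directory and re-run cmake/make so the new headers are picked up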

Regarding the facebookresearch/hanabi_SAD#20 issue, I ran into that problem as well, and I've already applied the solutions suggested in that discussion to fix it. I also had another annoying bug appear that I managed to fix; if you're interested in hearing about it, maybe we could chat about it in a separate issue.

To get OBL to compile successfully, I tried different combinations of Python and PyTorch that my hardware allowed. If you're interested, here is the Dockerfile I used to get it to compile:

FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get -y upgrade && apt-get install -y apt-utils

# Install some basic utilities
RUN apt-get install -y \
    net-tools iputils-ping \
    build-essential cmake git \
    curl wget \
    vim \
    zip p7zip-full p7zip-rar bzip2 \
    ca-certificates \
    imagemagick ffmpeg \
    libopenmpi-dev libomp5 \
    sudo \
    libx11-6 \
    && rm -rf /var/lib/apt/lists/*

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV CONDA_AUTO_UPDATE_CONDA=false
ENV PATH /opt/conda/bin:$PATH

# Use Python 3.7 (Miniconda py37_4.10.3)
ARG CONDA_VERSION=py37_4.10.3

# Install conda
RUN set -x && \
    UNAME_M="$(uname -m)" && \
    if [ "${UNAME_M}" = "x86_64" ]; then \
        MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-x86_64.sh"; \
    elif [ "${UNAME_M}" = "s390x" ]; then \
        MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-s390x.sh"; \
    elif [ "${UNAME_M}" = "aarch64" ]; then \
        MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-aarch64.sh"; \
    elif [ "${UNAME_M}" = "ppc64le" ]; then \
        MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-ppc64le.sh"; \
    fi && \
    wget "${MINICONDA_URL}" -O miniconda.sh -q && \
    mkdir -p /opt && \
    sh miniconda.sh -b -p /opt/conda && \
    rm miniconda.sh && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc && \
    find /opt/conda/ -follow -type f -name '*.a' -delete && \
    find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
    /opt/conda/bin/conda clean -afy

# Install python libraries
RUN /opt/conda/bin/conda install -y \
    numpy \
    jupyter \
    seaborn plotly \
    scikit-learn scikit-image \
    dask dask-image \
    beautifulsoup4

# Install some more libraries
RUN /opt/conda/bin/conda install -yc conda-forge \
    pandas matplotlib \
    ffmpeg \
    tqdm \
    cmake \
    xgboost lightgbm catboost \
    mlxtend \
    shap \
    uvicorn starlette aiohttp

# Install Pytorch 1.7.1
RUN pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

RUN pip install \
    psutil

RUN echo "source setup_conda.bash" >> ~/.bashrc 

WORKDIR /app

CMD ["python3"]
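
For completeness, the image can be built and started with something along these lines (image tag and mount path are placeholders; --ipc=host gives PyTorch's worker processes enough shared memory):

docker build -t obl-cu110 .

docker run --rm -it --gpus all --ipc=host \
    -v /path/to/off-belief-learning:/app \
    obl-cu110 bash
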
hengyuan-hu commented 2 years ago

Thanks for the detailed comments. This is impressive! I will try to install it with PyTorch 1.7.1 myself. Do you mind writing up the other bug you encountered as an issue so that I can also try running it with 1.7.1?

A few other things worth trying:

1) Try a newer CUDA version. I have previously encountered deadlocks due to CUDA internal problems; switching to a different (newer or older) version may fix it.

2) As a last resort, to fully rule out PyTorch/pybind problems, you can compile PyTorch 1.5.1 from source against your custom CUDA version, as instructed here: https://github.com/pytorch/pytorch#from-source. It is not hard if you have permission to install packages, and it takes roughly 20 minutes with an SSD.
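
For 2), a rough sketch of the from-source build following those instructions (run inside the conda environment; the tag is whatever release you want to pin):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.5.1
git submodule sync
git submodule update --init --recursive

# builds against the CUDA toolkit found in the environment and installs into the active conda env
python setup.py install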

ravihammond commented 2 years ago

Great to hear that you're going to try to get it working with PyTorch 1.7.1 too! I've created a separate issue that details the solution to the other bug I found here. Also, if you're going to use my Dockerfile above, here is the setup_conda.bash file I add to the bashrc:

# set path
CONDA_PREFIX=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export CPATH=${CONDA_PREFIX}/include:${CPATH}
export LIBRARY_PATH=${CONDA_PREFIX}/lib:${LIBRARY_PATH}
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}

# avoid tensor operation using all cpu cores
export OMP_NUM_THREADS=1

Thanks for the suggestions. I'll get started on them and try to solve this deadlock problem.

hengyuan-hu commented 2 years ago

Resolved with the latest PyTorch & CUDA.