CMU-Perceptual-Computing-Lab / openpose_train

Training repository for OpenPose
https://github.com/CMU-Perceptual-Computing-Lab/openpose

Ubuntu 16.04: No Change in Loss While Training OpenPose for Body25 #33

Closed: justvasq closed this issue 4 years ago

justvasq commented 4 years ago

Posting rules

  1. Duplicated posts will not be answered. Check the FAQ section, other GitHub issues, and the general documentation before posting (e.g., low speed, out-of-memory, output format, 0 people detected, installation issues, ...).
  2. Fill in the Your System Configuration section (all of it, or the post will not be answered!) if you are facing an error or unexpected behavior. Feature requests and some other types of posts might not require it.
  3. No questions about 3rd party libraries:
    • Caffe errors/issues, check Caffe documentation.
    • CUDA check failed errors: They are usually fixed by re-installing CUDA, then re-installing the proper cuDNN version, and then re-compiling (or re-installing) OpenPose. Otherwise, check for help in CUDA forums.
    • OpenCV errors: Install the default/pre-compiled OpenCV or check for online help.
  4. Set a proper issue title: add the Ubuntu/Windows keyword and be specific (e.g., do not call it: Error).
  5. Only English comments. Posts which do not follow these rules will be ignored, closed, or reported with no further clarification.

Issue Summary

I'm trying to train OpenPose with my own data, but I decided to test my configuration by training OpenPose with the default data first. I followed the steps from https://github.com/CMU-Perceptual-Computing-Lab/openpose_train/blob/master/training/README.md. My training loss does not decrease after tens of thousands of iterations.

[Attached figure: 3_13_training_graph (training loss plot)]

Executed Command (if any)

Note: add --logging_level 0 --disable_multi_thread to get more detailed debug information.

sudo bash train_pose.sh 0

OpenPose Output (if any)

[Attached log: 3_13_training_test_log.txt]

Errors (if any)

Type of Issue

You may select multiple topics; delete the rest:

Your System Configuration

  1. Whole console output (if errors appeared), paste the error to PasteBin and then paste the link here: LINK

  2. OpenPose version: Most recent OpenPose Train version

  1. General configuration:

    • Installation mode: Just followed the OpenPose_Train README
    • Operating system (lsb_release -a in Ubuntu): Ubuntu
    • Operating system version (e.g., Ubuntu 16, Windows 10, ...): Ubuntu 16.04
    • Release or Debug mode? (by default: release):
    • Compiler (gcc --version in Ubuntu or VS version in Windows): gcc 5.4.0
  2. Non-default settings:

    • 3-D Reconstruction module added? (by default: no):
    • Any other custom CMake configuration with respect to the default version? (by default: no): No
  3. 3rd-party software:

    • Caffe version: Default OpenPose train Caffe (https://github.com/CMU-Perceptual-Computing-Lab/openpose_caffe_train)
    • CMake version (cmake --version in Ubuntu): cmake 3.5.1
    • OpenCV version: 4.2.0
  4. If GPU mode issue:

    • CUDA version (cat /usr/local/cuda/version.txt in most cases): 10.2.89
    • cuDNN version: cuDNN 7
    • GPU model (nvidia-smi in Ubuntu): Quadro P6000
  5. If CPU-only mode issue:

    • CPU brand & model:
    • Total RAM memory available:
  6. If Python API:

    • Python version: 3.5.2
    • NumPy version (python -c "import numpy; print(numpy.version.version)" in Ubuntu): 1.18.1
  7. If Windows system:

    • Portable demo or compiled library?
  8. If speed performance issue:

    • Report OpenPose timing speed based on this link.
gineshidalgo99 commented 4 years ago

Actually, it is decreasing (you can check this by re-plotting the curve on a log-log scale), so that looks like the expected rate of decrease. I found the same in my case: the COCO accuracy kept improving until about 600k-800k iterations, even though the training loss followed a trend very similar to your curve.
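
For reference, below is a minimal sketch (not part of this repo) of one way to do that check. It assumes the attached log follows the standard Caffe solver output, i.e. lines such as "Iteration 1000 (...), loss = 0.123", and uses matplotlib to re-plot the training loss on a log-log scale; adjust the regex and the file name if your log differs.

# Minimal sketch: parse a Caffe-style training log and re-plot the loss on a
# log-log scale, where a slow but steady decrease is easier to see.
import re
import matplotlib.pyplot as plt

pattern = re.compile(r"Iteration (\d+).*?loss = ([0-9.eE+-]+)")
iterations, losses = [], []

# "3_13_training_test_log.txt" is the training log attached in this issue.
with open("3_13_training_test_log.txt") as log_file:
    for line in log_file:
        match = pattern.search(line)
        if match and int(match.group(1)) > 0:  # skip iteration 0 (log(0) is undefined)
            iterations.append(int(match.group(1)))
            losses.append(float(match.group(2)))

plt.loglog(iterations, losses)
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.title("OpenPose BODY_25 training loss (log-log scale)")
plt.show()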