TensorFlow 2.5.0 model training and hyperparameter tuning via Keras-Tuner inside a Dash web app served by Gunicorn 20.1.0 eventually stalls, and CPU usage in the container statistics plummets

tobby-lie commented 2 years ago

[x] I have tried with the latest version of Docker Desktop
[x] I have tried disabling enabled experimental features
[ ] I have uploaded Diagnostics
Diagnostics ID: Unable to generate diagnostics ID

Actual behavior

With the setup described under "Expected behavior", the program completes hyperparameter tuning successfully but stalls partway through model training. When this happens, CPU usage in the container statistics plummets.

Expected behavior

I am running a web app inside a Docker container using gunicorn 20.1.0, TensorFlow 2.5.0, and Keras-Tuner. Clicking a button in the app triggers a Dash callback that trains a TensorFlow neural network model. Keras-Tuner tunes the model's hyperparameters before training begins.

These are the contents of my Dockerfile used to build the image:

FROM python:3.7

ADD requirements.txt /app/
WORKDIR /app

# install system dependencies
RUN apt-get update \
    && apt-get -y install gcc make \
    && rm -rf /var/lib/apt/lists/*

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

RUN pip install -r requirements.txt

ADD . /app

COPY generic_utils.py /usr/local/lib/python3.7/site-packages/keras/utils/generic_utils.py

EXPOSE 8050

# RUN pwd

# CMD ["python", "api/Data_Scrapers/nbi_scraper.py"]
# CMD ["python", "app.py"]

# CMD ["gunicorn", "--worker-class=gthread", "--workers=1", "--threads=8", "-b", "0.0.0.0:8050", "app:server"]
CMD ["gunicorn", "--worker-class=gevent", "--timeout", "10000", "--workers=1", "--threads=8", "-b", "0.0.0.0:8050", "app:server"]

As you can see, I've tried two worker classes, 'gthread' and 'gevent', and neither works.
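For reference, the two CMD variants above could also be expressed as a gunicorn configuration file. This is only a sketch: the file name gunicorn.conf.py and the WORKER_CLASS environment variable are assumptions made for illustration and are not part of the repository. One detail worth noting is that the gunicorn settings documentation describes threads as affecting only the gthread worker type, so --threads=8 should have no effect in the gevent variant.

```python
# gunicorn.conf.py (hypothetical file; mirrors the CMD flags in the Dockerfile)
import os

# Hypothetical switch so either worker class can be tried without rebuilding.
worker_class = os.environ.get("WORKER_CLASS", "gthread")

bind = "0.0.0.0:8050"
workers = 1

# Seconds. The gevent CMD sets --timeout 10000; this sketch applies it to both
# variants so a long training run is not cut off by the default worker timeout.
timeout = 10000

# gunicorn documents `threads` as affecting only the gthread worker type.
threads = 8 if worker_class == "gthread" else 1
```

Assuming that file sits next to app.py, it would be loaded with gunicorn -c gunicorn.conf.py app:server.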

I have run this app locally with python app.py, without a container, and it runs successfully with no stalls on my Windows machine. I expected the behavior to be the same inside a Docker container.
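Since the same code behaves differently on the host and in the container, it may help to compare what each environment actually exposes to the process. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the cgroup paths are the usual Linux locations for cgroup v2 and v1, which may not exist on every setup (on Windows the limit is simply reported as None).

```python
import os
import platform
from pathlib import Path

# Usual locations of the container memory limit on Linux.
CGROUP_MEMORY_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def read_memory_limit():
    """Return the cgroup memory limit in bytes, or None if unlimited/unknown."""
    for path in CGROUP_MEMORY_FILES:
        try:
            raw = path.read_text().strip()
            # cgroup v2 writes the literal string "max" when no limit is set.
            return None if raw == "max" else int(raw)
        except (OSError, ValueError):
            continue
    return None


def environment_report():
    """Collect a few facts worth diffing between host and container."""
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "memory_limit_bytes": read_memory_limit(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this once on the host and once inside the container gives two reports to compare. It does not explain the stall by itself, but it rules differences in CPU count or memory limit in or out.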

Information

Is it reproducible? Yes
Is the problem new? Yes
Did the problem appear with an update? No
Windows Version: Windows 10 Pro
Docker Desktop Version: 4.2.0
WSL2 or Hyper-V backend? WSL2
Are you running inside a virtualized Windows e.g. on a cloud server or a VM: No

Steps to reproduce the behavior

1. Clone the docker_stall_issues branch of this repository: https://github.com/tobby-lie/Bridge_Management_As_A_Service/tree/docker_stall_issues (the repo is private; access can be provided if required).
2. Navigate to the src/dash directory.
3. Build the Dockerfile.
4. Run a Docker container from the built image, exposing a port of your choice.
5. Navigate to localhost:
6. You will be taken to the home page of a web app. Open the dropdown menu and click 'Forecast'.
7. On the forecast page, click the orange button, which reveals another orange button labeled 'Start Training'. Click it.
8. On the command line, use the 'docker logs -f ' command to monitor the progress of the program.
9. The first model runs hyperparameter tuning for 9 trials and then begins model training. Usually around the 1400th epoch, program execution stalls and CPU usage within the container plummets.
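To find out where the process is sitting when the stall described in the last step occurs, one option is the standard library's faulthandler module, which can dump the stack of every thread if a timer is not re-armed in time. The sketch below is an assumption-laden illustration, not code from the repository: arm_stall_dump, disarm_stall_dump, train, and the 300-second threshold are hypothetical, and train is a stand-in for the real training loop. With Keras, the re-arm call could presumably be made from a callback's on_epoch_end hook instead.

```python
import faulthandler
import sys

STALL_TIMEOUT_S = 300  # hypothetical threshold: longest plausible gap between epochs


def arm_stall_dump(timeout=STALL_TIMEOUT_S, file=sys.stderr):
    """(Re)arm a timer that dumps all thread stacks if not re-armed in time.

    Calling dump_traceback_later again replaces the previous timer, so each
    call acts as a heartbeat.
    """
    faulthandler.dump_traceback_later(timeout, repeat=False, file=file, exit=False)


def disarm_stall_dump():
    """Cancel the pending dump once training has finished."""
    faulthandler.cancel_dump_traceback_later()


def train(num_epochs, run_epoch, timeout=STALL_TIMEOUT_S, file=sys.stderr):
    """Stand-in training loop: run_epoch(epoch) represents one epoch of work."""
    arm_stall_dump(timeout, file)
    try:
        for epoch in range(num_epochs):
            run_epoch(epoch)
            arm_stall_dump(timeout, file)  # heartbeat after every epoch
    finally:
        disarm_stall_dump()
```

Because the dump goes to stderr by default, it should show up in the docker logs output alongside the training progress, with one traceback per thread at the moment the timer expired.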

docker-robott commented 2 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

docker-robott commented 2 years ago

Closed issues are locked after 30 days of inactivity. This helps our team focus on active issues.

If you have found a problem that seems similar to this, please open a new issue.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle locked
