Open · @StateGovernment opened this issue 1 year ago
@StateGovernment please post the error message.
Is there a reason you want to use an A100? TPUs train really fast, and the model weights can easily be converted to PyTorch weights with diffusers later if needed.
I haven't run this code with GPUs, but it should technically work. My guess is that the machine type needs to be changed to one that supports A100s. If you're using a single A100 (40GB), change the `machine_type` line to `a2-highgpu-1g` and call `gcp_run_train.py` with `--accelerator-count=1`.
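Put together, the machine-spec portion of the job config would look roughly like this (a sketch only; the string `"NVIDIA_TESLA_A100"` stands in for the `aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100` enum so the snippet is self-contained):

```python
# Sketch of the Vertex AI worker-pool machine spec for a single A100 (40GB).
machine_spec = {
    "machine_type": "a2-highgpu-1g",        # A2 VM type that supports one A100
    "accelerator_type": "NVIDIA_TESLA_A100", # enum value in the real config
    "accelerator_count": 1,                  # matches --accelerator-count=1
}
```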
For the compatibility of machine types with GPU types, take a look at this link.
You'll also need to install the CUDA build of jaxlib; change this line to:
RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
Rebuild the container, push it to GCR, and run `gcp_run_train.py` again.
@entrpn I only have a TPU quota of 8, so training fails after 4–5 minutes. I requested a quota increase to 30, which will take a while, so in the meantime I'd like to see how the model trains on A100s, and ideally gather metrics to compare against TPUs once I have more quota.
This was the error I ran into when I tried to change the accelerator type.
@StateGovernment that's because the accelerator count needs to be set to a minimum of 8; if you set the accelerator count to 8 with TPU, it should work.
@entrpn The accelerator count was already set to 8 by default, and my account's TPU quota is limited to exactly 8. I tried to change the count to 6 through the CLI, but it wouldn't let me, so the count appears to be hard-set to 8. Training still stops after 11 minutes; let me attach a screenshot of what I see on the console when the training stops.
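For what it's worth, the behavior described above (a count below 8 being rejected for TPU_V3) can be summarized in a small hypothetical check; this is not code from the repo, just a restatement of the constraint:

```python
def validate_tpu_v3_count(accelerator_count: int) -> None:
    """Hypothetical check mirroring the Vertex AI constraint that the
    cloud-tpu machine type with TPU_V3 requires a minimum count of 8."""
    if accelerator_count < 8:
        raise ValueError(
            f"TPU_V3 requires accelerator_count >= 8, got {accelerator_count}"
        )
```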
@entrpn I've successfully launched a training job on an A100 after changing the configuration as suggested above, but there is almost no activity in the console or logs; after almost 25 minutes it still says "in progress" with zero activity. Please refer to the screenshots below, along with CPU utilisation and logs at the very end. Please help.
@StateGovernment I forgot to add another step: the container doesn't install the CUDA drivers, so it won't use the GPU and will be extremely slow. You'll need to change [this line](https://github.com/entrpn/serving-model-cards/blob/main/training-dreambooth/Dockerfile#L1) to something like:
FROM nvidia/cuda:11.3.1-base-ubuntu20.04
At this point you might need to make extra modifications to the Dockerfile; you can look at [this Dockerfile](https://github.com/entrpn/serving-model-cards/blob/main/stable-diffusion-batch-job/Dockerfile) for reference.
@entrpn I see, I somehow missed that detail too; thank you for pointing it out.
I also believe this line needs to change, but I'm not sure what to change it to, so please help me out.
I might even end up making a separate Dockerfile altogether for GPUs.
@entrpn I've followed the instructions above, but the training wouldn't start at all. Please refer to the screenshots below; I've also attached the Dockerfile I used to build the image and the config used to launch the job. Please help.
Dockerfile
FROM nvidia/cuda:11.3.1-base-ubuntu20.04
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && \
    apt install -y python3.8 && \
    apt-get -y install python3-pip
RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y --no-install-recommends \
    git \
    wget \
    g++ \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y curl
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | \
tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
tee /usr/share/keyrings/cloud.google.gpg && apt-get update -y && apt-get install google-cloud-sdk -y
# RUN pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip install git+https://github.com/huggingface/diffusers.git
RUN pip install transformers flax optax torch torchvision ftfy tensorboard modelcards
WORKDIR 'training_dreambooth'
COPY . .
Config used to launch the training job
custom_job = {
"display_name": "training-dreambooth-alisha-1000steps",
"job_spec": {
"worker_pool_specs": [
{
"machine_spec": {
# "machine_type": "cloud-tpu",
# "accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,
# "accelerator_count": 8,
"machine_type": "a2-highgpu-1g",
"accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100,
"accelerator_count": 1,
},
"replica_count": 1,
"disk_spec" : {
"boot_disk_type": "pd-ssd",
"boot_disk_size_gb" : 500
},
"container_spec": {
"image_uri": "gcr.io/dreamboothtest/training-dreambooth-new-gpu:latest",
"command": [],
"args": [],
"env" : [
{"name" : "MODEL_NAME", "value" : "runwayml/stable-diffusion-v1-5"},
{"name" : "INSTANCE_PROMPT", "value" : "a photo of al45 person"},
{"name" : "GCS_OUTPUT_DIR", "value" : "gs://alishadreamboothtest"},
{"name" : "RESOLUTION", "value" : "512"},
{"name" : "BATCH_SIZE", "value" : "1"},
{"name" : "LEARNING_RATE", "value" : "1e-6"},
{"name" : "MAX_TRAIN_STEPS", "value" : "1000"},
{"name" : "HF_TOKEN", "value" : "<>"},
{"name" : "CLASS_PROMPT", "value" : "A photo of a person"},
{"name" : "NUM_CLASS_IMAGES", "value" : "56"},
{"name" : "PRIOR_LOSS_WEIGHT", "value" : "1.0"},
{"name" : "GCS_INPUT_DIR", "value" : "gs://alishadreamboothtest/training_images"},
]
},
}
],
"enable_web_access" : True
},
}
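For reference, the `env` entries above are presumably read by the training script inside the container; a minimal sketch of that pattern (the defaults shown here are illustrative, not the repo's actual defaults):

```python
import os

# Training settings injected through the job spec's "env" block.
# Defaults are illustrative fallbacks, not the repo's actual defaults.
model_name = os.environ.get("MODEL_NAME", "runwayml/stable-diffusion-v1-5")
resolution = int(os.environ.get("RESOLUTION", "512"))
batch_size = int(os.environ.get("BATCH_SIZE", "1"))
learning_rate = float(os.environ.get("LEARNING_RATE", "1e-6"))
max_train_steps = int(os.environ.get("MAX_TRAIN_STEPS", "1000"))
```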
The reason your job completes without doing any work is that the base TPU image knows to find `main.sh` as the entrypoint, while your custom image has no entrypoint set. Add this to the end of your Dockerfile:
ENTRYPOINT ["./main.sh"]
This should start the job.
How do I change the default accelerator type used for Dreambooth training?
Simply changing the following line throws a cascade of RPC errors; please point me toward a solution. https://github.com/entrpn/serving-model-cards/blob/cd3cd107c435ef0fa47f352f104e788265842f0e/training-dreambooth/gcp_run_train.py#L21
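As a sketch of what a configurable accelerator switch could look like, the line could be replaced with a small flag-driven mapping (the flag name and mapping below are assumptions, not the script's actual interface, and strings stand in for the `aiplatform.gapic.AcceleratorType` enums):

```python
import argparse

# Hypothetical mapping from a CLI-friendly name to the machine type,
# accelerator type, and default count used in the Vertex AI job spec.
ACCELERATORS = {
    "tpu": ("cloud-tpu", "TPU_V3", 8),
    "a100": ("a2-highgpu-1g", "NVIDIA_TESLA_A100", 1),
}

parser = argparse.ArgumentParser()
parser.add_argument("--accelerator", choices=ACCELERATORS, default="tpu")
# Example invocation requesting a single A100:
args = parser.parse_args(["--accelerator", "a100"])

machine_type, accelerator_type, accelerator_count = ACCELERATORS[args.accelerator]
```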