Training failed on TPUs for Dreambooth

StateGovernment commented 1 year ago

Followed all instructions by the book and had the images in a local images folder as well. The training job stopped after 6 minutes with the error "replica workerpool0-0 exited with a non-zero status of 1". I had a TPU quota of 25, so quota should not have run out either.

Screenshot of the Training Job from Console.

Here's a csv of logs for reference. downloaded-logs-20230317-023353.csv

CLI command I used to trigger the job:

python3.8 gcp_run_train.py --project-id=dreamboothtest --region=us-central1 --image-uri=gcr.io/dreamboothtest/training-dreambooth:latest --gcs-output-dir=gs://dreamboothmodelstore --instance-prompt="a photo of qw23 person" --hf-token="<>" --class-prompt="A photo of a person" --max-train-steps=800

Please help.

entrpn commented 1 year ago

Let me take a look today and get back to you soon!

entrpn commented 1 year ago

@StateGovernment I updated the repo, please pull it and try again.

StateGovernment commented 1 year ago

@entrpn Training works perfectly with the new version, thank you. However, the environment needed to run inference on TPU is not provided yet. I am trying to infer the environment from the given Dockerfile, but have not been successful so far. Could you also please freeze your pip environment and provide a requirements file just for inference through JAX on TPU?

This is the error I get for inferencing on Colab, for reference.

Config used on Colab:

pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install git+https://github.com/huggingface/diffusers.git
pip install transformers flax optax torch torchvision ftfy tensorboard modelcards

Thank you so much for the training fix. Closing this now, as this was a training-specific issue and it has been fixed.

entrpn commented 1 year ago

@StateGovernment I haven't tested on Colab. Try running the following in a GPU Colab environment or in a Vertex AI Workbench notebook with a T4 GPU (this is my environment).

Copy your model to a model folder in Colab or Workbench.
Install dependencies:

transformers==4.27.1
diffusers==0.14.0
jax==0.3.25
jaxlib==0.3.25
flax==0.6.4
torch==1.13.1

Call the script:

from diffusers import StableDiffusionPipeline
import torch
import jax

# The PyTorch pipeline takes torch_dtype (not dtype) to select half precision.
pipe = StableDiffusionPipeline.from_pretrained("./model", safety_checker=None, from_flax=True, torch_dtype=torch.float16).to("cuda")

prompt = "a photo of sks man wearing a suit and sunglasses, highly detailed, close up shot"
negative_prompt = "rendered, unrealistic, work of art, artistic, cinematic"

image = pipe(prompt, negative_prompt=negative_prompt).images[0]

image.save("output.png")

This script will automatically convert the Flax/JAX weights to PyTorch and run inference. If you want to save the weights in PyTorch format, just run:

from diffusers import StableDiffusionPipeline
import torch
import jax

# The PyTorch pipeline takes torch_dtype (not dtype) to select half precision.
pipe = StableDiffusionPipeline.from_pretrained("./model", safety_checker=None, from_flax=True, torch_dtype=torch.float16).to("cuda")

# Save torch weights
pipe.save_pretrained("model_torch")

Then you can load the PyTorch model directly instead of converting it every time you run the script on a GPU:

pipe = StableDiffusionPipeline.from_pretrained("./model_torch", safety_checker=None, torch_dtype=torch.float16).to("cuda")

StateGovernment commented 1 year ago

Working perfectly on a Colab GPU. Is there any way to replicate the same on TPUs (Colab or a GCP VM)?

StateGovernment commented 1 year ago

@entrpn I have a few questions about the training flow that are crucial for our pipeline. Can I upload images of a different subject each time I call gcp_run_train.py in step 4?

At which point do the images get uploaded for the training job? Do they get uploaded while the container is built and pushed? If so, do I have to build a new training container each time I want to train Dreambooth on a different subject?

entrpn commented 1 year ago

@StateGovernment In this case, you'll need to modify train.py to read an environment variable like this line. The variable, let's call it GCS_INPUT_DIR as an example, points to a GCS folder where the images are located, something like gs://bucket-id/training-images.

In gcp_run_train.py, add that environment variable when you create the job, like this.
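A minimal sketch of how the variable could be attached to the job, assuming gcp_run_train.py builds a Vertex AI custom job from a `worker_pool_specs` list. The helper name `build_worker_pool_specs` and its parameters are hypothetical and not taken from the repo; only the `env` entry inside `container_spec` is the point here.

```python
def build_worker_pool_specs(image_uri, gcs_input_dir, machine_type,
                            accelerator_type=None, accelerator_count=0):
    """Return a worker_pool_specs list whose container receives GCS_INPUT_DIR.

    Hypothetical helper: the machine and accelerator values are passed in
    by the caller because the repo's actual settings are not shown here.
    """
    machine_spec = {"machine_type": machine_type}
    if accelerator_type:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count

    return [
        {
            "machine_spec": machine_spec,
            "replica_count": 1,
            "container_spec": {
                "image_uri": image_uri,
                # Environment variables are a list of name/value pairs.
                "env": [{"name": "GCS_INPUT_DIR", "value": gcs_input_dir}],
            },
        }
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    specs = build_worker_pool_specs(
        image_uri="gcr.io/my-project/training-dreambooth:latest",
        gcs_input_dir="gs://bucket-id/training-images",
        machine_type="n1-standard-4",
    )
    print(specs[0]["container_spec"]["env"])
```

The resulting list could then be passed as `worker_pool_specs` when constructing `google.cloud.aiplatform.CustomJob`, if that is how the job is created.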

Finally, you'll have to modify train.py to download the images when the container starts the training job.

For an example of how to download images, take a look at this line. When the job has finished training, I push the model to GCS using Python's subprocess, but you can also do it with the google-cloud-storage Python library.
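A rough sketch of the google-cloud-storage variant, assuming the package is installed in the container and the job's service account can read and write the bucket. The function names `split_gcs_uri`, `download_images`, and `upload_dir` are made up for illustration and are not part of the repo.

```python
import os
from pathlib import Path


def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix.strip("/")


def download_images(gcs_uri, local_dir):
    """Download every object under gcs_uri into local_dir (flat layout)."""
    from google.cloud import storage  # imported lazily; needs credentials

    bucket_name, prefix = split_gcs_uri(gcs_uri)
    os.makedirs(local_dir, exist_ok=True)
    client = storage.Client()
    list_prefix = prefix + "/" if prefix else None
    paths = []
    for blob in client.list_blobs(bucket_name, prefix=list_prefix):
        if blob.name.endswith("/"):
            continue  # skip "folder" placeholder objects
        # Note: files with the same name in different subfolders would collide.
        target = os.path.join(local_dir, os.path.basename(blob.name))
        blob.download_to_filename(target)
        paths.append(target)
    return paths


def upload_dir(local_dir, gcs_uri):
    """Upload all files under local_dir to gcs_uri, keeping relative paths."""
    from google.cloud import storage  # imported lazily; needs credentials

    bucket_name, prefix = split_gcs_uri(gcs_uri)
    bucket = storage.Client().bucket(bucket_name)
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            rel = path.relative_to(local_dir).as_posix()
            blob_name = f"{prefix}/{rel}" if prefix else rel
            bucket.blob(blob_name).upload_from_filename(str(path))
```

In train.py this might be called as `download_images(os.environ["GCS_INPUT_DIR"], "images")` before training and `upload_dir("model", output_uri)` afterwards, where `output_uri` stands for whatever output location the job was given.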

Lastly, if you have too many images, you can use gcsfuse instead. All Vertex AI training jobs use gcsfuse to mount GCS directories as local filesystems. In this case, your environment variable would be something like /gcs/bucket-id/training-images and you would not need to copy the files over to the training job. I'm not sure if this works well with the diffusers library though.
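As a sketch of the gcsfuse route, the snippet below maps a value of GCS_INPUT_DIR onto the /gcs mount and lists the image files found there. Both helper names and the list of image extensions are assumptions for illustration; whether diffusers reads well from the mounted path is untested, as noted above.

```python
import os


def resolve_image_dir(value):
    """Turn 'gs://bucket/prefix' into '/gcs/bucket/prefix'.

    Values that are already local paths (for example under /gcs) are
    returned unchanged apart from a trailing slash.
    """
    if value.startswith("gs://"):
        return "/gcs/" + value[len("gs://"):].rstrip("/")
    return value.rstrip("/") or "/"


def list_training_images(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return the sorted paths of image files directly inside image_dir."""
    return sorted(
        os.path.join(image_dir, name)
        for name in os.listdir(image_dir)
        if name.lower().endswith(extensions)
    )


if __name__ == "__main__":
    # Inside the training job, the directory would come from the env var.
    image_dir = resolve_image_dir(
        os.environ.get("GCS_INPUT_DIR", "gs://bucket-id/training-images")
    )
    print(image_dir)
```

With this approach the training script reads the images in place, so no copy step is needed when the container starts.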

StateGovernment commented 1 year ago

This is working perfectly; all I need to pass is a GCS bucket address with the training images. I am also assuming that, in a similar way, we can eliminate the need to download a model from Hugging Face by including the model in the container itself, saving a few more seconds.

Now I have a similar problem with inference as well. I would rather not build an image every time I want to deploy a model, so would I be able to pass in a bucket address for the model and have it deployed directly from the bucket in a similar fashion?

Thank you for the guidance @entrpn

entrpn / serving-model-cards

Training failed on TPUs for Dreambooth #8