Stability-AI / StableCascade

Can't train lora with local dataset; Parameter validation failed: Invalid bucket name "file:" #21

Closed · k8tems closed this issue 7 months ago

k8tems commented 7 months ago

According to the README in the `train` directory, the config supports local files:

webdataset_path:
  - s3://path/to/your/first/dataset/on/s3
  - file:/path/to/your/local/dataset.tar
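
(For context, a sketch of how I understand these schemes to be consumed, based on the `webdataset` library's generic URL handling rather than this repo's loader; both paths are the README's placeholders:)

```python
# Based on webdataset's URL handling, not on StableCascade's own loader.
import webdataset as wds

# "file:" URLs are read directly from the local filesystem.
local = wds.WebDataset("file:/path/to/your/local/dataset.tar")

# S3 data is typically streamed through the AWS CLI via a "pipe:" URL.
remote = wds.WebDataset("pipe:aws s3 cp s3://path/to/your/first/dataset/on/s3 -")
```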

However, when I run the training script, I get the following error, and the script appears to be stuck in an infinite loop trying to copy from AWS S3.

Output (config included):


**STARTIG JOB WITH CONFIG:**
adaptive_loss_weight: null
allow_tf32: true
backup_every: 1000
batch_size: 32
bucketeer_random_ratio: 0.05
captions_getter: null
checkpoint_extension: safetensors
checkpoint_path: /tmp/cascade/chk
clip_image_model_name: openai/clip-vit-large-patch14
clip_text_model_name: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
dataset_filters: null
dist_file_subfolder: ''
dtype: null
effnet_checkpoint_path: models/effnet_encoder.safetensors
ema_beta: null
ema_iters: null
ema_start_iters: null
experiment_id: stage_c_3b_lora
generator_checkpoint_path: models/stage_c_bf16.safetensors
grad_accum_steps: 4
image_size: 768
lora_checkpoint_path: null
lr: 0.0001
model_version: 3.6B
module_filters:
- .attn
multi_aspect_ratio:
- 1/1
- 1/2
- 1/3
- 2/3
- 3/4
- 1/5
- 2/5
- 3/5
- 4/5
- 1/6
- 5/6
- 9/16
output_path: /tmp/cascade/out
previewer_checkpoint_path: models/previewer.safetensors
rank: 4
save_every: 100
train_tokens:
- - '[mbl]'
  - ^cat</w>
training: true
updates: 10000
use_fsdp: false
wandb_entity: k8tems
wandb_project: StableCascade
warmup_updates: 1
webdataset_path:
- file:/tmp/mbl_2024_02_14_13_12.tar

------------------------------------

**INFO:**
adaptive_loss: null
ema_loss: null
iter: 0
total_steps: 0
train_tokens: null
wandb_run_id: pegmc3ny

------------------------------------

['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']

Parameter validation failed:
Invalid bucket name "file:": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Training with batch size 32 (8/GPU)
['dataset', 'dataloader', 'iterator']
**DATA:**
dataloader: DataLoader
dataset: WebDataset
iterator: Bucketeer
training: NoneType

------------------------------------

Unknown options: -

Unknown options: -

Unknown options: -
/usr/local/lib/python3.9/dist-packages/webdataset/handlers.py:34: UserWarning: OSError("(('aws s3 cp {  } -',), {'shell': True, 'bufsize': 8192}): exit 255 (read) {}", <webdataset.gopen.Pipe object at 0x7f55e963ae50>, 'pipe:aws s3 cp {  } -')
  warnings.warn(repr(exn))

The logs go on forever, with most of the output being "Unknown options: -".
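
That warning shows the loader shelling out to `aws s3 cp {  } -` with an empty source, so the AWS CLI exits with 255 and webdataset keeps retrying; the local path never makes it into the command. As a standalone sanity check (assuming the standard `webdataset` API; this snippet is not part of the repo), the tar can be opened directly via the `file:` scheme:

```python
# Standalone check that the tar itself is readable via the file: scheme,
# independent of the training script.
import webdataset as wds

ds = wds.WebDataset("file:/tmp/mbl_2024_02_14_13_12.tar")
sample = next(iter(ds))
print(sample["__key__"], sorted(sample.keys()))
```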

k8tems commented 7 months ago

I think I got it. I just had to change the `webdataset_path` config from a list to a string.

from:

webdataset_path: 
  - file:/tmp/mbl_2024_02_14_13_12.tar

to:

webdataset_path: file:/tmp/mbl_2024_02_14_13_12.tar
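
If the loader only handles the string form, a small coercion on the config side would keep both YAML variants working. This is just a sketch of such a workaround (my assumption, not code from this repo):

```python
# Sketch of a config-side workaround (an assumption, not the repo's API):
# collapse a one-element YAML list into the string the loader expects.
def coerce_webdataset_path(path):
    if isinstance(path, list) and len(path) == 1:
        return path[0]
    return path

# Hypothetical paths, for illustration only.
assert coerce_webdataset_path(["file:/tmp/dataset.tar"]) == "file:/tmp/dataset.tar"
assert coerce_webdataset_path("file:/tmp/dataset.tar") == "file:/tmp/dataset.tar"
```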