Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

Loading preprocessed_212k laion dataset without any response in terminal #59

Closed MqLeet closed 2 months ago

MqLeet commented 4 months ago

Hi @bokyeong1015, thanks for your great work!

I modified diffusers/train_text_to_image.py and applied your fine-tuning strategy on the 212k subset of LAION. When I run the training code, however, loading the dataset takes a very long time: the terminal shows no response even after 40 minutes. Is this caused by the large number of images, or by a bug in my code?

    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    # download the dataset.
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
            data_dir=args.train_data_dir,
        )
    else:
        data_files = {}
        if args.train_data_dir is not None:
            data_files["train"] = os.path.join(args.train_data_dir, "**")
        print("*** load dataset: start")
        t0 = time.time()
        dataset = load_dataset(
            "imagefolder",
            # data_files=data_files,
            cache_dir=args.cache_dir,
            split="train",
            data_dir=args.train_data_dir,
        )
        print(f"*** load dataset: end --- {time.time()-t0} sec")

        # See more about loading custom images at
        # https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.

    # Modified from the original script, which used dataset["train"].column_names.
    # With split="train" (imagefolder branch above), load_dataset returns a single
    # Dataset rather than a DatasetDict, so column_names is read from it directly.
    # The hub branch above does not pass split, so this assumes the imagefolder branch.
    column_names = dataset.column_names
    image_column = column_names[0]
    caption_column = column_names[1]

This is the dataset-loading code. How long should the 'load_dataset' call normally take on this subset?
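To check whether the sheer number of images is the bottleneck, one option is to time a plain directory traversal of the training folder on its own, without going through 'load_dataset'. The following is only a rough sketch using the Python standard library. The helper names count_files and time_file_listing are hypothetical and are not part of datasets or the BK-SDM scripts. If listing the files is already slow (for example on a network filesystem), the delay is probably not specific to the training code.

```python
import os
import time


def count_files(root):
    """Recursively count regular files under root without opening them."""
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total


def time_file_listing(root):
    """Return (file_count, seconds) for one full traversal of root."""
    t0 = time.perf_counter()
    n = count_files(root)
    return n, time.perf_counter() - t0


# Hypothetical usage, with the same path that is passed as --train_data_dir:
#   n, sec = time_file_listing(args.train_data_dir)
#   print(f"*** listed {n} files in {sec:.1f} sec")
```

This measures only file enumeration. It does not decode images or build the cache, so it gives a lower bound on the loading time rather than an estimate of the whole call.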

Thanks for your great work, looking forward to your reply!

Best wishes, Qianli

MqLeet commented 4 months ago

My datasets library version is 2.19.0.
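Because the terminal prints nothing while the call is running, it is hard to tell a slow load from a hang. A possible workaround, sketched below with the standard library only, is a small heartbeat context manager that prints the elapsed time at a fixed interval. The name heartbeat and its parameters are hypothetical. It assumes the wrapped call lets other Python threads run from time to time, which I have not verified for every code path in 'load_dataset'.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def heartbeat(label, interval=30.0, out=print):
    """Report elapsed time every `interval` seconds while the body runs."""
    stop = threading.Event()
    t0 = time.perf_counter()

    def _beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            elapsed = time.perf_counter() - t0
            out(f"*** {label}: still running after {elapsed:.0f} sec")

    worker = threading.Thread(target=_beat, daemon=True)
    worker.start()
    try:
        yield
    finally:
        # Stop the reporter thread whether the body succeeded or raised.
        stop.set()
        worker.join()


# Hypothetical usage around the slow call:
#   with heartbeat("load dataset"):
#       dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, split="train")
```

If the heartbeat lines keep appearing, the process is at least alive and the load is slow rather than dead.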