cyber-meow / anime_screenshot_pipeline

A 99% automatized pipeline to construct training set from anime and more for text-to-image model training
MIT License
178 stars 10 forks

Very slow file deletion in stage 1 #51

Open dill-shower opened 2 months ago

dill-shower commented 2 months ago

In the first stage, the script computes the embeddings quickly and then proceeds to deletion, but deleting the files is very slow. The average speed is about 7 files per second, with about 50 thousand files to delete. As a result, removing similar screenshots alone takes several hours.

cyber-meow commented 2 months ago

Hello,

This is not normal. I quickly tested on both my laptop (with a 3070 Ti) and my personal server (with a 4090). On both I get around 11 it/s. Since I process 16 images per batch by default (set with detect_duplicate_batch_size), this means we are processing around 100 images per second. Processing 6k images is done within about 1 minute. As I am using MobileNetV3 here, speed should not be an issue. The second stage is generally much slower and can indeed take hours (I am not sure whether the cropping from waifuc has been improved since then).
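As a quick sanity check on the numbers above, the arithmetic can be written out as below. This is only a sketch that takes the quoted 11 it/s and the default batch size of 16 at face value; the variable names are made up for illustration. The product is an upper bound somewhat above the rough 100 images per second mentioned, and either figure puts 6k images at about a minute or less.

```python
# Figures quoted in the comment above; variable names are illustrative only.
iterations_per_second = 11   # tqdm rate reported on both machines
batch_size = 16              # default detect_duplicate_batch_size
num_images = 6000            # "6k images"

# Upper bound on throughput if every batch is full.
images_per_second = iterations_per_second * batch_size

# Time to process all images at that rate, and at the rounder 100 images/s.
seconds_at_upper_bound = num_images / images_per_second
seconds_at_100_per_second = num_images / 100

print(f"{images_per_second} images/s upper bound")
print(f"{seconds_at_upper_bound:.0f} s at the upper bound")
print(f"{seconds_at_100_per_second:.0f} s at 100 images/s")
```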

[Image: Screenshot from 2024-04-30 08-16-47]

dill-shower commented 2 months ago

Perhaps I did not describe the problem clearly enough. I am satisfied with the speed of the embedding computation, but once the embeddings are computed and the list of files to delete has been built, deleting them is very slow. Perhaps a screenshot will make it clearer.

[Image: KsJb]

Taking 12 minutes to delete 6,000 files is a lot, and this particular step takes up 80% of the total time for me. Technical information: I am using a fast SSD, and other programs can delete a thousand files per second (but they are not suitable for detecting similar screenshots). I use WSL. The files are on the Windows SSD. I tried moving them to the WSL file system, but that did not give any meaningful speedup for this step.
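One way to separate the file system from the script is to time plain os.remove calls on throwaway files in the directory in question. The sketch below is an assumption-laden micro-benchmark, not part of the pipeline: measure_delete_rate is a hypothetical helper, and the file count and file names are arbitrary.

```python
import os
import tempfile
import time


def measure_delete_rate(directory, count=1000):
    """Create `count` tiny files in `directory`, then time deleting them.

    Hypothetical helper for illustration; returns files deleted per second.
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"delete_bench_{i}.tmp")
        with open(path, "wb") as f:
            f.write(b"x")
        paths.append(path)

    # Only the deletion loop is timed, not the file creation above.
    start = time.perf_counter()
    for path in paths:
        os.remove(path)
    elapsed = time.perf_counter() - start

    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Benchmark inside a scratch directory that is cleaned up afterwards.
    with tempfile.TemporaryDirectory() as scratch:
        rate = measure_delete_rate(scratch)
        print(f"{rate:.0f} files/s in {scratch}")
```

Passing dir=... to tempfile.TemporaryDirectory places the scratch directory on a chosen drive, for example a folder on the Windows drive versus one inside the WSL file system, so the two rates can be compared. If both come out far above 7 files per second, the time is probably being spent elsewhere in the loop.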

cyber-meow commented 2 months ago

Normally this part finishes almost instantaneously. It just calls a few basic functions, as shown below:

        for sample_id in tqdm(samples_to_remove):
            img_path, _ = dataset[sample_id]
            os.remove(img_path)
            related_paths = get_related_paths(img_path)
            for related_path in related_paths:
                if os.path.exists(related_path):
                    os.remove(related_path)

It is hard to say why this is happening. You may want to run a profiler to see where the time goes.
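A minimal profiling sketch using the standard-library cProfile and pstats modules is shown below. It mirrors the shape of the loop above but is not the pipeline's code: the get_related_paths here is a hypothetical stand-in (the real helper's behavior is not shown in this thread), and remove_samples, profile_removal, and make_dummy_files are names invented for the example.

```python
import cProfile
import os
import pstats
import tempfile


def get_related_paths(img_path):
    # Hypothetical stand-in: assumes related files share the image's stem.
    # The pipeline's real helper may do something quite different.
    base, _ = os.path.splitext(img_path)
    return [base + ".txt", base + ".json"]


def remove_samples(img_paths):
    # Same structure as the quoted loop: remove the image, then related files.
    for img_path in img_paths:
        os.remove(img_path)
        for related_path in get_related_paths(img_path):
            if os.path.exists(related_path):
                os.remove(related_path)


def profile_removal(img_paths, top=10):
    # Profile only the removal loop and print the most expensive calls.
    profiler = cProfile.Profile()
    profiler.enable()
    remove_samples(img_paths)
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(top)
    return stats


def make_dummy_files(directory, count):
    # Create `count` fake images, each with one related .txt file.
    img_paths = []
    for i in range(count):
        img_path = os.path.join(directory, f"{i}.png")
        for path in (img_path, os.path.join(directory, f"{i}.txt")):
            with open(path, "wb") as f:
                f.write(b"x")
        img_paths.append(img_path)
    return img_paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        profile_removal(make_dummy_files(scratch, 500))
```

In the printed table, the cumulative time column shows whether the time goes to os.remove, to os.path.exists, or to the helper that looks up related paths. Running the unmodified script under `python -m cProfile -s cumulative` gives a similar table for the whole run.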