dill-shower opened 2 months ago
Hello,
This is not normal. I quickly tested on both my laptop (with 3070ti) and personal server (with 4090). On both I get around 11it/s. Knowing that I am processing by default 16 images per batch (set with detect_duplicate_batch_size
), this means we are processing around 100 images per second. Processing 6k images is effectively done within 1 minute. As I am using mobilenetv3 here the speed should not be an issue. The second stage is generally much slower and can effectively take hours (I am not sure if the cropping has been improved from waifuc since then).
Perhaps I have not described the problem clearly enough. I am satisfied with the speed of the embedding calculation, but once the embeddings are computed and the list of files to delete has been formed in the code, deleting them is very slow. Perhaps a screenshot will make it clearer.
12 minutes to delete 6,000 files is a lot, and this particular step takes 80% of my total time. Technical information: I am using a fast SSD, and other programs can delete a thousand files per second (but they are not suitable for computing similar screenshots). I use WSL, and the files are on the Windows SSD. I tried moving them to the WSL file system, but that did not give any meaningful speed gain.
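To separate the script from the filesystem as a cause, one can measure raw `os.remove` throughput on a batch of throwaway files. This is a minimal sketch, not part of the project; the file count and contents are arbitrary:

```python
import os
import tempfile
import time

# Create a batch of small temporary files to measure raw deletion speed.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(1000):
    p = os.path.join(tmp_dir, f"file_{i}.txt")
    with open(p, "w") as f:
        f.write("x")
    paths.append(p)

# Time the bare os.remove loop with no other work in between.
start = time.perf_counter()
for p in paths:
    os.remove(p)
elapsed = time.perf_counter() - start
print(f"Deleted {len(paths)} files in {elapsed:.3f}s "
      f"({len(paths) / elapsed:.0f} files/s)")
os.rmdir(tmp_dir)
```

Running this once inside the WSL filesystem and once under `/mnt/c` should show whether the slowness comes from the deletion loop itself or from the storage path.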
Normally this part is done almost instantaneously. It just calls some basic functions, as below:
```python
import os

from tqdm import tqdm

for sample_id in tqdm(samples_to_remove):
    img_path, _ = dataset[sample_id]
    os.remove(img_path)
    # Also remove any sidecar files (tags, metadata, ...) tied to the image.
    related_paths = get_related_paths(img_path)
    for related_path in related_paths:
        if os.path.exists(related_path):
            os.remove(related_path)
```
It is hard to say why this is the case. You may want to run profiling to see where the time is going.
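A minimal profiling sketch with the standard library's `cProfile`: here `delete_samples` is a hypothetical wrapper standing in for the deletion loop above, not a function from the project:

```python
import cProfile
import os
import pstats

def delete_samples(paths):
    # Hypothetical stand-in for the real deletion loop.
    for p in paths:
        if os.path.exists(p):
            os.remove(p)

# Profile the deletion and print the ten most expensive calls by
# cumulative time; os.remove dominating would point at the filesystem.
profiler = cProfile.Profile()
profiler.enable()
delete_samples([])  # replace [] with the real list of paths
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```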
At the first stage, the script quickly calculates the embeddings and proceeds to deletion, but deleting the files is very slow. The average speed is about 7 files per second while deleting roughly 50 thousand files, so removing the similar screenshots alone takes several hours.
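If the per-file syscall latency across the WSL/Windows boundary is the bottleneck, one possible workaround (a sketch, not part of the project) is to issue the removals from a thread pool; `os.remove` releases the GIL during the underlying syscall, so threads can overlap the latency:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def remove_quiet(path):
    # Ignore files that are already gone instead of raising.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

def remove_many(paths, workers=16):
    # Overlap per-file deletion latency by running removals in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(remove_quiet, paths))
```

Whether this helps depends on where the time is actually spent, which is why profiling first is worthwhile.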