Should I preprocess the SAM dataset?

I am trying to reproduce the results. For the SAM dataset download, I found this link: SAM

The dataset config file in this repo shows the SAM dataset divided into 4 subsets. However, the original SAM download is not organized that way: it is just .jpg and .json files in the root folder.

Therefore, I am wondering if I should write a script to split the dataset into 4 subsets. Here is my current code for doing that:

import os
import shutil

source_dir = 'SAM'
folders = ['0000', '0001', '0002', '0003']

for folder in folders:
    os.makedirs(os.path.join(source_dir, folder), exist_ok=True)

total_pairs = 11187

pairs_per_folder = total_pairs // 4
extra_pairs = total_pairs % 4  # This will be 3 in this case

folder_counts = [pairs_per_folder] * 4

for i in range(extra_pairs):
    folder_counts[i] += 1

cumulative_counts = [sum(folder_counts[:i+1]) for i in range(len(folder_counts))]

for i in range(1, total_pairs + 1):
    # Determine which folder the current pair belongs to
    if i <= cumulative_counts[0]:
        folder = folders[0]
    elif i <= cumulative_counts[1]:
        folder = folders[1]
    elif i <= cumulative_counts[2]:
        folder = folders[2]
    else:
        folder = folders[3]

    # Move both the .jpg and .json files
    for ext in ['jpg', 'json']:
        filename = f'sa_{i}.{ext}'
        src = os.path.join(source_dir, filename)
        dst = os.path.join(source_dir, folder, filename)
        if os.path.exists(src):
            shutil.move(src, dst)
        else:
            print(f"File not found: {src}")
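
A possible variation on the script above, as a minimal sketch only: instead of hard-coding the pair count and assuming consecutive sa_{i} names, it discovers the .jpg/.json pairs that are actually present and splits them with the same balancing rule. The function name split_into_subsets and the parameters num_subsets and dry_run are hypothetical names chosen for illustration, and the four-digit folder names simply follow the script above; whether the repo's config expects exactly this layout is still the open question.

```python
from pathlib import Path
import shutil


def split_into_subsets(source_dir, num_subsets=4, dry_run=True):
    """Plan (and optionally perform) a split of sa_*.jpg/.json pairs.

    Returns a dict mapping subfolder name -> list of file stems.
    All names here are illustrative, not part of any library or repo.
    """
    root = Path(source_dir)

    # Use only stems that have BOTH a .jpg and a .json in the root folder.
    # Note: stems are sorted lexicographically, so sa_10 comes before sa_2.
    stems = sorted(
        p.stem for p in root.glob("sa_*.jpg")
        if (root / f"{p.stem}.json").exists()
    )

    # Same balancing rule as the script above:
    # the first `extra` folders receive one additional pair.
    base, extra = divmod(len(stems), num_subsets)
    plan = {}
    start = 0
    for k in range(num_subsets):
        size = base + (1 if k < extra else 0)
        plan[f"{k:04d}"] = stems[start:start + size]
        start += size

    if not dry_run:
        for folder, names in plan.items():
            target = root / folder
            target.mkdir(exist_ok=True)
            for stem in names:
                for ext in ("jpg", "json"):
                    src = root / f"{stem}.{ext}"
                    shutil.move(str(src), str(target / src.name))

    return plan
```

Calling split_into_subsets('SAM') with the default dry_run=True only returns the planned assignment, so the folder sizes can be checked before any file is moved; passing dry_run=False performs the move. Files without a matching partner are left in the root folder.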

ali-vilab / AnyDoor
