Daner-Wang / VTC-LFC


Question for speed up #3

Open jameslahm opened 1 year ago

jameslahm commented 1 year ago

Thank you for your great work! In Table 2 of the paper, I see that pruning DeiT-Tiny increases throughput from 2648.7 to 4496.2 img/s. But in my local test, I found that the pruned DeiT-Tiny's throughput (1819) is similar to that of the original DeiT-Tiny (1760). I use the provided compressed DeiT-Tiny model (Acc@1: 71.6, https://drive.google.com/file/d/1NSq3SRxnObfl6oaFE5gHtjnhzm0Lfc6S/view?usp=sharing). My environment is an RTX 3090, and my throughput code is below:

import time

import torch


@torch.no_grad()
def throughput(data_loader, model, local_rank):
    model.eval()

    # Measure on the first batch only: 50 warmup iterations,
    # then time 30 iterations between CUDA synchronization points.
    for images, _ in data_loader:
        images = images.cuda(non_blocking=True)

        batch_size = images.shape[0]
        for i in range(50):  # warmup
            model(images)
        torch.cuda.synchronize()
        tic1 = time.time()
        for i in range(30):  # timed runs
            model(images)
        torch.cuda.synchronize()
        tic2 = time.time()
        imgs_per_sec = 30 * batch_size / (tic2 - tic1)
        if local_rank == 0:
            print("throughput averaged with 30 times")
            print(f"batch_size {batch_size} throughput {imgs_per_sec}")
        return
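For reference, the same warmup-then-time pattern can be sanity-checked without a GPU or data loader; the sketch below is a CPU variant of the code above using `time.perf_counter`, with a dummy model and batch standing in for the DeiT checkpoint and ImageNet loader (both are assumptions for illustration only):

```python
import time

import torch
import torch.nn as nn


def throughput_cpu(model, images, warmup=5, iters=10):
    """Warm up, then time `iters` forward passes and return images/second."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):  # warmup iterations, not timed
            model(images)
        t0 = time.perf_counter()
        for _ in range(iters):  # timed iterations
            model(images)
        t1 = time.perf_counter()
    return iters * images.shape[0] / (t1 - t0)


# Dummy stand-ins; replace with the real model and input batch.
model = nn.Linear(16, 16)
imgs = torch.randn(8, 16)
print(f"throughput {throughput_cpu(model, imgs):.1f} img/s")
```

On GPU the `torch.cuda.synchronize()` calls from the original snippet are still needed, since CUDA kernels launch asynchronously.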

I wonder if I did something wrong. Would you mind sharing your code for testing throughput? Thanks a lot.

Daner-Wang commented 1 year ago

Thank you for your comments. In our evaluation, we measure the inference time of all MHSA and FFN modules in the model to estimate its throughput. We ran a comparison using your code and found that your results may be affected by the token selection function, which is not yet well optimized. Thank you for helping us find this problem; we will try to optimize this function to make it faster.
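The module-level timing described above can be sketched with PyTorch forward hooks. This is not the authors' actual evaluation code; it is a minimal illustration that sums per-module forward time. The `attn`/`mlp` suffix matching follows timm's DeiT block naming and is an assumption; the toy `Block` model is a CPU stand-in for a real transformer block.

```python
import time

import torch
import torch.nn as nn


def time_mhsa_ffn(model, images, n_iters=30):
    """Sum forward-pass wall time of every MHSA/FFN sub-module over n_iters runs."""
    totals = {}

    def make_hooks(name):
        def pre_hook(module, inputs):
            module._t0 = time.perf_counter()  # mark entry time on the module

        def post_hook(module, inputs, output):
            totals[name] = totals.get(name, 0.0) + (time.perf_counter() - module._t0)

        return pre_hook, post_hook

    handles = []
    for name, module in model.named_modules():
        # Assumed naming convention (timm DeiT): blocks expose .attn and .mlp
        if name.endswith(("attn", "mlp")):
            pre, post = make_hooks(name)
            handles.append(module.register_forward_pre_hook(pre))
            handles.append(module.register_forward_hook(post))

    model.eval()
    with torch.no_grad():
        for _ in range(n_iters):
            model(images)

    for h in handles:
        h.remove()
    return totals


# Toy usage: two "blocks" whose attn/mlp are simple linear layers.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)  # stand-in for MHSA
        self.mlp = nn.Linear(8, 8)   # stand-in for FFN

    def forward(self, x):
        return self.mlp(self.attn(x))


toy = nn.Sequential(Block(), Block())
totals = time_mhsa_ffn(toy, torch.randn(4, 8), n_iters=3)
print(totals)
```

For GPU timing, CPU-side hook timestamps only bound kernel launch times, so CUDA events or a `torch.cuda.synchronize()` in the hooks would be needed for accurate numbers.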