AayushKrChaudhary / RITnet

This is a winning model of the OpenEDS Semantic Segmentation Challenge
MIT License

Question about the FPS of RITnet #11

Closed: QJieWang closed this issue 1 year ago

QJieWang commented 1 year ago

Hello, I have some questions about the speed of RITnet. The paper reports a speed of 301 Hz on a 1080 Ti, but when I tested it on a 3090, the highest speed I achieved was only 191 FPS. I am very curious about this huge difference. Is the 301 FPS figure the result of further model compression? Here is the code I used to calculate the FPS.

import time
import torch
from model import model_dict

model = model_dict["densenet"]

model = model.to(device="cuda:1")
# random input
dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float).to(device="cuda:1")
_ = model(dummy_input)  # warm-up forward pass (the first call is slower due to CUDA initialization)

model.eval()
# torch.cuda.synchronize()

# Run model, calculate FPS
num_iterations = 100  
total_time = 0
for i in range(num_iterations):
    start_time = time.time()
    output_tensor = model(dummy_input)
    end_time = time.time()
    total_time += end_time - start_time

fps = num_iterations / total_time
print("FPS: {:.2f}".format(fps))
QJieWang commented 1 year ago

Actually, there is a third-party speed test of RITnet. In "Semantic Segmentation of the Eye With a Lightweight Deep Network and Shape Correction," the authors mention testing RITnet on a 1080 Ti with 1440 images, which took 22.75 seconds, roughly 63.3 FPS (1440 / 22.75 ≈ 63.3). Therefore, I'm very curious how the segmentation speed of 301 Hz mentioned in the paper was achieved. Was the model compressed or quantized? Or, perhaps, was the batch size mistakenly included when calculating FPS?

gabrielDiaz-performlab commented 1 year ago

Hello, QJieWang!

I’m not really able to answer your question myself, but I have forwarded it to the corresponding authors of that manuscript.

I don’t think it was a typo, because we did have it running in real time on a Pupil Labs mobile unit at one point. I believe the cameras are 120 Hz per eye, and that we had it running at 240 Hz at that time. I could be wrong, because the frame rate can be adjusted a bit in the Pupil Labs system.

Anyhow, both of the original corresponding authors have since moved on from the lab, so it may be a little while before they get back to us. In any case, thanks for reaching out.

RSKothari commented 1 year ago

Hello @gabrielDiaz-performlab, @QJieWang. I did respond to @QJieWang via email, but it seems you weren't CC'ed on his original email. Please see my response below.

I reviewed your code sample and have a suggestion: please consider changing your dummy data to torch.float32. As far as I recall, the model was designed to work with 32-bit eye images (or possibly 16-bit images; I'm not certain, but Aayush can confirm this). Your approach to computing FPS is also correct.

After casting your data to either 32-bit or 16-bit, please get back to us with the results.

As a follow-up to my previous email, I also recommend running approximately 10k iterations and recording the per-iteration time intervals in a list. Once you've collected the delta time intervals, you can calculate the median duration and report the FPS based on that median. The median is a more robust measure for assessing real-time capabilities.
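
A minimal sketch of that suggestion (my illustration, with explicit synchronization added so each delta covers one full forward pass; model and dummy_input as defined in the code above):

import numpy as np
import time
import torch

deltas = []
with torch.no_grad():
    for _ in range(10000):           # ~10k iterations, as suggested
        torch.cuda.synchronize()
        start = time.time()
        _ = model(dummy_input)
        torch.cuda.synchronize()
        deltas.append(time.time() - start)

median_dt = np.median(deltas)        # robust to warm-up and scheduling outliers
print("Median FPS: {:.2f}".format(1.0 / median_dt))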

QJieWang commented 1 year ago

RITnet is designed to work with 32-bit eye images, and in PyTorch, torch.float32 and torch.float are equivalent. For clarification, I modified the experimental code to print the data type of dummy_input during the experiment. However, the speed still did not change. I also added some additional metrics, such as the maximum, minimum, first, last, and median values of FPS, to evaluate the model. Unfortunately, I still cannot achieve a speed surpassing 191 FPS on a 3090. Here is my test code:

import numpy as np
import time
import torch
from model import model_dict
from tqdm import tqdm

model = model_dict["densenet"]

model = model.to(device="cuda:2")
# random input
dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float).to(device="cuda:2")
# dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float32).to(device="cuda:2")
print(F"the Type of dummy_input is {dummy_input.dtype} ")
_ = model(dummy_input)

model.eval()
# torch.cuda.synchronize()

# Run model, calculate FPS
FPS_list = []
for test in tqdm(range(100)):
    torch.cuda.synchronize()  # finish any pending GPU work before timing this round
    num_iterations = 100
    total_time = 0
    for i in range(num_iterations):
        start_time = time.time()
        output_tensor = model(dummy_input)
        end_time = time.time()
        total_time += end_time - start_time

    fps = num_iterations / total_time
    FPS_list.append(fps)
FPS = np.array(FPS_list)
print("The First Test FPS: {:.2f}".format(FPS[0]))
print("The MAX FPS: {:.2f}".format(max(FPS)))
print("The MIN FPS: {:.2f}".format(min(FPS)))
print("The Last Test FPS: {:.2f}".format(FPS[-1]))
print("The Median FPS: {:.2f}".format(np.median(FPS)))
# if set dummy_input torch.float16 it will raise error
# Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
# random input
# dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float16).to(device="cuda:2")
# print(F"the Type of dummy_input is {dummy_input.dtype} ")
# _ = model(dummy_input)

# model.eval()
# # torch.cuda.synchronize()

# # Run model, calculate FPS
# num_iterations = 100
# total_time = 0
# for i in range(num_iterations):
#     start_time = time.time()
#     output_tensor = model(dummy_input)
#     end_time = time.time()
#     total_time += end_time - start_time

# fps = num_iterations / total_time
# print("FPS: {:.2f}".format(fps))
AayushKrChaudhary commented 1 year ago

The code looks good. The forward pass is 300 FPS, and it was tested multiple times. I have a couple of suggestions: 1) compute for around 10,000 iterations instead of 100; 2) ignore the times of the first few iterations, since I have seen the initial iterations run slower while things start up. If possible, also check the end_time - start_time of the last completed iteration to get a sense of the steady-state per-frame computation time, as in the sketch below.
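
A minimal sketch of suggestion 2, reusing the variables from the timing loop above: after the loop finishes, end_time - start_time still holds the duration of the last completed iteration.

last_delta = end_time - start_time   # duration of the final iteration only
print("Last-iteration FPS: {:.2f}".format(1.0 / last_delta))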

Regardless, the time was computed in a similar fashion to yours, except that the number of iterations was larger.

Regarding the comparison in the other paper, their test mostly included complete image reading, image preprocessing, and the forward pass, which drops the speed to around 60 FPS. The code is in Python, and the speed can be improved by using C++ and by running at half the resolution.
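
A rough illustration of the half-resolution point (my sketch, not code from the thread; it assumes the network's down/up-sampling path accepts 320x200 inputs):

import torch.nn.functional as F

# downsample the 640x400 input to 320x200, roughly quartering the convolutional work
dummy_half = F.interpolate(dummy_input, scale_factor=0.5,
                           mode="bilinear", align_corners=False)
with torch.no_grad():
    _ = model(dummy_half)            # same model, ~4x fewer input pixels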

QJieWang commented 1 year ago
  1. Regarding the 10,000 iterations: although I do not believe that increasing the iteration count from 100 to 10,000 can raise the maximum measured speed from 190 FPS to over 300 FPS, I increased the number of iterations as you suggested. To avoid interference from slower runs, I selected the top 10 results from the 10,000 iterations for analysis. Even on a 3090, a speed of 300 FPS cannot be achieved.

  2. Regarding the experiments in the other paper, which measured only about 63 FPS on the same device (1080 Ti): they added image preprocessing and postprocessing. However, RITnet itself does not perform any postprocessing. "Semantic Segmentation of the Eye With a Lightweight Deep Network and Shape Correction" pursued a small number of model parameters, resulting in very poor raw segmentation, so postprocessing was added to improve the result; this is different from RITnet, whose segmentation result is very good and does not require postprocessing. Therefore, their measurement of RITnet's FPS only adds the operation of loading the image, and I do not believe that this operation alone can reduce 300 FPS to 63 FPS; that is simply unrealistic.

  3. Finally, the RITnet paper does not describe the specific experimental procedure; it only states that a segmentation speed of 301 Hz was achieved on the 1080 Ti, which is very different from my experimental results. Although RITnet's segmentation result is very good and achieved top 1, I personally tend to believe that the batch size was included when the authors measured FPS, inflating the number. For example, sending 5 images to the 1080 Ti at a time and returning the results in real time would inflate the FPS by a factor of 5 (see the sketch below). Although this operation is reasonable and correct for eye-tracking hardware, it does not conform to the definition of FPS in computer vision.
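
A minimal sketch of the scenario described in point 3 (my illustration, not the authors' code; it reuses the model from the test code below, and the batch size of 5 is the hypothetical from above). A batched forward pass measures per-image throughput rather than per-frame latency:

batch_input = torch.randn(5, 1, 640, 400, device="cuda:2")  # batch of 5 eye images
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        _ = model(batch_input)
    torch.cuda.synchronize()
    elapsed = time.time() - start

throughput = 100 * 5 / elapsed   # images per second: the inflated "FPS"
latency_fps = 100 / elapsed      # forward passes per second: the true frame rate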

RITnet is great work and has given me a lot of inspiration, but my doubts about the amazing 300 FPS segmentation speed have not been resolved. Here is my test code:

import numpy as np
import time
import torch
from model import model_dict
from tqdm import tqdm

model = model_dict["densenet"]

model = model.to(device="cuda:2")
# random input
dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float).to(device="cuda:2")
# dummy_input = torch.randn(1, 1, 640, 400, dtype=torch.float32).to(device="cuda:2")
print(F"the Type of dummy_input is {dummy_input.dtype} ")
_ = model(dummy_input)

model.eval()
# torch.cuda.synchronize()

# Run model, calculate FPS
FPS_list = []
for test in tqdm(range(10000)):
    torch.cuda.synchronize()
    num_iterations = 100
    total_time = 0
    for i in range(num_iterations):
        start_time = time.time()
        output_tensor = model(dummy_input)
        end_time = time.time()
        total_time += end_time - start_time

    fps = num_iterations / total_time
    FPS_list.append(fps)
# sort all FPS values in descending order and keep the top 10
FPS_list = sorted(FPS_list, reverse=True)[:10]
FPS = np.array(FPS_list)
print("The First Test FPS: {:.2f}".format(FPS[0]))
print("The MAX FPS: {:.2f}".format(max(FPS)))
print("The MIN FPS: {:.2f}".format(min(FPS)))
print("The Last Test FPS: {:.2f}".format(FPS[-1]))
print("The Median FPS: {:.2f}".format(np.median(FPS)))
# (Same commented-out float16 test as in the previous code block; running it raises:
#  "Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same")
RSKothari commented 1 year ago

@QJieWang Your approach to computing FPS seems correct. I personally wouldn't be opposed to you reporting RITnet's FPS performance using the above-mentioned approach. We will conduct another round of verification on our end at a later date and provide an updated RITnet FPS figure on GitHub (especially regarding the batch size). Since the FPS does not change our message, academic contributions, or core concept, we still stand by RITnet's evaluation. We hope this unblocks you and your analysis.

QJieWang commented 1 year ago

Thank you very much; RITnet is excellent work. I apologize for bothering you.