cap-lab / jedi

Jetson embedded platform-target deep learning inference acceleration framework with TensorRT
GNU General Public License v2.0

Reproduce the FP16 table results #14

Open uelordi01 opened 1 year ago

uelordi01 commented 1 year ago

Hi @cap-lab, I am trying to reproduce the FPS results of the table (FP16). I took yolov4.cfg with FP16 weight precision as an example. My question is about how these FPS values are calculated. I compiled the code using the tensorrt8_support branch (https://github.com/cap-lab/jedi/tree/tensorrt8_support) and your modified tkDNN with tensorrt8_experiment. My Jetson AGX configuration is JetPack 4.6.2 with TensorRT 8.2.1.8.

Running jedi/build/bin/proc gives me the following output.

TENSORRT LOG: [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +113, now: CPU 44, GPU 684 (MiB)
stuckWhile(front thread: 0): 7828
device id: 0, stream_id: 0, executed_num: 1238
device id: 0, stream_id: 1, executed_num: 1238
device id: 0, stream_id: 2, executed_num: 1238
device id: 0, stream_id: 3, executed_num: 1238
stuckWhile(device id: 0): 1283401
device id: 1, stream_id: 0, executed_num: 1239
device id: 1, stream_id: 1, executed_num: 1237
device id: 1, stream_id: 2, executed_num: 1238
device id: 1, stream_id: 3, executed_num: 1238
stuckWhile(device id: 1): 1266358
device id: 2, stream_id: 0, executed_num: 1238
device id: 2, stream_id: 1, executed_num: 1238
device id: 2, stream_id: 2, executed_num: 1238
device id: 2, stream_id: 3, executed_num: 1238
stuckWhile(device id: 2): 1237383
stuckWhile(back thread: 0): 1204581

inference time: 209.684
average latency (0): 79686

Digging into the code, I see that the 209.684 inference time is the elapsed time (in seconds) between the start and end timestamps of the program execution (proc):

    start_time = getTime();
    // launch one worker thread per instance
    for(int iter = 0; iter < instance_num; iter++) {
        instance_threads.push_back(std::thread(runInstanceThread, &(instance_threads_data[iter])));
    }

    // wait until every instance has finished processing its images
    for(int iter = 0; iter < instance_num; iter++) {
        instance_threads[iter].join();
    }

    // elapsed time divided by 1,000,000, i.e. microseconds converted to seconds
    inference_time = (double)(getTime() - start_time) / 1000000;
    std::cout<< std::endl <<"inference time: "<<inference_time<<std::endl;

The other value is the average latency, calculated in microseconds by getAverageLatency(iter, &config_data, latencies[iter]).

So I guessed that the FPS value could be calculated from the inference time: 1 / (209.684 / 4952 images) ≈ 23 FPS.

The other approach I thought of is to take 1 / (average latency). The average latency is 79.686 ms, so FPS = 1 / 0.079686 ≈ 12.55 FPS.

The table reports 128 FPS for YOLOv4, while my two approaches give roughly 23 FPS and 12.55 FPS, so both are far from what I would expect.
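For reference, a minimal standalone sketch of how I computed those two numbers (it only redoes the arithmetic above; the variable names are mine, nothing is taken from the jedi code):

    #include <iostream>

    int main() {
        const double num_images       = 4952.0;    // sample_size used in the run
        const double inference_time_s = 209.684;   // "inference time" printed by proc, in seconds
        const double avg_latency_us   = 79686.0;   // "average latency (0)" printed by proc, in microseconds

        // approach 1: overall throughput of the run
        const double fps_throughput = num_images / inference_time_s;    // ~23.6 FPS

        // approach 2: inverse of the per-image average latency
        const double fps_from_latency = 1.0 / (avg_latency_us / 1.0e6); // ~12.55 FPS

        std::cout << "throughput FPS:    " << fps_throughput   << std::endl;
        std::cout << "latency-based FPS: " << fps_from_latency << std::endl;
        return 0;
    }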

For this reason, could you give me a hint on how the FPS should be calculated so that I can reproduce the results? Thank you in advance. Unai.

urmydata commented 1 year ago

Hi,

The FPS is calculated as the number of images divided by the total inference time, as in your first approach.

The yolov4.cfg file that you mentioned is not the configuration file we used for the final result. Please check Table 5 in our paper.

Also, we used the MAXN power mode with the frequencies fixed by the jetson_clocks command. In addition, you need to explore the technique parameters on your own board with JetPack 4.6. Since your Xavier board runs a different JetPack version, it may be difficult to reproduce exactly the same result; as I remember, the results were slower on JetPack 4.6.

The video below shows the result of our experiment (I ran it just before posting this): https://github.com/cap-lab/jedi/assets/20039661/3eb58a84-80e1-4aa1-9c0c-b07a51d6bf8f The result shows an inference time of about 40 seconds, which corresponds to about 124 FPS (4952 / 40).

Thanks

uelordi01 commented 1 year ago

Ok, understood, thank you for the clarification :) So I changed the configuration based on Table 5 to the PND-A setup in the paper (2 pipeline stages, PEs = (2 DLA, GPU)), and I get more or less 80 FPS. The idea was not to reproduce the experiments exactly; I just wanted to be sure that my FPS calculation was right and that the configuration files I was creating were correct. Below is the configuration I used, following the Table 5 results of the paper (for YOLOv4). If I am missing something, please tell me; otherwise you can close the issue.

    configs = {
        instance_num = "1"
        instances = ( {
            network_name = "yolo4";
            model_dir = "/sdcard/chjej202/models_temp/";
            model_dir = "/media/jetson/SD/uelordi_experiments/jedi/data/bin/yolo4";
            bin_path = "/media/jetson/SD/uelordi_experiments/jedi/data/bin/yolo4";
            cfg_path = "/media/jetson/SD/uelordi_experiments/jedi/data/cfg/yolo4_relu.cfg";
            image_path = "/media/jetson/SD/uelordi_experiments/jedi/paper_experiments/experiment_images.txt"
            calib_image_path = "/media/jetson/SD/uelordi_experiments/jedi/data/all_images.txt";
            calib_images_num = "100";
            calib_table = "/sdcard/chjej202/models2/yolov4/model416x416_0.268_DLA_INT8_1-calibration.table";
            name_path = "/media/jetson/SD/uelordi_experiments/jedi/data/coco.names";
            batch = "1"
            offset = "0";
            sample_size = "4952"

            device_num = "2"
            pre_thread_num = "1"
            post_thread_num = "1"
            buffer_num = "5"
            cut_points = "82,268"
            streams = "4,2"
            devices = "DLA,GPU"
            dla_cores = "0,1"
            data_type = "FP16"
        } )
    }

urmydata commented 1 year ago

Hi,

Can you change dla_cores = "0,1" to dla_cores = "2,1"? dla_cores = "0,1" means that DLA 0 is used for the first stage. To apply the PND technique, dla_cores needs to be changed to something like dla_cores = "2,1": if the value is greater than or equal to the number of cores (the number of DLAs), then the PND technique is applied.
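In other words, using the configuration you posted, only the dla_cores line changes; a minimal sketch of the relevant lines (the comments are just my illustration of the rule above):

    devices = "DLA,GPU"
    # dla_cores = "0,1"   stage 1 is pinned to DLA core 0, so PND is not applied
    dla_cores = "2,1"     # 2 >= number of DLA cores on Xavier, so PND is applied to the DLA stage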

Thanks