Closed mingyuShin closed 7 months ago
Thank you for your great work!
I'm a beginner in this field. When measuring evaluation ranges (i.e., short, long), shouldn't we measure both in one model and publish it in the paper? Did FIERY, for example, train two models with different resolutions for each range and measure performance?
Hi,
Since the resolutions are different, the model cannot handle the two range settings with the same weights. So even with the same grid-map size, it needs to be trained separately. In this respect, FIERY and PowerBEV are the same.
Thank you for your fast reply!
When I evaluated the pretrained FIERY checkpoint directly, I ran inference for one epoch (5119 iterations), and both the short and long values were output simultaneously, matching the numbers in the paper. Looking at FIERY's evaluation.py, it appears that a single model predicts the full 100m x 100m region, which is then evaluated at both 100m x 100m and 30m x 30m by cropping. This differs slightly from PowerBEV's reporting method. I'm just seeking clarification: the evaluation is a little different, right?
I measured the performance of a pretrained model with a 0.5m grid resolution over a 100m x 100m area on the smaller 30m x 30m region. The results are as follows:
Testing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5119/5119 [1:00:16<00:00, 1.42it/s]
========================== Metrics ==========================
val_iou_background: 0.979128360748291
val_iou_dynamic: 0.618949830532074
val_pq_dynamic: 0.5224682688713074
val_sq_dynamic: 0.7617481350898743
val_rq_dynamic: 0.6858805418014526
val_denominator_dynamic: 85233.5
========================== Runtime ==========================
perception_time: 0.6085988915869898
prediction_time: 0.030737586728375215
postprocessing_time: 0.021788783454368073
total_time: 0.6611252617697331
=============================================================
Testing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5119/5119 [1:00:17<00:00, 1.41it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss/flow_uncertainty': -1.498396635055542,
'test_loss/instance_flow': 2.249345541000366,
'test_loss/segmentation': 2.8715157508850098,
'test_loss/segmentation_uncertainty': 1.2302122116088867,
'vpq': 0.5224682688713074}
--------------------------------------------------------------------------------
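For what it's worth, the cropping-based evaluation described above can be sketched roughly as follows. This is a minimal illustration, not FIERY's actual evaluation.py code; the function name and the 0.5m / 100m / 30m numbers are taken from the setting discussed in this thread, and the grid layout (ego at the center of a square map) is assumed.

```python
import numpy as np

def center_crop_bev(bev, crop_extent_m, cell_size_m):
    """Crop the central square of a BEV map for short-range evaluation.

    bev: array of shape (..., H, W) covering the full range, ego at the center.
    crop_extent_m: side length of the evaluation window in meters.
    cell_size_m: grid resolution in meters per cell.
    """
    h, w = bev.shape[-2:]
    crop = int(round(crop_extent_m / cell_size_m))
    top = (h - crop) // 2
    left = (w - crop) // 2
    return bev[..., top:top + crop, left:left + crop]

# A 100m x 100m map at 0.5m resolution is a 200 x 200 grid;
# the 30m x 30m short-range window is the central 60 x 60 cells.
full_range = np.zeros((200, 200))
short_range = center_crop_bev(full_range, 30.0, 0.5)
```

Under this scheme, one model produces a single full-range prediction, and the short-range metrics are simply computed on the cropped center, which is why both numbers can come out of one inference pass.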
Thanks for the correction. Yes, you are right. I re-compared our evaluation with FIERY's. The original FIERY predicts a single range and then trims the results, while the FIERY‡ (repr.) we reimplemented in our paper uses the same evaluation strategy as PowerBEV. Although I don't think this makes an essential difference, for a fair comparison I would recommend following FIERY's evaluation strategy for PowerBEV as well.
Thank you for sharing your opinion. Since FIERY‡ (repr.) is mentioned in the implementation section of your paper, I also think there should be no problem. Thank you for the quick response!