PKU-EPIC / MaskClustering


Problems with evaluation script and scannetpp #2

Closed rfsantacruz closed 3 months ago

rfsantacruz commented 3 months ago

Congratulations on the amazing work and thank you for sharing this user-friendly code base.

I would like to ask for help with a few issues:

1) It appears that each scene contains more iPhone/RGB images than the number of cameras defined in its iPhone/colmap files. Is this correct, or is there something wrong with my setup? If it is correct, could this be the reason the segmented point clouds are missing pieces of the sampled point cloud?

2) Following the provided instructions, I was able to set up the ScanNet++ dataset and run the run.py script up to the class-agnostic evaluation step. However, I'm encountering issues with this evaluation: it outputs NaN for all classes except doors. Despite this, the setup and code seem correct, as I can visualize the segmentation results and they look good. I have also tried evaluating single scenes to simplify the problem, but I still only get metrics for the door class. I have attached the ground truth and prediction files for one scene in case it helps in identifying the problem.

a24f64f7fb.zip

Could you please assist me with these issues? Your help would be greatly appreciated.

MiYanDoris commented 3 months ago

Thanks for your feedback!

Since I need to check some data details for your first question, I'll answer your second question first : )

Don't worry, this is the expected behavior. In the class-agnostic evaluation setting, we ignore the correctness of class labels, so we simply change the categories of all GT and predicted instances to the first class in the list, i.e., door. The evaluation script outputs NaN for every category with no instances in the scene, which is why you see NaN for all classes except doors.

This is actually the simplest way to implement it: we only need to add four lines (lines 261, 262, 282, 283) to the standard semantic segmentation evaluation script.
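Conceptually, the added lines just do something like this (a minimal sketch with a hypothetical `instances` list of dicts, not the actual evaluation script):

```python
# Minimal sketch of the class-agnostic trick (hypothetical data layout):
# remap every instance label, in both the ground truth and the predictions,
# to one fixed class before running the standard instance evaluation.

FIRST_CLASS_ID = 0  # e.g. the label ID of 'door', the first class in the list

def make_class_agnostic(instances):
    """Overwrite each instance's semantic label with the same class ID."""
    for inst in instances:
        inst['label_id'] = FIRST_CLASS_ID
    return instances

# gt_instances = make_class_agnostic(load_gt(scene))        # hypothetical loaders
# pred_instances = make_class_agnostic(load_predictions(scene))
# ...then run the unchanged semantic evaluation; only 'door' gets a score.
```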

MiYanDoris commented 3 months ago

Regarding the first question, I share your observation that the number of annotated frames in 'images.txt' is much smaller than the number of RGB frames. This is likely because:

  1. Neighboring frames are nearly identical, so only 1 out of every 10 frames is kept.
  2. According to the ScanNet++ paper, iPhone frames are filtered out if the average depth difference between iPhone depth and laser scan depth is greater than 0.3m.
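
As a rough illustration of that second filter (my own sketch, not the ScanNet++ preprocessing code; `iphone_depth` and `laser_depth` are assumed to be aligned per-pixel depth maps in meters):

```python
import numpy as np

def keep_frame(iphone_depth: np.ndarray, laser_depth: np.ndarray,
               max_mean_diff: float = 0.3) -> bool:
    """Keep a frame only if the mean absolute difference between the iPhone
    depth and the depth rendered from the laser scan is below the threshold."""
    valid = (iphone_depth > 0) & (laser_depth > 0)  # ignore missing depth
    if not valid.any():
        return False
    mean_diff = np.abs(iphone_depth[valid] - laser_depth[valid]).mean()
    return mean_diff <= max_mean_diff
```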

As for the missing pieces in the segmented point cloud, there are two main reasons:

  1. As you mentioned, some regions are poorly scanned by the RGB images, so they aren't captured in the 2D masks or the final results. This typically happens in occluded areas, such as a wall behind a sofa. I empirically find that adding more neighboring frames doesn't help with this issue.
  2. Following OVIR-3D, we use a point filtering strategy in the 'filter_point' function in 'utils/post_process.py' to remove points with a low detection ratio. You can decrease the 'point_filter_threshold' in the config to retain more points, but this will also include more noise, so you'll need to balance this trade-off (see the sketch below).
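
The idea behind that filtering looks roughly like this (a simplified sketch of the behavior described above, not the actual 'filter_point' code; `detected_count`, `visible_count`, and the default threshold are hypothetical):

```python
import numpy as np

def filter_points_by_detection_ratio(points, detected_count, visible_count,
                                     point_filter_threshold=0.5):
    """Drop points that are rarely covered by 2D masks in the frames where
    they are visible. Lowering the threshold keeps more points (filling
    holes) at the cost of more noise."""
    visible = np.maximum(visible_count, 1)   # avoid division by zero
    ratio = detected_count / visible         # per-point detection ratio
    keep = ratio >= point_filter_threshold
    return points[keep]
```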

Hope this helps! Let me know if you have any other questions.

rfsantacruz commented 3 months ago

Thanks for your answer, but the regions I am referring to are not those. They are large, clearly visible chunks. For example, in scene a24f64f7fb, the region shown in the attached screenshots can be seen in the mesh_aligned_0.05.ply file, the sampled point cloud, and the ground-truth instance segmentation point cloud.

However, it is missing from the segmentation produced by your method (screenshots attached).

I was wondering why this happens. Upon checking my dataset, I noticed that for frames 600-1580, which cover this region, I do not have depth images in a24f64f7fb/iPhone/render_depth, nor corresponding lines in the COLMAP files at a24f64f7fb/iPhone/colmap; I only have RGB frames in a24f64f7fb/iPhone/rgb.
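
In case it is useful, this is roughly how I checked which frames lack poses (a diagnostic sketch assuming .jpg frames and COLMAP's text format, where images.txt stores two lines per registered image):

```python
from pathlib import Path

scene = Path('a24f64f7fb/iPhone')
rgb_frames = {p.stem for p in (scene / 'rgb').glob('*.jpg')}

# COLMAP's images.txt: '#' comment lines, then two lines per registered
# image; the first (metadata) line ends with the image NAME.
registered = set()
with open(scene / 'colmap' / 'images.txt') as f:
    data_lines = [l for l in f if l.strip() and not l.startswith('#')]
for meta in data_lines[::2]:
    name = meta.split()[-1]
    registered.add(Path(name).stem)

missing = sorted(rgb_frames - registered)
print(f'{len(missing)} RGB frames have no COLMAP pose, e.g. {missing[:5]}')
```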

I would like to know whether this issue is due to a mistake I made during setup or an actual problem with the dataset. Also, do you think it can impact the evaluation for this scene?

MiYanDoris commented 3 months ago

Thank you for the detailed visualization. Now I understand the problem you're facing. I found a similar issue reported against the ScanNet++ dataset (link), indicating that this might be a problem with the dataset itself. For the specific scene a24f64f7fb, you could try running COLMAP again: the region is not entirely textureless, so COLMAP should be able to register the frames.

Regarding the impact of this dataset issue on our work: it will certainly affect the performance of our method in this particular scene. However, since this failure occurs in only a few ScanNet++ scenes and we rarely encounter such severe registration failures in our real deployments, we believe it does not significantly affect the overall results.

Hope this helps!

MiYanDoris commented 3 months ago

By the way, the lines for frames 600-1580 are also missing in my copy of the ScanNet++ dataset, so this is definitely not a mistake you made during setup :)

rfsantacruz commented 3 months ago

Thanks for helping, and great work again. I think your method could achieve even better performance once this dataset issue is resolved. Thanks again; I will close this issue.