OpenMask3D / openmask3d

Evaluation on Scannet200 #18

Closed Bozcomlekci closed 5 months ago

Bozcomlekci commented 5 months ago

Greetings,

When I run the evaluation code on the ScanNet200 validation set, I obtain results that differ from the ones you report. I didn't change any hyperparameters or the provided config files, except for pointing the paths to the correct locations of the data files. I attached the printout of the results.

eval.txt

I used the provided mask proposal model trained on the ScanNet200 training set. I performed the evaluation on the 312 scans belonging to the eval split. Is there something wrong with my evaluation?

Thanks in advance.

aycatakmaz commented 5 months ago

Hi Batuhan,

In the past months, several interested users have confirmed that they were able to reproduce our results, so I believe there could be an issue with your evaluation setup. To help us understand the issue better, could you share the RGB-D image dimensions of the ScanNet images you are using, as well as the camera intrinsics?

Furthermore, have you followed the pre-processing step from our README (titled "Step 1: Download and pre-process the ScanNet200 dataset")? This is important for ensuring that the mask proposals are the same as in our experiments. If you confirm that you have closely followed the pre-processing step, it would be very helpful to see a visualization of the predicted instance masks, so we can tell whether the issue lies in the mask proposals or in the feature computation stage. If the masks look reasonable, I recommend enabling the "save_crops" option in our config, which helps with visually inspecting whether the image crops are correct/reasonable.
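For reference, a quick sanity check along these lines can already reveal a mismatch (a minimal sketch; the paths below are placeholders and depend on where your extracted ScanNet data lives):

```python
from pathlib import Path
from PIL import Image
import numpy as np

scene_dir = Path("dataset/scans/scene0011_00")  # placeholder scene path

# Print the resolution of a few color and depth frames.
for sub in ("data_compressed/color", "data_compressed/depth"):
    for frame in sorted((scene_dir / sub).iterdir())[:3]:
        with Image.open(frame) as img:
            print(f"{sub}/{frame.name}: {img.size[0]} x {img.size[1]}")

# Print the 4x4 color and depth intrinsics stored as plain-text matrices.
for name in ("intrinsic_color.txt", "intrinsic_depth.txt"):
    K = np.loadtxt(scene_dir / "data" / "intrinsic" / name)
    print(name)
    print(K)
```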

Best, Ayca

Bozcomlekci commented 5 months ago

For example, in the scene scene0011_00,

the RGB-D image dimensions of the ScanNet images are:

data_compressed/color/*.jpg: 1296 × 968 pixels
data_compressed/depth/*.png: 640 × 480 pixels

Intrinsics: data/intrinsic/intrinsic_color.txt:

1169.621094 0.000000 646.295044 0.000000
0.000000 1167.105103 489.927032 0.000000
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 0.000000 1.000000

data/intrinsic/intrinsic_depth.txt:

577.590698 0.000000 318.905426 0.000000
0.000000 578.729797 242.683609 0.000000
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 0.000000 1.000000

data/pose/*.txt:

0.606497 0.359513 -0.709163 5.898605
0.793947 -0.321582 0.515978 1.464963
-0.042553 -0.875977 -0.480473 1.329018
0.000000 0.000000 0.000000 1.000000

The extrinsics are identity.

I am using the point cloud file scene0011_00_vh_clean_2.ply

The evaluation scans are under a folder named scans, i.e. dataset/scans/scene0011_00, and each scene folder (e.g. scene0011_00) contains the structure described in the "Step 2: Check the format of ScanNet200 dataset" part of your repo, as well as the raw .sens files that the RGB-D data was extracted from. The dataset/scans folder also contains the training scans, but their RGB-D data is not extracted inside their scene folders, since I don't use them for the validation.

data/processed/scannet                              <- the ScanNet200 pre-processing output folder
 ├── instance_gt
 │     ├── train 
 │     │      ├── scene0000_00.txt               
 │     │       ...                     
 │     ├── validation                 
 │     │      ├── scene0011_00.txt                 
 │     │       ...         
 ├── train
 │     ├── 0000_00.npy 
 │      ...
 ├── validation
 │     ├── 0011_00.npy
 │      ...
 └── color_mean_std.yaml <- mean: - 0.478 - 0.430 - 0.375 std: - 0.283 - 0.276 - 0.270  (3 sig. fig.)
 └── label_database.yaml <- 1400 lines of text starting with 1: color: - 174.0 - 199.0 - 232.0 name: wall validation: true
 └── train_database.yaml <- 22819 lines of text  containing color_mean, color_std, file_len, filepath, ..., scene, scene_type, subscene for each scene entry
 └── train_validation_database.yaml   <- 28747 lines of text 
 └── validation_database.yaml <- 5928 lines of text 
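
As a quick sanity check of this layout, I ran something along these lines (a rough sketch based on the structure above; the file names are taken from my folder):

```python
from pathlib import Path
import numpy as np
import yaml

root = Path("data/processed/scannet")  # ScanNet200 pre-processing output folder

# Count the per-scene files produced for the validation split.
val_npy = sorted((root / "validation").glob("*.npy"))
val_gt = sorted((root / "instance_gt" / "validation").glob("*.txt"))
print(f"{len(val_npy)} validation .npy files, {len(val_gt)} validation GT .txt files")

# Spot-check one scene by printing the array shapes.
points = np.load(root / "validation" / "0011_00.npy")
gt_labels = np.loadtxt(root / "instance_gt" / "validation" / "scene0011_00.txt")
print("0011_00.npy shape:", points.shape, "| GT labels:", gt_labels.shape)

# The validation database yaml should have one entry per validation scene.
with open(root / "validation_database.yaml") as f:
    db = yaml.safe_load(f)
print("validation_database.yaml entries:", len(db))
```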

The mask proposal stage yields 148 masks for the provided example scene; I visualized each mask with a different color in the figure below.

[figure: predicted instance masks, each shown in a different color]
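
For reference, the coloring was done roughly along these lines (a sketch; the mask file name and its format, a binary array of shape (num_points, num_masks), are assumptions on my side):

```python
import numpy as np
import open3d as o3d

# Load the scene point cloud and the predicted instance masks.
pcd = o3d.io.read_point_cloud("scene0011_00_vh_clean_2.ply")
masks = np.load("scene0011_00_masks.npy")  # hypothetical file, shape (num_points, num_masks)

# Start from a uniform gray and give every predicted mask a random color.
colors = np.full((len(pcd.points), 3), 0.7)
rng = np.random.default_rng(0)
for m in range(masks.shape[1]):
    colors[masks[:, m] > 0] = rng.random(3)

pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.visualization.draw_geometries([pcd])
```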

Furthermore, save_crops yields reasonable crops.

I still have no idea where the evaluation fails to produce correct results. I suspect the data/processed/scannet part, since I am able to obtain reasonable results when running inference on a single scene.

It would be helpful if you could share your table printout for the evaluation, i.e. the output of run_eval_close_vocab_inst_seg.py. I'm obtaining a lot of NaNs and zeros in the table, which I believe shouldn't be the case in a correct eval run.

Thanks.

aminebdj commented 5 months ago

Hey, @Bozcomlekci. I am also getting the same score on ScanNet200 for some reason, and 9.5 mAP on Replica. Could you please share if you figure out what the problem might be?

thanks

Replica results:

################################################################
what                 :      AP   AP_50%   AP_25%
################################################################
basket               :   0.000    0.000    0.264
bed                  :   0.000    0.000    0.000
bench                :   0.000    0.000    0.000
bin                  :   0.485    0.563    0.565
blanket              :   0.000    0.000    0.362
blinds               :   0.025    0.062    0.246
book                 :   0.000    0.000    0.000
bottle               :   0.000    0.000    0.000
box                  :   0.000    0.000    0.000
bowl                 :   0.000    0.000    0.000
camera               :   0.000    0.000    0.000
cabinet              :   0.370    0.556    0.556
candle               :   0.000    0.000    0.000
chair                :   0.376    0.462    0.462
clock                :   0.000    0.000    0.188
cloth                :   0.116    0.524    0.538
comforter            :   0.000    0.000    0.667
cushion              :   0.133    0.305    0.477
desk                 :   0.000    0.000    0.000
desk-organizer       :   0.121    0.378    0.378
door                 :   0.293    0.332    0.478
indoor-plant         :   0.044    0.133    0.133
lamp                 :   0.066    0.073    0.073
monitor              :   0.000    0.000    0.000
nightstand           :   0.556    0.833    0.833
panel                :   0.000    0.000    0.000
picture              :   0.375    0.375    0.375
pillar               :   0.033    0.150    0.150
pillow               :   0.131    0.362    0.564
pipe                 :   0.000    0.000    0.000
plant-stand          :   0.000    0.000    0.000
plate                :   0.000    0.000    0.000
pot                  :   0.460    0.517    0.517
sculpture            :   0.061    0.273    0.273
shelf                :   0.172    0.516    0.520
sofa                 :   0.287    0.541    0.544
stool                :   0.216    0.216    0.560
switch               :   0.000    0.000    0.000
table                :   0.084    0.114    0.114
tablet               :   0.000    0.000    0.271
tissue-paper         :   0.000    0.000    0.000
tv-screen            :   0.019    0.171    0.575
tv-stand             :   0.000    0.000    0.000
vase                 :   0.135    0.153    0.324
vent                 :   0.000    0.000    0.000
wall-plug            :   0.000    0.000    0.000
window               :   0.000    0.000    0.000
rug                  :   0.000    0.000    0.000
################################################################
average              :   0.095    0.159    0.229

aycatakmaz commented 5 months ago

Hi everyone,

Sorry for the delay - I recently started an internship and I haven't had a lot of time to respond in the past weeks.

@Bozcomlekci I think there is a mismatch in the data. In my data folder for scene scene0011_00, the RGB-D image dimensions are:

data_compressed/color/*.jpg: 640 × 480 pixels (the example scene also has images of this resolution for me)
data_compressed/depth/*.png: 640 × 480 pixels

Your color resolution is different, and as far as I remember we do not resize the images within the script, so this might be messing things up because the image crops end up wrong. They could still look reasonable, but they might not correspond 1-to-1 to the 3D instances due to this resolution mismatch. I have been using a version of the ScanNet dataset that I got from a colleague, so they may have already resized the color images to match the depth image resolution.
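
If your color frames are still at the original 1296 × 968 sensor resolution, one way to bring them in line with my setup would be to resize them to 640 × 480 offline and rescale the color intrinsics by the same factors, roughly like this (a sketch, not something our scripts do for you; the paths are placeholders and the resize overwrites the images in place):

```python
from pathlib import Path
from PIL import Image
import numpy as np

scene_dir = Path("dataset/scans/scene0011_00")  # placeholder scene path
target_w, target_h = 640, 480                   # match the depth resolution

color_frames = sorted((scene_dir / "data_compressed" / "color").glob("*.jpg"))
src_w, src_h = Image.open(color_frames[0]).size  # e.g. 1296 x 968

# Resize every color frame to the depth resolution (in place).
for frame in color_frames:
    with Image.open(frame) as img:
        img.resize((target_w, target_h), Image.BILINEAR).save(frame)

# fx and cx scale with the width ratio, fy and cy with the height ratio.
intr_path = scene_dir / "data" / "intrinsic" / "intrinsic_color.txt"
K = np.loadtxt(intr_path)
K[0, :] *= target_w / src_w
K[1, :] *= target_h / src_h
np.savetxt(intr_path, K)
```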

Regarding the masks: as we noted in the README, running the OpenMask3D ScanNet evaluation on the first scene gives different results than running it on the example scene, because the ScanNet evaluation uses the "eval on segments" configuration, closely following the Mask3D method. This is also why one needs to run the pre-processing script for Mask3D.

For me, the mask proposal stage yields 165 masks for the first scene when running the ScanNet evaluation script with the eval_on_segments option.

Here are the ScanNet200 results I get when I run the eval script: openmask3d_scannet200_final_results.txt

Finally, regarding the NaNs and 0.0s in the evaluation: it is normal to have a few such lines. NaN means that the category does not exist in the validation-set GT at all, whereas 0.0 means that our model fails to correctly identify any instances belonging to that category.
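
In other words, when the per-class APs are averaged, NaN classes are typically skipped rather than counted as zero. Conceptually (a toy illustration, not the actual evaluation code):

```python
import numpy as np

# Toy per-class AP values:
#   NaN -> the category does not appear in the validation-set GT
#   0.0 -> the category appears, but no prediction matched it
per_class_ap = np.array([0.485, np.nan, 0.0, 0.376, np.nan, 0.133])

# NaN classes are excluded from the average; 0.0 classes still pull it down.
print("mean AP:", np.nanmean(per_class_ap))  # 0.2485
```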

Hope this helps, Ayca

Bozcomlekci commented 5 months ago

After changing the color image resolution, there is still a small discrepancy between my evaluation results and yours, but overall they are aligned.

eval.txt

I would appreciate any other suggestions you might have. Is this discrepancy expected?

hxiaoj commented 5 months ago

Hey, @Bozcomlekci. I have the correct resolution and the crops look right, but I get a very different result. Can you help me? inst_res.txt Thanks

aycatakmaz commented 5 months ago

Hi @hxiaoj and @Bozcomlekci,

For performing multiple rounds of SAM iterations, we do several steps of point sampling. I think in this version of the codebase we do not set a seed for this randomized point sampling process, which might cause a small discrepancy from time to time depending on the points that are sampled as input to the SAM model. I would not be too surprised if the numbers are within +/- 1-1.5 AP points of what we reported in the paper.
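
If you want to check how much of the gap comes from this randomness, one thing to try is fixing the seeds before the mask-feature computation, e.g. (a sketch; the released config does not expose such an option as far as I remember):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Fix the Python, NumPy and PyTorch RNGs so the SAM point sampling is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)
# ...then run the per-mask feature computation as usual.
```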

Hope this helps, Ayca