ZzZZCHS / Chat-Scene

Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
MIT License

Can't reproduce the result #38

Closed WinterCodeForEverything closed 3 months ago

WinterCodeForEverything commented 3 months ago

I ran the training and evaluation code and found that the results on ScanRefer and Multi3DRefer are extremely bad. I'm not sure whether there is a bug in the code.

[Screenshot 2024-08-07 at 22:42:44]

For example, I'm curious why obj_id is set differently for training and evaluation in prepare_scanrefer_annos.py. Is this a bug?

[Screenshot 2024-08-07 at 22:45:29]
ZzZZCHS commented 3 months ago

Did you change any training config, such as the GPU number? Since the learning rate is multiplied by the GPU number here, we find that the results are currently not reproducible when setting the GPU number to 8 (we got similar results to yours, with very low grounding performance).
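For reference, a minimal sketch of the scaling rule being described (the base learning rate below is a made-up placeholder, not the value from the repo's config):

```python
# Linear LR scaling with the number of GPUs: an 8-GPU run trains with
# twice the effective learning rate of a 4-GPU run, all else being equal.
base_lr = 5e-5  # hypothetical per-GPU base LR
for num_gpus in (4, 8):
    effective_lr = base_lr * num_gpus
    print(f"{num_gpus} GPUs -> effective lr = {effective_lr:.1e}")
```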

If you did not change any config, maybe you can try a lower max_grad_norm here. We use max_grad_norm=0.01 in recent experiments and it seems to be more stable.
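A minimal sketch of where max_grad_norm enters a standard PyTorch training step (placeholder model, optimizer, and data; not the repository's training loop):

```python
import torch

max_grad_norm = 0.01  # the lower clipping threshold suggested above

model = torch.nn.Linear(8, 1)                               # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # placeholder optimizer
x, y = torch.randn(4, 8), torch.randn(4, 1)                 # placeholder batch

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip the global gradient norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
optimizer.zero_grad()
```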

This code snippet has no bug. Actually, the "obj_id" in the train annotations is not used during training, while the "obj_id" in the val annotations represents the id of the GT object in the GT segmentations, which is used to retrieve the GT bbox for calculating the IoU.
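For illustration, a minimal sketch of that lookup (made-up boxes and a simple axis-aligned 3D IoU; not the repository's evaluation code):

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU for boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)

gt_boxes = {3: np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])}  # obj_id -> GT box from the GT segmentation
pred_box = np.array([0.1, 0.0, 0.0, 1.1, 1.0, 1.0])       # predicted box for the referred object
iou = box_iou_3d(gt_boxes[3], pred_box)                    # thresholded for Acc@0.25 / Acc@0.5
print(f"IoU = {iou:.3f}")
```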

ZCMax commented 3 months ago

That's an interesting phenomenon. It seems like doubling the GPU number from 4 to 8 has a huge impact on the grounding tasks? Just because of the 2x learning rate?

ZzZZCHS commented 3 months ago

The grounding and captioning tasks are all based on object ids. So the low performance indicates that the object ids are not well trained, and we can't directly observe this from the loss curve (the loss curve looks normal even when the model produces poor grounding results). We've done some ablation studies but still haven't found the root cause. Basically, I think a possible reason is data scale: the current data (in both amount and diversity) is not enough for the trainable weights, so the model may be trained in a wrong direction that neglects the learning of object ids (a process that could be sensitive to the learning rate). So we plan to add more data in the future, especially data designed for learning object ids.

Nonetheless, it's not that hard to reproduce the reported results if you stick to our training config.

WinterCodeForEverything commented 3 months ago

Thanks for your explanation. I didn't change the GPU number, but I made the following change in run.sh because I can't install Slurm:

[Screenshot 2024-08-08 at 10:25:02]

During the evaluation stage it warns:

WARNING 2024-08-07T14:34:51 | py.warnings: /data/projects/15003900/Chat-3D-v2/utils/distributed.py:18: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. builtin_warn(*args, **kwargs)

Is this the reason?
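For reference, this warning comes from the transformers generation code: with do_sample=False, decoding is deterministic and top_p is simply ignored, so it should not affect the generated outputs. A standalone sketch of how such a warning can be avoided (toy public model, not the repository's; exact behavior may vary across transformers versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")          # tiny public toy model
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

inputs = tok("hello", return_tensors="pt")
# With do_sample=False, top_p is unused; leaving it unset avoids the warning.
out = model.generate(**inputs, max_new_tokens=5, do_sample=False, top_p=None)
print(tok.decode(out[0]))
```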

ZzZZCHS commented 3 months ago

That looks fine. Can I see your training log (the train.log file under the output directory)?

I'm re-training this code today to see if it can be reproduced in my environment now. Maybe we can compare the training logs to find the difference.

WinterCodeForEverything commented 3 months ago

Oh, I changed the number of training epochs from 3 to 2, and the learning rate seems to drop faster than before. Here is the training log: train.log. But I also trained the code for 3 epochs with add_scene_token=True and got a similar result (very poor grounding performance). Here is that training log: train.log.
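(A possible explanation for the faster LR drop, sketched with a generic cosine schedule and made-up numbers rather than the repository's actual scheduler: if the schedule is defined over the total number of training steps, cutting the epochs from 3 to 2 shrinks that total, so the LR decays faster at every step.)

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

steps_per_epoch = 1000  # hypothetical
for epochs in (3, 2):
    total_steps = epochs * steps_per_epoch
    print(f"{epochs} epochs: lr at step 1000 = {cosine_lr(1000, total_steps):.2e}")
```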

It seems I can't change any setting in the config. Maybe I should try max_grad_norm=0.01 to make it more stable. If it's still so unpredictable, maybe there are some unsolved problems in the code or in the design?

ZzZZCHS commented 3 months ago

I used the original config and have trained it for 2 epochs (out of 3). The results are reproducible in my environment. train.log

add_scene_token should be set to False by default, since we didn't use it in our experiments and we didn't mention it in the readme instructions.

I recommend training with the default settings first. If it is still not reproducible in your environment, we can further help you find the problem.

WinterCodeForEverything commented 3 months ago

Thanks for your patience. I reproduced the results with the default settings. It's good work.