IDEA-Research / TAPTR

[ECCV 2024] Official implementation of the paper "TAPTR: Tracking Any Point with Transformers as Detection"
https://taptr.github.io

Training stops when trying to train on PointOdyssey #8

Closed lukaboljevic closed 3 weeks ago

lukaboljevic commented 1 month ago

Hello! First of all, thank you for your awesome work.

I am trying to train TAPTR on PointOdyssey, but I'm running into an issue during the first epoch - I believe it happens as soon as the first sample passes through the model. Here are the relevant lines from the log:

Log ``` ... Start training Loss is nan, stopping training {'pt_full_cardinality_error_dn': tensor(0., device='cuda:0') 'pt_full_cardinality_error_dn_0': tensor(0., device='cuda:0') 'pt_full_cardinality_error_dn_1': tensor(0., device='cuda:0') 'pt_full_cardinality_error_dn_2': tensor(0., device='cuda:0') 'pt_full_cardinality_error_dn_3': tensor(0., device='cuda:0') 'pt_full_cardinality_error_dn_4': tensor(0., device='cuda:0') 'pt_full_loss_bbox': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_0': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_1': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_2': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_3': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_4': tensor(nan, device='cuda:0') 'pt_full_loss_bbox_dn': tensor(0., device='cuda:0') 'pt_full_loss_bbox_dn_0': tensor(0., device='cuda:0') 'pt_full_loss_bbox_dn_1': tensor(0., device='cuda:0') 'pt_full_loss_bbox_dn_2': tensor(0., device='cuda:0') 'pt_full_loss_bbox_dn_3': tensor(0., device='cuda:0') 'pt_full_loss_bbox_dn_4': tensor(0., device='cuda:0') 'pt_full_loss_ce': tensor(1.0215, device='cuda:0') 'pt_full_loss_ce_0': tensor(1.0736, device='cuda:0') 'pt_full_loss_ce_1': tensor(1.2466, device='cuda:0') 'pt_full_loss_ce_2': tensor(1.1136, device='cuda:0') 'pt_full_loss_ce_3': tensor(1.2177, device='cuda:0') 'pt_full_loss_ce_4': tensor(1.2346, device='cuda:0') 'pt_full_loss_ce_dn': tensor(0., device='cuda:0') 'pt_full_loss_ce_dn_0': tensor(0., device='cuda:0') 'pt_full_loss_ce_dn_1': tensor(0., device='cuda:0') 'pt_full_loss_ce_dn_2': tensor(0., device='cuda:0') 'pt_full_loss_ce_dn_3': tensor(0., device='cuda:0') 'pt_full_loss_ce_dn_4': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn_0': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn_1': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn_2': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn_3': tensor(0., device='cuda:0') 'pt_full_loss_giou_dn_4': tensor(0., device='cuda:0') 'pt_full_loss_hw': tensor(nan, device='cuda:0') 'pt_full_loss_hw_0': tensor(nan, device='cuda:0') 'pt_full_loss_hw_1': tensor(nan, device='cuda:0') 'pt_full_loss_hw_2': tensor(nan, device='cuda:0') 'pt_full_loss_hw_3': tensor(nan, device='cuda:0') 'pt_full_loss_hw_4': tensor(nan, device='cuda:0') 'pt_full_loss_hw_dn': tensor(0., device='cuda:0') 'pt_full_loss_hw_dn_0': tensor(0., device='cuda:0') 'pt_full_loss_hw_dn_1': tensor(0., device='cuda:0') 'pt_full_loss_hw_dn_2': tensor(0., device='cuda:0') 'pt_full_loss_hw_dn_3': tensor(0., device='cuda:0') 'pt_full_loss_hw_dn_4': tensor(0., device='cuda:0') 'pt_full_loss_xy': tensor(nan, device='cuda:0') 'pt_full_loss_xy_0': tensor(nan, device='cuda:0') 'pt_full_loss_xy_1': tensor(nan, device='cuda:0') 'pt_full_loss_xy_2': tensor(nan, device='cuda:0') 'pt_full_loss_xy_3': tensor(nan, device='cuda:0') 'pt_full_loss_xy_4': tensor(nan, device='cuda:0') 'pt_full_loss_xy_dn': tensor(0., device='cuda:0') 'pt_full_loss_xy_dn_0': tensor(0., device='cuda:0') 'pt_full_loss_xy_dn_1': tensor(0., device='cuda:0') 'pt_full_loss_xy_dn_2': tensor(0., device='cuda:0') 'pt_full_loss_xy_dn_3': tensor(0., device='cuda:0') 'pt_full_loss_xy_dn_4': tensor(0., device='cuda:0') 'pt_window_cardinality_error_dn': tensor(0., device='cuda:0') 'pt_window_cardinality_error_dn_0': tensor(0., device='cuda:0') 'pt_window_cardinality_error_dn_1': tensor(0., device='cuda:0') 'pt_window_cardinality_error_dn_2': tensor(0., device='cuda:0') 'pt_window_cardinality_error_dn_3': tensor(0., device='cuda:0') 
'pt_window_cardinality_error_dn_4': tensor(0., device='cuda:0') 'pt_window_loss_bbox': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_0': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_1': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_2': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_3': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_4': tensor(nan, device='cuda:0') 'pt_window_loss_bbox_dn': tensor(0., device='cuda:0') 'pt_window_loss_bbox_dn_0': tensor(0., device='cuda:0') 'pt_window_loss_bbox_dn_1': tensor(0., device='cuda:0') 'pt_window_loss_bbox_dn_2': tensor(0., device='cuda:0') 'pt_window_loss_bbox_dn_3': tensor(0., device='cuda:0') 'pt_window_loss_bbox_dn_4': tensor(0., device='cuda:0') 'pt_window_loss_ce': tensor(0.0751, device='cuda:0') 'pt_window_loss_ce_0': tensor(0.0818, device='cuda:0') 'pt_window_loss_ce_1': tensor(0.0988, device='cuda:0') 'pt_window_loss_ce_2': tensor(0.0852, device='cuda:0') 'pt_window_loss_ce_3': tensor(0.0929, device='cuda:0') 'pt_window_loss_ce_4': tensor(0.0972, device='cuda:0') 'pt_window_loss_ce_dn': tensor(0., device='cuda:0') 'pt_window_loss_ce_dn_0': tensor(0., device='cuda:0') 'pt_window_loss_ce_dn_1': tensor(0., device='cuda:0') 'pt_window_loss_ce_dn_2': tensor(0., device='cuda:0') 'pt_window_loss_ce_dn_3': tensor(0., device='cuda:0') 'pt_window_loss_ce_dn_4': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn_0': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn_1': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn_2': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn_3': tensor(0., device='cuda:0') 'pt_window_loss_giou_dn_4': tensor(0., device='cuda:0') 'pt_window_loss_hw': tensor(nan, device='cuda:0') 'pt_window_loss_hw_0': tensor(nan, device='cuda:0') 'pt_window_loss_hw_1': tensor(nan, device='cuda:0') 'pt_window_loss_hw_2': tensor(nan, device='cuda:0') 'pt_window_loss_hw_3': tensor(nan, device='cuda:0') 'pt_window_loss_hw_4': tensor(nan, device='cuda:0') 'pt_window_loss_hw_dn': tensor(0., device='cuda:0') 'pt_window_loss_hw_dn_0': tensor(0., device='cuda:0') 'pt_window_loss_hw_dn_1': tensor(0., device='cuda:0') 'pt_window_loss_hw_dn_2': tensor(0., device='cuda:0') 'pt_window_loss_hw_dn_3': tensor(0., device='cuda:0') 'pt_window_loss_hw_dn_4': tensor(0., device='cuda:0') 'pt_window_loss_xy': tensor(nan, device='cuda:0') 'pt_window_loss_xy_0': tensor(nan, device='cuda:0') 'pt_window_loss_xy_1': tensor(nan, device='cuda:0') 'pt_window_loss_xy_2': tensor(nan, device='cuda:0') 'pt_window_loss_xy_3': tensor(nan, device='cuda:0') 'pt_window_loss_xy_4': tensor(nan, device='cuda:0') 'pt_window_loss_xy_dn': tensor(0., device='cuda:0') 'pt_window_loss_xy_dn_0': tensor(0., device='cuda:0') 'pt_window_loss_xy_dn_1': tensor(0., device='cuda:0') 'pt_window_loss_xy_dn_2': tensor(0., device='cuda:0') 'pt_window_loss_xy_dn_3': tensor(0., device='cuda:0') 'pt_window_loss_xy_dn_4': tensor(0., device='cuda:0')} ```

When I was preparing the PointOdyssey training videos, I followed the implementation of __getitem__ in PointTrackingDataset in kubric.py, so that the samples and targets returned by my PointOdysseyDataset have the same format as those from PointTrackingDataset. I'm fairly certain everything matches. Other than that, there are no other changes of significance, and config/TAPTR.py has not been modified in any way.
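For context, my __getitem__ returns data shaped roughly like the sketch below. The keys, tensor shapes, and the (sample, target) convention are only illustrative of how I structured my own PointOdysseyDataset; they are not necessarily the exact names kubric.py uses:

```python
import torch
from torch.utils.data import Dataset

class PointOdysseyDataset(Dataset):
    """Rough sketch only; field names and shapes are illustrative."""

    def __init__(self, sequences):
        # pre-cut clips: 56 frames each, with 128 tracked points per clip
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        sample = {
            "video": torch.as_tensor(seq["rgbs"], dtype=torch.float32),  # (56, 3, H, W)
        }
        target = {
            "points": torch.as_tensor(seq["trajs"], dtype=torch.float32),    # (56, 128, 2), (x, y)
            "visibility": torch.as_tensor(seq["visibs"], dtype=torch.bool),  # (56, 128)
        }
        return sample, target
```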

Here's the command I used to launch the training (2 H100 80GB GPUs are used):

```
python -m torch.distributed.launch --nproc_per_node=2 main.py \
    -c config/TAPTR.py \
    --dataset_file point_odyssey \
    --data_path /path/to/pointodyssey \
    --output_dir logs/train_taptr \
    --num_workers 2 \
    --options num_samples_per_video=56 num_queries_per_video=128
```

num_samples_per_video=56 and num_queries_per_video=128 are set that way because the PointOdyssey training sequences I prepared are 56 frames long, each with 128 points. For this run I only used 10 sequences, just to check that everything runs end to end.

Here are the raw requirements I installed in a Python 3.10 environment (that's what's available on the cluster), using CUDA 12.2.2:

Requirements:

```
torch==2.3.1
torchvision==0.18.1
numpy==1.26.4
tqdm
opencv-python
moviepy
mediapy
matplotlib
gradio
gradio-image-prompter
timm
scipy
# MultiScaleDeformableAttention like in deformable DETR
addict
yapf==0.40.1  # https://github.com/open-mmlab/mmdetection/issues/10962
pycocotools
termcolor
albumentations
tensorboard
```

Yes, I am aware that most of my package versions differ from the ones the repo specifies, but I couldn't find any reason why they wouldn't work.

Do you have any ideas about what I should check, or why this is happening? I would love to run training in debug mode, but I don't think that's possible on an HPC cluster. Thank you in advance!

LHY-HongyangLi commented 1 month ago

Hi @lukaboljevic, thank you for your long-term attention. The logs show that the training process stops because it gets a loss that is NaN: the classification losses are normal (the pt_full_loss_ce_xxx entries are finite numbers), but the location-regression losses (pt_full_loss_xy_xxx) are NaN.

Therefore, I think the issue might be in the process of updating the point positions in the decoder. You can set a breakpoint with ipdb or pudb at the line of code linked below and step through to trace where the NaN first appears. https://github.com/IDEA-Research/TAPTR/blob/fcef6a9305ad5d9be7467c884a939a61447138bb/models/dino/deformable_transformer.py#L1030
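For example, a minimal way to drop into a debugger there and spot the first non-finite tensor is sketched below; ipdb has to be pip-installed separately, and the helper's arguments are just placeholders for whatever intermediate tensor you want to inspect:

```python
import torch

# Option 1: drop into an interactive debugger right before the position update
# (place this inside the decoder forward pass, at the linked line), then step
# with `n` / `s` and inspect tensors.
import ipdb; ipdb.set_trace()

# Option 2: a small helper to call on intermediate tensors while stepping;
# it raises as soon as a tensor contains NaN or Inf.
def check_finite(name: str, t: torch.Tensor) -> None:
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")
```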

lukaboljevic commented 1 month ago

Thanks for your quick answer. I will have a look in the coming days and let you know what I find.

lukaboljevic commented 1 month ago

I figured out the problem: I forgot to normalize the point coordinates to the range [0, 1] before returning the sample. Once I corrected this, the error went away.
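In case it helps anyone else, the fix amounted to something like the snippet below (the names come from my own dataset code, so treat them as illustrative):

```python
import torch

def normalize_points(points: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Scale (..., 2) pixel coordinates from [0, W) x [0, H) into [0, 1]."""
    scale = torch.tensor([width, height], dtype=points.dtype, device=points.device)
    return points / scale

# e.g. for 56 frames x 128 points of (x, y) pixel coordinates:
# target["points"] = normalize_points(target["points"], width=W, height=H)
```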

Before I close the issue, I have a few more questions about training. My training set currently consists of 24190 sequences (though I can use fewer), each 56 frames long with 128 points.

Thanks in advance!

LHY-HongyangLi commented 1 month ago

Hi @lukaboljevic, I'm sorry for the late reply.

  1. In fact, we stop the training process at about 120 epochs. If you only train for 7-8 epochs, the model may not have converged. My suggestion is to keep resuming the training every time your job is killed by the cluster (our training script can automatically resume from the latest checkpoint) and find out at which epoch the model converges; see the command sketch after this list.
  2. In theory, yes. However, some hyperparameters might need adjustment to get optimal results from the network; for example, the data augmentation settings and other factors might need to be tuned.
  3. You can decrease the number of encoder/decoder layers to 1 or 2, decrease the number of point queries from 800 to 128, and also lower the input resolution to reduce the memory requirements.
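As a rough sketch of what (1) and (3) could look like on the command line: the --resume flag and checkpoint path follow the usual DINO-style training scripts, and the enc_layers / dec_layers keys are assumptions, so please check main.py and config/TAPTR.py for the exact names.

```
# Resume from the latest checkpoint in the output directory (assumed flag/path),
# while shrinking the model via config overrides to reduce memory usage.
python -m torch.distributed.launch --nproc_per_node=2 main.py \
    -c config/TAPTR.py \
    --dataset_file point_odyssey \
    --data_path /path/to/pointodyssey \
    --output_dir logs/train_taptr \
    --resume logs/train_taptr/checkpoint.pth \
    --options enc_layers=2 dec_layers=2 num_queries_per_video=128
# The input resolution can be lowered the same way via its config key.
```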

lukaboljevic commented 3 weeks ago

Sorry for the late reply. Thank you for the tips, I'll try them out!