SalesforceAIResearch / uni2ts

Unified Training of Universal Time Series Forecasting Transformers
Apache License 2.0

Earlystop when using Finetune CLI? #6

Closed by zqiao11 6 months ago

zqiao11 commented 8 months ago

Hi, thank you for the great work! I ran into a problem when running the fine-tuning example with the given CLI:

python -m cli.finetune \
  run_name=example_run \
  model=moirai_1.0_R_small \
  data=etth1 \
  val_data=etth1

The experiment terminated after several epochs, and I got the following results:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-03-27 17:36:10,732][datasets][INFO] - PyTorch version 2.2.1 available.
[2024-03-27 17:36:10,732][datasets][INFO] - JAX version 0.4.25 available.
Seed set to 1
/home/eee/qzz/uni2ts/venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:653: Checkpoint directory /home/eee/qzz/uni2ts/outputs/finetune/moirai_1.0_R_small/etth1/example_run/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type         | Params
----------------------------------------
0 | module | MoiraiModule | 13.8 M
----------------------------------------
13.8 M    Trainable params
0         Non-trainable params
13.8 M    Total params
55.310    Total estimated model params size (MB)
Epoch 7: |                                                                                                    | 100/? [01:21<00:00,  1.22it/s, v_num=1, val_loss=9.080, PackedNLLLoss=-0.241]
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Could you please suggest how to fix this problem? Thank you!

liu-jc commented 8 months ago

Hi @zqiao11, does the output only show the warning, with no error? Most likely the fine-tuning process has finished. Could you check the outputs folder and see whether you can find a checkpoint?

zqiao11 commented 8 months ago

Hi @liu-jc. Yes, the output only shows the warning, and then the program terminates. There are checkpoints in the outputs folder. For example, one checkpoint is named epoch=4-step=500.ckpt.

Is that normal? How many epochs should it run with the CLI's default configuration?

liu-jc commented 8 months ago

Hi @zqiao11, I think that is normal. The default maximum number of epochs is 100, as defined in conf/finetune/default.yaml. See below (updated; sorry, the previous link was not correct): https://github.com/SalesforceAIResearch/uni2ts/blob/8e07e899716c970787e9f2224e847c66c59d3eaf/cli/conf/finetune/default.yaml#L40 The program was terminated by early stopping, not by reaching that limit. Feel free to change the maximum number of epochs.

liu-jc commented 8 months ago

@zqiao11, thanks for pointing this out. This is a good point. We will consider printing more information when fine-tuning finishes normally.

cc @gorold, what do you think about this? I can print more information at the end of the fine-tune file (e.g., that fine-tuning has finished and the checkpoint can be found in the xxx folder). I was also a bit confused when I first ran into this. I will submit a PR for you to review.

zqiao11 commented 8 months ago

Thanks for your prompt reply, @liu-jc. I just took a closer look at default.yaml for fine-tuning.

Yes, this seems normal, since the early-stopping patience is 3. The program terminated after epoch 7 and the checkpoint is from epoch 4, so val_loss did not improve for the 3 epochs after it (5, 6, and 7).
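The patience arithmetic above can be checked with a small simulation. This is only a sketch of the usual "stop after N checks without improvement" rule, not the Lightning implementation itself. The function name and the loss values are hypothetical, chosen so that epoch 4 is the best one.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) when minimising a validation metric.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    wait_count = 0  # consecutive validation checks without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait_count = loss, epoch, 0
        else:
            wait_count += 1
            if wait_count >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Hypothetical validation losses (not taken from the run in this thread).
losses = [12.0, 11.5, 11.0, 10.8, 10.4, 10.6, 10.9, 11.2, 11.5]
best_epoch, stop_epoch = simulate_early_stopping(losses, patience=3)
print(best_epoch, stop_epoch)  # 4 7
```

With a patience of 3 and the best loss at epoch 4, the run stops after epoch 7, which matches the progress bar and the checkpoint name reported above.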

A small improvement suggestion: it would be better if a message could be printed at the experiment's conclusion, indicating that the program has successfully finished due to early stopping :))

gorold commented 8 months ago

Does adding verbose=True to the Lightning EarlyStopping callback work?

zqiao11 commented 8 months ago

Yes, it works. It prints the following message: Monitored metric val_loss did not improve in the last 3 records. Best score: 10.386. Signaling Trainer to stop.

But I still think it would be nice to include an additional message indicating that the entire process has finished. That would be friendlier to beginners who are not familiar with the code.
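A minimal sketch of what such a closing message could look like is shown below. The helper name and its arguments are hypothetical and are not part of uni2ts. It also assumes that a run ending before the epoch limit was ended by early stopping, which would not hold for a manually interrupted run.

```python
def format_finish_message(last_epoch, max_epochs, best_checkpoint):
    """Build a one-line summary of how a fine-tuning run ended.

    last_epoch is 0-indexed, as in the Lightning progress bar.
    """
    # Assumption: finishing before the epoch limit means early stopping fired.
    if last_epoch + 1 < max_epochs:
        reason = f"early stopping after epoch {last_epoch}"
    else:
        reason = f"reaching the limit of {max_epochs} epochs"
    return f"Fine-tuning finished ({reason}). Best checkpoint: {best_checkpoint}"


# Values taken from the run discussed in this thread.
print(format_finish_message(7, 100, "epoch=4-step=500.ckpt"))
```

In a real script the three values would have to come from the trainer and the checkpoint callback after training returns. How that is wired up in the fine-tune script is left open here.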