DAMO-DI-ML / NeurIPS2023-One-Fits-All

The official code for "One Fits All: Power General Time Series Analysis by Pretrained LM (NeurIPS 2023 Spotlight)"

Mismatched classification accuracy #20

Open besaman opened 9 months ago

besaman commented 9 months ago

Hi, regarding the classification task, I have two questions.

  1. You reported an average accuracy of 74.0% on the selected UEA datasets, while I can only get 71.4%. Can you help me achieve performance similar to yours?
  2. Why don't you have a validation split? How do you assess your model? Do you tune for the best performance directly on the test set?
tianzhou2011 commented 9 months ago
  1. The classification results are highly sensitive: even a minor change, such as a different random seed, can produce substantial variance.
  2. We simply adopted the setup from TimesNet to avoid recalculating all the baseline results. Admittedly, this approach is not optimal, but we opted to maintain consistency with the established settings to avoid having to justify any deviations to reviewers.
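
One way to gauge that seed sensitivity is to rerun a single fixed configuration under several seeds and compare the resulting accuracies. The sketch below assumes src/main.py accepts an mvts_transformer-style --seed option and uses arbitrary fixed hyperparameters; both are assumptions to verify against the actual argument parser before use.

# Sketch: rerun one fixed configuration under several seeds to measure the
# spread in accuracy. The --seed flag is assumed (mvts_transformer-style);
# check src/main.py before relying on it.
for seed in 0 1 2 3 4
do
  python src/main.py \
      --output_dir experiments \
      --comment "seed variance check" \
      --name SelfRegulationSCP1 \
      --records_file Classification_records.xls \
      --data_dir ./datasets/SelfRegulationSCP1 \
      --data_class tsra \
      --pattern TRAIN \
      --val_pattern TEST \
      --epochs 50 \
      --lr 0.001 \
      --patch_size 8 \
      --stride 8 \
      --optimizer RAdam \
      --d_model 768 \
      --pos_encoding learnable \
      --task classification \
      --key_metric accuracy \
      --seed $seed
done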

besaman commented 9 months ago

Thanks for your reply.

  1. I understand that time series results are sensitive because the datasets are small. But should the difference be that large, especially given that it is an average over 10 datasets?

  2. I understand your point, but its validity is questionable. I would suggest adjusting the code to include a validation set to keep the evaluation sound, especially since the results are hard to reproduce anyway; one possible sketch follows.
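
If the training code retains mvts_transformer-style options, a --val_ratio flag may already exist to hold out part of TRAIN for validation instead of pointing --val_pattern at TEST. The sketch below is written under that assumption; check src/main.py for the flag's existence and exact name.

# Sketch: validate on a 20% holdout of TRAIN rather than on TEST, so the
# test set is touched only once at the end. --val_ratio is assumed
# (mvts_transformer-style) and must be verified against the argument parser.
python src/main.py \
    --output_dir experiments \
    --comment "classification with held-out validation" \
    --name SelfRegulationSCP1 \
    --records_file Classification_records.xls \
    --data_dir ./datasets/SelfRegulationSCP1 \
    --data_class tsra \
    --pattern TRAIN \
    --val_ratio 0.2 \
    --epochs 50 \
    --lr 0.001 \
    --patch_size 8 \
    --stride 8 \
    --optimizer RAdam \
    --d_model 768 \
    --pos_encoding learnable \
    --task classification \
    --key_metric accuracy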

PSacfc commented 9 months ago

Because some of the datasets have small sample sizes, the task itself has some inherent randomness. Additionally, apart from the learning rate, the patch size and stride settings also affect the final results. We will update the script files soon.

besaman commented 8 months ago

We will update the script files soon.

When can I expect to find these updates?

PSacfc commented 8 months ago

As the year-end approaches, we have been quite busy, so we haven't had a chance to update yet. In the meantime, you can search over the parameters following the example below (SelfRegulationSCP1) to obtain results that meet the accuracy reported in the paper.

# Grid search over learning rate, patch size, and stride for SelfRegulationSCP1;
# the best run is selected by --key_metric (accuracy on the TEST split used
# here as the validation pattern).
for lr in 0.002 0.001 0.0005 0.0001
do
  for patch in 16 8 4 2 1
  do
    for stride in 16 8 4 2 1
    do
      python src/main.py \
          --output_dir experiments \
          --comment "classification from Scratch" \
          --name SelfRegulationSCP1 \
          --records_file Classification_records.xls \
          --data_dir ./datasets/SelfRegulationSCP1 \
          --data_class tsra \
          --pattern TRAIN \
          --val_pattern TEST \
          --epochs 50 \
          --lr $lr \
          --patch_size $patch \
          --stride $stride \
          --optimizer RAdam \
          --d_model 768 \
          --pos_encoding learnable \
          --task classification \
          --key_metric accuracy
    done
  done
done
[Screenshot (2024-01-01): parameter search results]
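
The same sweep can be retargeted at any other UEA dataset by changing only the dataset name, as in the sketch below; the ./datasets/$DATASET directory layout is an assumption that mirrors the SelfRegulationSCP1 example above.

# Same grid search, parameterized by dataset name; only --name and --data_dir
# change. The ./datasets/$DATASET path is an assumed layout.
DATASET=JapaneseVowels
for lr in 0.002 0.001 0.0005 0.0001
do
  for patch in 16 8 4 2 1
  do
    for stride in 16 8 4 2 1
    do
      python src/main.py \
          --output_dir experiments \
          --comment "classification from Scratch" \
          --name $DATASET \
          --records_file Classification_records.xls \
          --data_dir ./datasets/$DATASET \
          --data_class tsra \
          --pattern TRAIN \
          --val_pattern TEST \
          --epochs 50 \
          --lr $lr \
          --patch_size $patch \
          --stride $stride \
          --optimizer RAdam \
          --d_model 768 \
          --pos_encoding learnable \
          --task classification \
          --key_metric accuracy
    done
  done
done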
RaptorMai commented 2 months ago

Hi, I tried to reproduce the classification results, and most of the scripts you provided work well, except for JapaneseVowels. The reported accuracy is 98.6, but I can only get 82.4. The discrepancy is quite large. Could you help take a look at this issue?