Ennosigaeon / auto-sktime


results in Table 5 #1

Open kashif opened 8 months ago

kashif commented 8 months ago

Hi Marc,

Would you have any insight or intuition as to why the results with auto-sktime are typically an order of magnitude better than, say, AutoGluon?

Thanks

Ennosigaeon commented 8 months ago

Hey,

I can't give you a definitive answer, but I can offer an educated guess. Our benchmark focuses largely on rather small datasets with short time series. As a consequence, methods relying on ML or neural networks often do not perform well due to the very limited training data. In contrast, methods using statistical forecasting models, like AutoTS or PyAF, often perform much better. The advantage of auto-sktime over baseline methods focusing on statistical models lies in its robustness, i.e., handling time series with various defects and honoring the requested computational budget.
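To make the point about short series a bit more concrete, here is a minimal sketch (not the benchmark code; AirPassengers and the forecast horizon are only illustrative) of the kind of plain statistical baseline that tends to hold up well with very little training data, using sktime, which auto-sktime builds on:

```python
from sktime.datasets import load_airline
from sktime.forecasting.ets import AutoETS

y = load_airline()                      # 144 monthly observations (AirPassengers)
forecaster = AutoETS(auto=True, sp=12)  # automatic ETS model selection, monthly seasonality
forecaster.fit(y)
y_pred = forecaster.predict(fh=list(range(1, 13)))  # forecast the next 12 months
print(y_pred)
```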

I hope this helps to answer your question.

Just as a side note: I was also a bit surprised that AutoGluon performed quite badly, considering that its performance on tabular data is really outstanding. But I never had the time to really dig into the implementation specifics of AutoGluon's time series processing, as it just worked out of the box. Some of the other methods were a bit harder to configure accordingly, and I had to dig into the code to ensure a fair benchmark.

kashif commented 8 months ago

I see... even with tiny datasets one can cook up a tiny model; e.g., with a tiny RNN on AirPassengers I remember getting a MASE of 0.4x, while in the table it's around 3.x for DeepAR, which is what prompted me to ask...
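For reference, MASE is the forecast MAE scaled by the in-sample MAE of a (seasonal) naive forecast, so values below 1 beat that naive scale. A small sketch with sktime; AirPassengers and the hold-out split are only illustrative, not the benchmark setup:

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsoluteScaledError

y = load_airline()
y_train, y_test = y.iloc[:-12], y.iloc[-12:]          # hold out the last year

forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive forecast
forecaster.fit(y_train)
y_pred = forecaster.predict(fh=list(range(1, 13)))

mase = MeanAbsoluteScaledError(sp=12)
print(mase(y_test, y_pred, y_train=y_train))          # < 1 means better than seasonal naive
```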

I can dig around more too to see... yeah, we could also ask the AutoGluon time series folks... Great work, in any case; it's always good to investigate and figure things out!

Ennosigaeon commented 8 months ago

Regarding DeepAR and also TFT: I used the implementations from pytorch_forecasting based on the examples (https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html) with default hyperparameters. I am sure one can achieve much better performance by manually tuning the network architecture. Yet, we did not aim to compare an automated approach with a human expert; we just wanted to add additional baselines for more context.
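For context, here is a minimal sketch of that kind of default setup with pytorch_forecasting; the DataFrame `df` and the column names ("time_idx", "series", "value") are placeholders, and the encoder/prediction lengths are illustrative rather than the exact benchmark configuration:

```python
import pytorch_lightning as pl  # newer pytorch_forecasting versions use `lightning.pytorch`
from pytorch_forecasting import DeepAR, TimeSeriesDataSet

# `df` is assumed to be a long-format DataFrame with one row per observation and
# columns "time_idx" (int), "series" (group id), and "value" (target).
training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=24,
    max_prediction_length=12,
)
train_dataloader = training.to_dataloader(train=True, batch_size=64)

model = DeepAR.from_dataset(training)   # default hyperparameters, no manual tuning
trainer = pl.Trainer(max_epochs=30, gradient_clip_val=0.1)
trainer.fit(model, train_dataloaders=train_dataloader)
```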

kashif commented 8 months ago

Unrelated: it's nice that you have a collection of time series datasets. Can I get that from the auto-sktime repo too?

Ennosigaeon commented 8 months ago

Yes, I have added all the raw data I collected from the internet to the repo and converted it to a uniform format: https://github.com/Ennosigaeon/auto-sktime/tree/main/autosktime/data/benchmark/data/timeseries

kashif commented 8 months ago

perfect thanks!

Ennosigaeon commented 8 months ago

So, your question got me wondering about the results, and I decided to dig a bit into the results of one random dataset where AutoGluon has worse performance (ec2_cpu_utilization_53ea38) on my own. In Table 5, AutoGluon has a reported performance of 14.29, yet this is an error on my side. The raw performance is actually 0.31, pushing it much closer to the 0.28 of auto-sktime. The actual problem for AutoGluon on this dataset is the runtime: for this specific dataset, AutoGluon has an average runtime of 650 sec, double the requested timeout. Therefore, all of its evaluations are pruned and assigned the worst result of any competitor (DeepAR) plus a penalty.
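To make the pruning behaviour explicit, here is a hypothetical sketch of the rule described above; the function name and the penalty value are illustrative and not taken from the paper:

```python
def penalized_score(raw_score: float, runtime: float, timeout: float,
                    worst_competitor_score: float, penalty: float) -> float:
    """Keep the raw score for runs within budget; otherwise assign the worst
    competitor's score plus a penalty (the case AutoGluon hit on this dataset)."""
    if runtime <= timeout:
        return raw_score
    return worst_competitor_score + penalty
```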

I will upload a revised version of the paper with Tables 5, 6, and 7 fixed. The results in the main paper are not affected by this error, as we explicitly discuss the impact of resource budgets; it is just that I did not report the raw results as claimed... Thank you for actually asking about the surprising results and making me think about it. Sorry for the confusion.

kashif commented 8 months ago

Ah, great! No need to apologize; it's how science happens. As mentioned, it's great work and I look forward to more insights!