KimMeen / Time-LLM

[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
https://arxiv.org/abs/2310.01728
Apache License 2.0

The purpose of parameter "d_ff"? #59

Closed: genius77777 closed this issue 4 months ago

genius77777 commented 5 months ago

Hi there, what is the purpose of the d_ff parameter? I noticed that it is set to 128 or 32 in the scripts. I would like to understand the rationale for not using all of the output produced by the large language model. Is the choice of 128 the result of a hyperparameter search? If so, what motivated searching over this parameter? If not, why not use the complete output, and why set d_ff to 128? Looking forward to your reply!

xlwang233 commented 4 months ago

Hi @genius77777, I think the author has already answered this in issue #56, where they responded: "This is a common method in deep learning, where we directly perform dimensionality reduction by truncating the dimensions. Alternatively, a linear layer can be used to achieve this transformation. However, to reduce the number of parameters".

However, this is the first time I've seen this kind of operation used to reduce dimensionality. I would like to ask the authors @kwuking @KimMeen whether there is any literature/reference for this truncation operation, or whether you have run any experiments showing that truncation does not hurt performance compared to "using a linear layer" or "using the entire output"? Thanks in advance.

genius77777 commented 4 months ago

Thanks for your answer @xlwang233, but I believe the truncating operation lacks justification. Given that it's a frozen language model, the first 128 dimensions of the output hold no specific significance; they neither represent a patch nor a principal component. If you have any other insights or interpretations, I'm eager to hear them!

xlwang233 commented 4 months ago

@genius77777 Yes I have similar doubts. I'm looking forward to hearing any insights/answers from the authors.

kwuking commented 4 months ago

We express our gratitude for the keen interest in and thoughtful discussion of our work. As elucidated by @xlwang233, direct truncation is employed as a dimensionality-reduction method to strike a balance between computational efficiency and performance. We acknowledge that similar results might be attainable with a Multilayer Perceptron (MLP) or a Linear layer. Concerns have been raised that the resulting information loss could degrade the quality of the large model's output. However, the reprogramming and output modules in our framework are trainable: through backpropagation, they learn to concentrate the useful part of the model's output within the d_ff dimensions. Analogous techniques have recently been adopted in models such as GPT4TS, and Autoformer and TimesNet likewise employ direct truncation when processing frequency information.
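
For concreteness, here is a minimal, hypothetical sketch of what truncation-based dimensionality reduction on a frozen LLM's hidden states can look like; the shapes, hidden_dim = 768, d_ff = 128, and pred_len = 96 are illustrative assumptions, not the repository's exact code.

```python
# Hypothetical sketch of truncation-based dimensionality reduction;
# shapes and values are illustrative only.
import torch
import torch.nn as nn

batch, num_patches, hidden_dim, d_ff = 8, 64, 768, 128

# Output of the frozen LLM backbone: (batch, num_patches, hidden_dim)
llm_out = torch.randn(batch, num_patches, hidden_dim)

# Direct truncation: keep only the first d_ff channels (no extra parameters).
truncated = llm_out[..., :d_ff]                  # (batch, num_patches, d_ff)

# A trainable output head on top of the truncated features; per the rationale
# above, training this head (and the reprogramming layer) while the LLM stays
# frozen is what makes the retained dimensions carry the useful signal.
pred_len = 96
head = nn.Linear(num_patches * d_ff, pred_len)
forecast = head(truncated.flatten(start_dim=1))  # (batch, pred_len)
print(forecast.shape)
```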

A more fundamental question is whether time series analysis actually needs the full output dimensionality provided by large models. Answering it invites a conceptual look at the nature of data from different modalities. We draw on the linear-algebra notion of matrix rank, the maximal number of linearly independent columns, as an indicator of how much information the data actually contains. The textual data amassed worldwide to train Large Language Models (LLMs) undeniably constitutes high-rank data with substantial information content. Time series data, in contrast, despite its apparent volume, consists predominantly of repetitive patterns and noise, making it relatively low-rank compared to textual data.
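
A toy numerical illustration of this rank argument (my own example, not from the paper): sliding windows of a purely periodic series span only a two-dimensional subspace, whereas i.i.d. random data of the same shape is numerically full rank.

```python
# Toy illustration of the low-rank argument: windows of a periodic series
# lie in a 2-D subspace (sin/cos basis), while random data has full rank.
import numpy as np

t = np.arange(512)
series = np.sin(2 * np.pi * t / 24)      # a perfectly repetitive "time series"

window = 64
windows = np.stack([series[i:i + window] for i in range(256)])  # (256, 64)
random_data = np.random.randn(256, window)

print(np.linalg.matrix_rank(windows))      # 2
print(np.linalg.matrix_rank(random_data))  # 64
```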

We postulate that the regularities inherent in time series data form a subset of the patterns found in global textual corpora. From this perspective, not all of the information in an extensively pre-trained language model with a large d_model is needed: for analyzing comparatively small time series datasets, a fraction of it suffices, which makes direct truncation the most straightforward choice. These considerations form our underlying rationale and hypotheses. What ultimately led us to adopt this approach, however, was experimental effectiveness: we tried MLP and Linear Projection for dimensionality reduction and found no substantial difference in final performance compared to direct truncation, so direct truncation was favored for its simplicity and efficiency (see the sketch below). Finally, the value of d_ff itself is chosen by hyperparameter search and depends on the dataset.
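
As a hedged sketch of the learned-projection alternative mentioned above, a Linear layer reaching the same d_ff-dimensional output would look roughly like this; hidden_dim and d_ff are again illustrative assumptions.

```python
# Sketch of the learned-projection alternative to truncation discussed above;
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

hidden_dim, d_ff = 768, 128
llm_out = torch.randn(8, 64, hidden_dim)

proj = nn.Linear(hidden_dim, d_ff)  # hidden_dim * d_ff + d_ff (~98K) extra parameters
projected = proj(llm_out)           # (8, 64, 128)

# Truncation reaches the same shape with zero extra parameters:
truncated = llm_out[..., :d_ff]
assert projected.shape == truncated.shape
```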

xlwang233 commented 4 months ago

@kwuking Thanks for your explanation!

KimMeen commented 4 months ago

Thanks, @kwuking. Please feel free to reopen and let us know if you have further questions/suggestions. 🤗