AI4Finance-Foundation / FinGPT

FinGPT: Open-Source Financial Large Language Models! Revolutionize 🔥 We release the trained model on HuggingFace.
https://ai4finance.org
MIT License

Datasets used in the fine-tuning process #68

Open itlittlekou opened 1 year ago

itlittlekou commented 1 year ago

I've noticed that while creating the dataset, the news headlines and the news content were separated, so there are distinct training and test sets for headlines as well as for content. However, during fine-tuning only the headline dataset was used; the content dataset was not. I'm therefore unsure what role the news content dataset plays in the fine-tuning process. [screenshots attached]

oliverwang15 commented 1 year ago

Hi, itlittlekou. You are right! For time and cost reasons, we used only the news headlines in our experiment. Since the news content carries most of the information, the ideal approach would be to use both the headlines and the content.

However, in our experiment we need to concatenate all the news related to a given stock over a certain time window, such as one day, so the token count can become very large and training becomes quite difficult. A good compromise might be to use the title together with a summary; using just the title, or just a summary, might also work. If you have a better approach, please don't hesitate to contact us or open a PR!
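As a rough illustration of the concatenation step described above, here is a minimal sketch of grouping headlines by stock and day and joining them into one prompt string. The field names (`ticker`, `date`, `headline`) and the character-based length cap are assumptions for illustration only, not the actual FinGPT pipeline; a real implementation would truncate by tokenizer token count rather than characters.

```python
# Hypothetical sketch: group news headlines per (ticker, date) and
# concatenate them into a single string, truncated to a crude length budget.
from collections import defaultdict

def build_daily_prompts(news_items, max_chars=512):
    """Group headlines by (ticker, date) and join them into one string.

    news_items: iterable of dicts with 'ticker', 'date', 'headline' keys
    max_chars: crude character cap standing in for a real token budget
    """
    grouped = defaultdict(list)
    for item in news_items:
        grouped[(item["ticker"], item["date"])].append(item["headline"])

    prompts = {}
    for key, headlines in grouped.items():
        text = " ; ".join(headlines)      # concatenate the day's headlines
        prompts[key] = text[:max_chars]   # truncate days with too much news
    return prompts

# Toy example (fabricated headlines, for illustration only)
news = [
    {"ticker": "AAPL", "date": "2023-06-01", "headline": "Apple unveils new product"},
    {"ticker": "AAPL", "date": "2023-06-01", "headline": "Analysts raise AAPL target"},
    {"ticker": "TSLA", "date": "2023-06-01", "headline": "Tesla opens new factory"},
]
prompts = build_daily_prompts(news)
print(prompts[("AAPL", "2023-06-01")])
# → Apple unveils new product ; Analysts raise AAPL target
```

Using full article content instead of headlines would follow the same grouping logic; the problem oliverwang15 describes is that the joined text then easily exceeds the model's context window.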

GitOutOfMyBed commented 8 months ago

How did you create the dataset? I don't see an API for that.