deep-over / FiLM

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

About pre-training dataset #7

Closed BUILDERlym closed 8 months ago

BUILDERlym commented 8 months ago

I only found two MISC datasets and would like to know where I can find the remaining 8 datasets. Thanks!

deep-over commented 8 months ago

Hello, the two MISC datasets are example datasets. Due to copyright issues, we are unable to provide the datasets directly. All the datasets used are described in Appendix A of the paper.

1) TRC2 and AIHUB can be downloaded directly.

2) The datasets from Investing.com, NYtimes, EIA, Earnings call, Arxiv, FinWEB, and Investopedia were collected by crawling the websites referenced in the paper.

3) SEC filings can be crawled or downloaded directly. If you directly download from the SEC, you can only use the dataset up until ~2019.

BUILDERlym commented 8 months ago

Gotcha. For Earnings calls, do you mind sharing your crawling code? I tried a few, but they were all blocked by CAPTCHA. Thanks!

deep-over commented 8 months ago

I used the following GitHub repository as a reference for my crawling:

https://github.com/RCJansonVTFL/SeekingAlphaWebScrape

During the process I was blocked many times, so I increased the waiting time between requests in order to continue. It seems to me that, compared to the past, the site now blocks crawlers more frequently.

Below is the code I used, which you might find useful for collecting data: a_data_collect_alpha_crawler.zip
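
The attached archive is not reproduced here. For readers who only want the general idea of "increase the waiting time when blocked", here is a minimal sketch of that approach. It is not the code from the zip file. Every name in it (`fetch_with_backoff`, `BlockedError`, the default wait values) is hypothetical, and the caller is assumed to supply a `fetch` function that raises `BlockedError` when it detects a block page or CAPTCHA.

```python
import random
import time
from typing import Callable, Optional


class BlockedError(Exception):
    """Hypothetical signal: raised by the caller's fetch function when blocked."""


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    base_wait: float = 5.0,     # assumed starting wait in seconds
    factor: float = 2.0,        # each blocked attempt multiplies the wait
    max_wait: float = 300.0,    # upper bound on the wait
    max_tries: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), waiting longer after each blocked attempt.

    Returns the fetched text, or None if every attempt was blocked.
    """
    wait = base_wait
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except BlockedError:
            if attempt == max_tries - 1:
                break  # no point sleeping after the final attempt
            # Add up to 10% random jitter so requests are not evenly spaced.
            sleep(wait + random.uniform(0, wait * 0.1))
            wait = min(wait * factor, max_wait)
    return None
```

The `sleep` parameter exists only so the waiting behaviour can be checked without real delays. In actual use, `fetch` would wrap whatever HTTP client or browser automation the crawler relies on, and the wait values would need tuning against the target site's terms of use and rate limits.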

BUILDERlym commented 8 months ago

Really appreciate it!