About pre-training dataset

BUILDERlym commented 8 months ago

I only found the two MISC datasets. Where can I find the remaining 8 datasets? Thanks!

deep-over commented 8 months ago

Hello. The two MISC datasets are example datasets. Due to copyright issues, we are unable to provide the datasets directly. All the datasets used are described in Appendix A of the paper.

1) TRC2 and AIHUB can be downloaded directly. Please refer to the links in the paper's appendix.

2) The datasets from Investing.com, NYtimes, EIA, Earnings call, Arxiv, FinWEB, and Investopedia were collected by crawling the websites referenced in the paper.

3) SEC filings can be crawled or downloaded directly. If you download directly from the SEC, you can only use the dataset up to around 2019.

edgar-corpus is available at: https://zenodo.org/record/5528490
EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021)

BUILDERlym commented 8 months ago

Gotcha. For earnings calls, do you mind sharing your crawling code? I tried some crawlers, but they were all blocked by captcha. Thanks!

deep-over commented 8 months ago

I used the following GitHub as a reference for my crawling:

https://github.com/RCJansonVTFL/SeekingAlphaWebScrape

During the process, I was blocked many times, so I increased the waiting time in order to continue. It seems that, compared to the past, the site now blocks crawlers more frequently.

Below is the code I used, which you might find useful for collecting data: a_data_collect_alpha_crawler.zip
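
For readers who cannot open the attachment, the general idea of waiting longer after each block might be sketched roughly as follows. This is only an illustration and not the code from the zip file: `fetch_with_backoff`, its parameters, and the convention that `fetch` returns `None` when a request is blocked are all assumptions made for the example.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], Optional[str]],
    url: str,
    max_attempts: int = 5,
    base_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); after each blocked attempt, wait longer before retrying.

    `fetch` is a hypothetical callable that returns the page text, or None
    when the request was blocked (for example, a captcha page was served).
    """
    wait = base_wait
    for attempt in range(1, max_attempts + 1):
        page = fetch(url)
        if page is not None:
            return page
        if attempt < max_attempts:
            # Add up to 10% random jitter so requests are not evenly spaced.
            sleep(wait + random.uniform(0, wait * 0.1))
            wait *= 2  # double the waiting time after every block
    return None
```

The starting wait of 30 seconds and the doubling factor are arbitrary placeholders; suitable values depend on how aggressively the site blocks, and the site's terms of use should be checked before crawling.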

BUILDERlym commented 8 months ago

Really appreciate it!

deep-over / FiLM
