BUILDERlym closed this 8 months ago
Hello, the two MISC datasets are example datasets. Due to copyright issues, we are unable to provide the datasets directly. All the datasets used are described in Appendix A of the paper.
1) TRC2 and AIHUB can be downloaded directly.
2) The datasets from Investing.com, NYtimes, EIA, Earnings call, Arxiv, FinWEB, and Investopedia were collected by crawling the websites referenced in the paper.
3) SEC filings can be crawled or downloaded directly. If you download directly from the SEC, you can only get the dataset up until ~2019.
Gotcha. For Earnings calls, do you mind sharing your crawling code? I tried a few approaches, but they all got blocked by CAPTCHA. Thanks!
I used the following GitHub repository as a reference for my crawling:
https://github.com/RCJansonVTFL/SeekingAlphaWebScrape
During the process I was blocked many times, so I increased the waiting time to continue. The site now seems to block crawlers more frequently than it used to.
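A minimal sketch of the "increase the waiting time when blocked" idea, for illustration only. It is not the code from the attachment or the referenced repository. The names `Blocked`, `fetch_with_backoff`, and all parameter defaults are hypothetical; the caller supplies its own `fetch` function, which is assumed to raise `Blocked` when it detects a CAPTCHA or block page.

```python
import random
import time
from typing import Callable, Optional


class Blocked(Exception):
    """Hypothetical signal: raised by the caller's fetch function on a block/CAPTCHA page."""


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    base_wait: float = 5.0,      # assumed starting wait in seconds
    factor: float = 2.0,         # wait is multiplied by this after each block
    max_wait: float = 600.0,     # upper bound on the wait
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); on Blocked, wait longer each time, then retry.

    Returns the fetched text, or None if every attempt was blocked.
    """
    wait = base_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Blocked:
            if attempt == max_attempts:
                return None
            # Add up to 10% random jitter so requests are not evenly spaced.
            sleep(wait + random.uniform(0, wait * 0.1))
            wait = min(wait * factor, max_wait)
    return None
```

With the defaults above, the waits grow roughly as 5 s, 10 s, 20 s, and so on, capped at 10 minutes. Passing a custom `sleep` makes the function easy to try out without actually waiting.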
Below is the code I used, which you might find useful for collecting data.
a_data_collect_alpha_crawler.zip
really appreciate it!
I only found two MISC datasets and would like to know where I can find the remaining 8 datasets. Thanks!