eliangcs / pystock-crawler

(UNMAINTAINED) Crawl and parse financial reports (XBRL) from SEC EDGAR, and daily stock prices from Yahoo Finance
MIT License

"Cannot find context" after a long run #2

Closed eliangcs closed 9 years ago

eliangcs commented 10 years ago

After crawling EDGAR for hours with the pystock-crawler reports command, there is a good chance that many warning messages like these show up in the log:

[scrapy] WARNING: Cannot find context: eol_PE5972----1310-Q0007_STD_273_20130930_0 in http://www.sec.gov/Archives/edgar/data/41719/000119312513438262/glt-20130930.xml
[scrapy] WARNING: Cannot find context: Y11Q4 in http://www.sec.gov/Archives/edgar/data/1041368/000093905712000032/rvsb-20111231.xml
[scrapy] WARNING: Cannot find context: D111001_120331 in http://www.sec.gov/Archives/edgar/data/1046050/000093905712000146/tsbk-20120331.xml

As a result, those reports end up with many null values. Perhaps the crawler hits EDGAR too often, causing EDGAR to return bad content.
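One way to test the "bad content" hypothesis is to check a saved response body directly. The following is only a rough sketch, not the crawler's actual code: `find_missing_contexts` is a hypothetical helper, and it assumes the filing is an XBRL 2.1 instance whose root element is `xbrl` in the standard instance namespace. It returns `None` when the body is not an XBRL instance at all (for example an error page), and otherwise returns the `contextRef` values that have no matching context element.

```python
import xml.etree.ElementTree as ET

# Standard XBRL 2.1 instance namespace.
XBRLI_NS = "http://www.xbrl.org/2003/instance"


def find_missing_contexts(body):
    """Hypothetical helper: report contextRef values with no matching context.

    Returns None if the body is not well-formed XML or its root is not an
    XBRL instance, which would suggest the server returned bad content.
    Otherwise returns a (possibly empty) set of unresolved context ids.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # truncated or non-XML response

    if root.tag != "{%s}xbrl" % XBRLI_NS:
        return None  # well-formed, but not an XBRL instance

    # Ids declared by <xbrli:context id="..."> elements.
    defined = {ctx.get("id") for ctx in root.iter("{%s}context" % XBRLI_NS)}
    # Ids that facts refer to through their contextRef attribute.
    referenced = {el.get("contextRef") for el in root.iter() if el.get("contextRef")}
    return referenced - defined
```

If this returns `None` for the filings that produce the warnings, the response itself was bad. If it returns a non-empty set for a freshly downloaded copy of the same URL, the contexts are genuinely missing from the filing and throttling is not the cause.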

eliangcs commented 10 years ago

There's another bug that may share the same root cause. When pystock-crawler reports runs for a long time, say crawling 5k+ symbols, many filings are skipped because the parser can't obtain the document type via XPath, even though I'm sure the document type is 10-Q or 10-K. The bug is reproducible only with a large list of input symbols, and it seems to affect the same set of filings: those that come later in the crawling order.
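To narrow this down, it may help to separate "the response could not be parsed" from "the response parsed but has no document type". The sketch below is an assumption-laden illustration rather than the crawler's real XPath logic: `extract_document_type` is a hypothetical helper that looks for any element whose local name is `DocumentType` (the DEI element that normally carries values such as 10-Q or 10-K), ignoring the namespace because the DEI namespace URI differs between taxonomy versions.

```python
import xml.etree.ElementTree as ET


def extract_document_type(body):
    """Hypothetical helper: return (document_type, reason) for a filing body.

    document_type is None when nothing usable was found; reason says why,
    so a log can tell bad responses apart from filings that lack the field.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return None, "unparseable: %s" % exc

    for el in root.iter():
        # Tags look like "{namespace}LocalName"; compare the local name only.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "DocumentType":
            text = (el.text or "").strip()
            if text:
                return text, "ok"

    return None, "no DocumentType element"
```

Logging the reason for each skipped filing would show whether the later filings in the crawl are being skipped because of unparseable responses, which would point back to the same root cause as the missing contexts.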