eliangcs / pystock-crawler

(UNMAINTAINED) Crawl and parse financial reports (XBRL) from SEC EDGAR, and daily stock prices from Yahoo Finance
MIT License

"Cannot find context" after a long run #2

Closed eliangcs closed 9 years ago

eliangcs commented 10 years ago

After crawling EDGAR for hours with the pystock-crawler reports command, it is very likely that many warning messages like these show up in the log:

[scrapy] WARNING: Cannot find context: eol_PE5972----1310-Q0007_STD_273_20130930_0 in http://www.sec.gov/Archives/edgar/data/41719/000119312513438262/glt-20130930.xml
[scrapy] WARNING: Cannot find context: Y11Q4 in http://www.sec.gov/Archives/edgar/data/1041368/000093905712000032/rvsb-20111231.xml
[scrapy] WARNING: Cannot find context: D111001_120331 in http://www.sec.gov/Archives/edgar/data/1046050/000093905712000146/tsbk-20120331.xml

As a result, those reports contain many null values. Perhaps the crawler hits EDGAR too often, causing EDGAR to return bad content.
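
One way to test the bad-content hypothesis would be to check each downloaded document for dangling context references before parsing it. The following is only a minimal sketch using the Python standard library, not the crawler's actual code. The function name find_missing_contexts, the sample document, and the us-gaap namespace URI in it are hypothetical and for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace of the XBRL instance elements such as <context>.
XBRLI_NS = "http://www.xbrl.org/2003/instance"


def find_missing_contexts(xml_text):
    """Return the sorted contextRef values that have no matching <context>.

    Raises xml.etree.ElementTree.ParseError if the body is not well-formed
    XML, which is itself a hint that the server returned bad content.
    """
    root = ET.fromstring(xml_text)
    # IDs of all contexts the document actually defines.
    defined = {ctx.get("id") for ctx in root.iter("{%s}context" % XBRLI_NS)}
    # Context IDs that facts in the document point at.
    referenced = {el.get("contextRef") for el in root.iter() if el.get("contextRef")}
    return sorted(referenced - defined)


# Hypothetical, heavily trimmed document: one fact refers to a context
# that is never defined.
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <context id="Y11Q4"/>
  <us-gaap:Revenues contextRef="Y11Q4">100</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="D111001_120331">5</us-gaap:NetIncomeLoss>
</xbrl>"""

print(find_missing_contexts(SAMPLE))
```

If the same URL yields missing contexts during a long run but not when fetched on its own, that would point to the response body rather than the parser.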

eliangcs commented 9 years ago

There's another bug that may share the same root cause as this one. When running pystock-crawler reports for a long time, say crawling 5k+ symbols, many of the filings are skipped because the parser can't obtain the document type via xpath, even though I'm sure the document type is 10-Q or 10-K. The bug is reproducible only with a large list of input symbols, and it seems to happen to the same set of filings: those that come later in the crawling order.
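
To narrow this down, it may help to tell "the document has no document type" apart from "the response was not a usable XBRL document at all". Below is a rough sketch, assuming the document type is stored in an element whose local name is DocumentType. It uses the standard library instead of the crawler's xpath code, and the helper name get_document_type and the sample inputs are hypothetical.

```python
import xml.etree.ElementTree as ET


def get_document_type(xml_text):
    """Return the text of the first DocumentType element, or None.

    None is returned both when the body is not well-formed XML and when
    no non-empty DocumentType element is present.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    for el in root.iter():
        # ElementTree tags look like "{namespace}LocalName"; compare only
        # the local name so the namespace version does not matter.
        if el.tag.rsplit("}", 1)[-1] == "DocumentType":
            text = (el.text or "").strip()
            return text or None
    return None


# Hypothetical inputs: a trimmed filing and a truncated response body.
GOOD = (
    '<xbrl xmlns="http://www.xbrl.org/2003/instance" '
    'xmlns:dei="http://xbrl.sec.gov/dei/2013-01-31">'
    '<dei:DocumentType contextRef="c1">10-Q</dei:DocumentType>'
    "</xbrl>"
)
TRUNCATED = GOOD[:60]

print(get_document_type(GOOD))
print(get_document_type(TRUNCATED))
```

Logging the URL and the first few hundred bytes of the body whenever this returns None would show whether the skipped filings are truncated or replaced by an error page.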