ana-kuznetsova / Popular-Science-Texts-Compling-research

An M.A. educational project on computational linguistics.

The Thomas Crawl Affair #9

Closed nevmenandr closed 6 years ago

nevmenandr commented 6 years ago

The name of the issue means: let's crawl! We have two ways. The first way is to write a crawler (or spider, or scraper) ourselves. These links should be helpful: 1, 2, 3
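To make the first way concrete, here is a minimal sketch of a hand-written crawler that uses only the Python standard library. The names `LinkParser`, `extract_links`, `fetch_page` and `crawl` are hypothetical, and the choices to stay on the start URL's host and to decode pages as UTF-8 are assumptions; real sources will likely need politeness delays, robots.txt handling and error handling on top of this.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch_page(url):
    """Download a page; UTF-8 decoding is an assumption about the source."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch=fetch_page):
    """Breadth-first crawl that never leaves the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # url -> raw HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages[url] = fetch(url)
        for link in extract_links(pages[url], url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `fetch` parameter is there so the crawl logic can be tried on a dictionary of fake pages before pointing it at a real site.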

The second way is to use the most popular Python library for crawling: Scrapy. I hope this link will be good for a quick start.

So the agenda is this:

  1. Make the final list of sources. What should we add? What would be better to remove?
  2. Choose your way: crawl with your own code or learn the Scrapy API.
  3. Choose your sources to crawl. Try to keep as much metadata and markup as possible. You never know what you will need it for.
  4. Crawl your source.
  5. Profit!
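
One possible way to follow the advice in step 3 is to store each page as a record that holds the untouched HTML next to whatever metadata the page declares. The sketch below is only an illustration: `MetaParser`, `make_record` and `append_record` are hypothetical names, and the JSON Lines output format is an assumption, not something the project has agreed on.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect the <title> text and every <meta name/property ... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def make_record(url, html):
    """Bundle the raw markup with the metadata found in it."""
    parser = MetaParser()
    parser.feed(html)
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "title": parser.title.strip(),
        "meta": parser.meta,
        "html": html,  # raw markup, kept as downloaded
    }


def append_record(path, record):
    """Append one record per line (JSON Lines) so a crawl can be resumed."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the raw HTML means that any field nobody thought of during the crawl can still be extracted later without downloading the source again.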