bnjy-opengov / chi-solicitations-feed

Wrapping the City of Chicago's publishing of solicitations in a machine-readable way.

Scrape paginated data feature #1

Open pebreo opened 10 years ago

pebreo commented 10 years ago

Feature: Scrape data from the solicitations[1] and contracts[2] pages

Scenario: User starts a page crawl
  Given a certain url
  When the page loads
  Then the table data should be copied into a data structure 

Scenario: User continues a page crawl
  Given a page has a "next" link
  When the program is done getting the table data from the current page
  Then the program should continue to the next page
  And copy the table data into a data structure

[1] https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/pagination.do#searchResults

[2] https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/contracts/search.do

nrrb commented 10 years ago

I would call this, more generally, the "Scrape Paginated Data Feature". With that name we can search for design patterns for scraping sites that paginate their data. For instance:

http://www.hackhowtofaq.com/blog/web-scraping-with-scrapy/
http://philipcodings.com/post/recursively-crawling-a-website-with-python-and-scrapy#.Uq9ZTZBDu-4
http://stackoverflow.com/questions/17937545/scrapy-pagination-selenium-python
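
As a rough illustration of the pattern the two scenarios describe, here is a minimal sketch using only the Python standard library rather than Scrapy. The names TablePageParser, crawl, fetch and PAGES are hypothetical, and the markup it expects (plain table rows plus a link whose text is "next") is an assumption, not something confirmed against the city's pages. The real pages may also need form submissions or session cookies, which this sketch does not handle.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TablePageParser(HTMLParser):
    """Collects table rows and the href of a link whose text is "next"."""

    def __init__(self):
        super().__init__()
        self.rows = []           # parsed rows, each a list of cell strings
        self.next_href = None    # href of the "next" link, if one was found
        self._row = None         # cells of the row currently being parsed
        self._cell = None        # text fragments of the current cell
        self._link_href = None   # href of the <a> currently being parsed
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a":
            self._link_href = dict(attrs).get("href")
            self._link_text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        if self._link_href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "a":
            text = "".join(self._link_text).strip().lower()
            if text == "next" and self._link_href:
                self.next_href = self._link_href
            self._link_href = None


def crawl(start_url, fetch, max_pages=100):
    """Follow "next" links from start_url and return all table rows found.

    fetch is any callable mapping a URL to an HTML string (an assumption
    made so the sketch can run without network access).
    """
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = TablePageParser()
        parser.feed(fetch(url))
        rows.extend(parser.rows)  # "copy the table data into a data structure"
        # "continue to the next page" only if the page has a "next" link
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return rows


# Made-up in-memory pages standing in for the real site.
PAGES = {
    "http://example.test/list?page=1": (
        "<table><tr><th>Spec</th><th>Title</th></tr>"
        "<tr><td>101</td><td>Road salt</td></tr></table>"
        "<a href='/list?page=2'>Next</a>"
    ),
    "http://example.test/list?page=2": (
        "<table><tr><td>102</td><td>Street lamps</td></tr></table>"
    ),
}

result = crawl("http://example.test/list?page=1", PAGES.__getitem__)
```

Passing fetch in as a parameter keeps the pagination loop separate from how pages are downloaded, so the same loop could be tried against saved HTML first and a real HTTP client later. The seen set and max_pages cap are there only to stop the crawl if a "next" link ever loops back on itself.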

pebreo commented 10 years ago

I'll see you tomorrow to discuss the nomenclature :)