Open adrianshort opened 5 years ago
For Authority#scrape:
def scrape(params, options = {})
Note: params is for what to scrape (the search terms sent to the site and the output desired); options is for how to scrape (configuring the scraper's speed, user agent, etc).
If the fields a user specifies to exclude amount to a whole tab then we should omit scraping that tab.
True. And that's going to be a bunch of fun to code because different systems put their fields on different pages, so you'd need a data structure breaking down which systems, pages and fields correspond.
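A rough sketch of what such a data structure could look like, purely for discussion. The system names, tab names, field groupings and the tabs_to_scrape helper are all hypothetical; none of them come from the existing code.

```ruby
# Hypothetical layout: which fields live on which tab, per system.
# All system, tab and field groupings here are illustrative assumptions.
FIELDS_BY_TAB = {
  system_a: {
    summary: %i[council_reference address description],
    details: %i[applicant_name agent_name],
    dates:   %i[date_validated]
  },
  system_b: {
    main: %i[council_reference address description applicant_name agent_name]
  }
}.freeze

# Returns the tabs that still hold at least one wanted field.
# A tab is skipped only when every field on it has been excluded.
def tabs_to_scrape(system, excluded_fields)
  tabs = FIELDS_BY_TAB.fetch(system)
  tabs.reject { |_tab, fields| (fields - excluded_fields).empty? }.keys
end
```

With this shape, excluding both applicant_name and agent_name on system_a would drop the details tab entirely, while on system_b the single main tab would still have to be fetched.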
Mapping from differing data structures can be done thus:
ret = []
key_map = {
  :council_reference => :application_number,
  :date_validated    => :date_validated,
  :scraped_at        => :fetched_at,
  :info_url          => :detail_page_link,
  :address           => :site_address,
  :description       => :description_of_development,
  :documents_count   => :documents_count,
  :documents_url     => :documents_page_link
}
app.each do |app_hash|
  ret << app_hash.map { |k, v| [key_map[k], v] }.to_h
end
Allow users to specify which fields they do or don't want included in the output.
Add only and except to the params hash in Authority#scrape. Both of these should be a comma-separated list of field names.
Using only and except params at the same time throws an error.
We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like documents: true, which scrapes the contents of documents pages.
One specific use case is including or excluding personal data, eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like personal_data: false.
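A minimal sketch of how the only and except params might be applied to a single scraped application hash. The helper names parse_field_list and filter_fields are hypothetical, and raising ArgumentError is an assumption; the proposal only says that combining the two params throws an error.

```ruby
# Splits a comma-separated list such as "address, description" into symbols.
# A missing (nil) param yields an empty list.
def parse_field_list(list)
  list.to_s.split(",").map(&:strip).reject(&:empty?).map(&:to_sym)
end

# Applies the proposed only/except params to one application hash.
def filter_fields(app_hash, params)
  only = parse_field_list(params[:only])
  except = parse_field_list(params[:except])

  # Assumption: ArgumentError is used here; the error type is not specified.
  if only.any? && except.any?
    raise ArgumentError, "use either :only or :except, not both"
  end

  return app_hash.slice(*only) if only.any?

  app_hash.reject { |key, _value| except.include?(key) }
end
```

For example, filter_fields(app_hash, { except: "applicant_name, agent_name" }) would return the hash without those two keys. An option like personal_data: false could presumably be translated into the same except list internally, though that is not worked out here.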