adrianshort / uk_planning_scraper

A Ruby gem to get planning applications data from UK council websites.
GNU Lesser General Public License v3.0

Filter output field list #20

Open adrianshort opened 5 years ago

adrianshort commented 5 years ago

Allow users to specify which fields they do or don't want included in the output.

Add only and except to the params hash in Authority#scrape.

Both of these should be a comma-separated list of field names.

Using only and except params at the same time should raise an error.
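A rough sketch of how this filtering might behave, assuming each scraped application is a plain Hash with symbol keys. The helper name `filter_fields` is hypothetical and not part of the gem; the real change would live inside Authority#scrape.

```ruby
# Hypothetical helper, not part of uk_planning_scraper.
# Assumes an application is a Hash with symbol keys and that
# :only / :except arrive as comma-separated strings of field names.
def filter_fields(application, params = {})
  only = params[:only]
  except = params[:except]

  # The two params are mutually exclusive.
  if only && except
    raise ArgumentError, "Use either :only or :except, not both"
  end

  # Turn "address, description" into [:address, :description].
  parse = ->(list) { list.split(",").map { |name| name.strip.to_sym } }

  if only
    application.slice(*parse.call(only))
  elsif except
    excluded = parse.call(except)
    application.reject { |key, _value| excluded.include?(key) }
  else
    application
  end
end

app = { council_reference: "19/0001", address: "1 High St", description: "Extension" }
filter_fields(app, only: "council_reference, address")
filter_fields(app, except: "description")
```

Both calls at the end return the same two fields, `council_reference` and `address`. The sample values are made up.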

We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like documents: true which scrapes the contents of documents pages.

One specific use case is including or excluding personal data eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like personal_data: false.
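As a sketch only, a `personal_data` option could be a thin wrapper over a fixed list of excluded fields. The field names in `PERSONAL_DATA_FIELDS`, the helper name `strip_personal_data` and the default of `true` are all assumptions for illustration; the real list would depend on what each council system exposes.

```ruby
# Hypothetical list of personal-data fields (assumed names).
PERSONAL_DATA_FIELDS = %i[applicant_name agent_name agent_email agent_phone].freeze

# Hypothetical helper: returns the application unchanged when
# personal_data is true, otherwise drops the fields listed above.
def strip_personal_data(application, personal_data: true)
  return application if personal_data

  application.reject { |key, _value| PERSONAL_DATA_FIELDS.include?(key) }
end

app = { council_reference: "19/0001", applicant_name: "A. Person", agent_email: "agent@example.com" }
strip_personal_data(app, personal_data: false)
```

With `personal_data: false`, only `council_reference` remains in the result.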

adrianshort commented 5 years ago

For Authority#scrape:

def scrape(params, options = {})

Note:

KeithP commented 5 years ago

If the fields a user chooses to exclude amount to a whole tab, then we should skip scraping that tab.

adrianshort commented 5 years ago

True. And that's going to be a bunch of fun to code because different systems put their fields on different pages, so you'd need a data structure breaking down which systems, pages and fields correspond.
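One possible shape for that data structure, as a sketch: a nested hash of system, then page, then the fields found on that page. The system names, page names and field placement below are invented for illustration and do not describe any real council system. A page is skipped when every field on it is excluded.

```ruby
# Hypothetical layout: system => page => fields found on that page.
FIELDS_BY_PAGE = {
  system_a: {
    summary:   %i[council_reference address description],
    contacts:  %i[applicant_name agent_name],
    documents: %i[documents_count documents_url]
  },
  system_b: {
    summary:   %i[council_reference address description applicant_name agent_name],
    documents: %i[documents_count documents_url]
  }
}.freeze

# Returns the pages that still hold at least one wanted field
# after removing the excluded fields.
def pages_to_scrape(system, except: [])
  FIELDS_BY_PAGE.fetch(system)
                .reject { |_page, fields| (fields - except).empty? }
                .keys
end

pages_to_scrape(:system_a, except: %i[applicant_name agent_name])
pages_to_scrape(:system_b, except: %i[applicant_name agent_name])
```

For `system_a` the `contacts` page drops out entirely. For `system_b` the same fields sit on the `summary` page alongside others, so that page still has to be fetched.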

KeithP commented 5 years ago

Mapping between differing data structures can be done like this:

        # Source field name => target field name.
        key_map = { council_reference: :application_number,
                    date_validated:    :date_validated,
                    scraped_at:        :fetched_at,
                    info_url:          :detail_page_link,
                    address:           :site_address,
                    description:       :description_of_development,
                    documents_count:   :documents_count,
                    documents_url:     :documents_page_link }

        # `app` is a collection of application hashes; rename each hash's keys.
        # Keys missing from key_map end up as nil.
        ret = app.map do |app_hash|
          app_hash.map { |k, v| [key_map[k], v] }.to_h
        end