Add script to move Versionista data into a more accessible csv

KrishnaKulkarni commented 7 years ago

After some exploration today, I think I should be able to put together a ruby script that will scrape our versionista data and output a csv like the one that has been previously manually put together.

The beginnings of that script will look like this:

# Dependencies:
# phantomjs - `brew install phantomjs` (this assumes you have Homebrew)
# gems and Ruby 2.2.3 from Gemfile - `bundle install` (assuming you have the `bundler` gem)

# Login and visit the relevant pages
require 'capybara/poltergeist'

class Browser
  def self.new_session
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, js_errors: false)
    end

    # Configure Capybara to use Poltergeist as the driver
    Capybara.default_driver = :poltergeist

    Capybara.current_session
  end
end

class VersionistaBrowser
  attr_reader :session

  def initialize
    @session = Browser.new_session
  end

  def log_in(email:, password:)
    session.visit(log_in_url)
    session.fill_in("E-mail", with: email)
    session.fill_in("Password", with: password)
    session.click_button("Log in")
  end

  def scrape_each_page_version
    # Implementation TBD
    # Output: A hash whose keys are page urls and
    # whose values hashes of scraped data.
  end

  private

  def log_in_url
    url = "https://versionista.com/login"
  end
end

browser = VersionistaBrowser.new

browser.log_in(email: ENV.fetch("EMAIL"), password: ENV.fetch("PASSWORD"))
data = browser.scrape_each_page_version

# Write the CSV
require_relative 'csv_writer'

headers = [
  'Page Name',
  'URL',
  'Page View URL',
  'Comparison URL',
]
csv_writer = CSVWriter.new(filename_title: "versionista_data", headers: headers)

data.each do |url, scraped_data_hash|
# show_page_url = "https://versionista.com/74487/6243163/"
# data = {
#   'Page Name' => "Developers - Data.gov",
#   'URL' => "https://versionista.com/goto?https://www.data.gov/developers/",
#   'Page View URL' => show_page_url,
#   'Comparison URL' => "https://versionista.com/74487/6243163/9633207:0/",
# }
  csv_writer.add_rows(url: url, rows: [scraped_data_hash])
end

csv_writer.write!

I expect to be able to get an MVP of this script after work tomorrow (Jan 30), however, I do not yet know how performant this script will be. Fingers cross it can run in a reasonable amount of time.

KrishnaKulkarni commented 7 years ago

Update: I worked out a functioning ruby script that successfully scraped 5 archived pages from each of 5 websites that Versionista had crawled, and spit out relevant data into a csv. I'm having @ambergman review that to determine how the output should be altered, but I hope to submit a pull request with the first version of my cleaned up script tomorrow.

trinberg commented 7 years ago

Clarifications for Output Column headers:

"index" - generate some kind of unique id for each row
"Output Date/Time" - Datetime when script was run
"Site Name" - This is the name associated with the "site view" on versionista. (We will change all site names to look likes this: EPA - EIA... )
"Agency" - Tokenize "Site Name" and take the first part before the dash. (E.g. Site Name = EPA - EIA ---> Agency = EPA)
"URL" - The actual URL of the page being monitored
"Page View URL" - The Versionista URL associated with the Page View on the Versionista UI
"Last Two - Side by Side" - URL for the versionista comparison view of the latest two versions
"Latest to Base - Side by Side" - URL for the versionista comparison view of the latest to the original base version. This will require using the dropdown to compare the most recent version to the base.
"Date Found - Latest" - Datetime for latest found version
"Date Found - Base" - Datetime for original base version

Here's the output/example spreadsheet again: https://docs.google.com/spreadsheets/d/1V4TAEjvcjiiTvVHlqkiCEBLS49qqDBe7iWkrKqvQBm4/edit#gid=1326237029

ambergman commented 7 years ago

Thanks so much @KrishnaKulkarni, this is really great. For the output format, see @trinberg's comment above and the shared spreadsheet.

In addition here's what we discussed (with some minor additions):

Have each site (each separate "Site" is listed on the "Dashboard) output to its own CSV, which will all be stored together in a zipped folder.
Instead of scraping 5 archived pages from each site, scrape all of the pages that have changed in the last N hours (check the "Last New" column in the "Site View" to determine this) - where N is an input parameter when the script runs.
If possible, have the script run from the command line, accepting 3 parameters: user name, password, and N (hours since last page change). Taking in user name and password will be important as there are two different accounts we will be using.

jpmckinney commented 7 years ago

@KrishnaKulkarni In terms of getting the Comparison URL for Latest to Base - Side by Side, the logic for that in JavaScript is here: https://github.com/edgi-govdata-archiving/version-tracking-ui/blob/gh-pages/browser-tool.js#L34-L59 The dropdowns aren't actually rendered in the HTML, so the code pulls the necessary data from the JavaScript.

Note that, in my version, it's performing a "latest version to last version reviewed", instead of a "latest version to first version saved" (latest to base). I don't know if that functionality is still desired - it seems to be a good thing to have, since it allows reviewers to start where they left off instead of starting fresh every time they review.

Note that the :0 in the current Comparison URL is just a shortcut for using the actual next-most-recent ID (i.e. the URLs would redirect to the same page).

titaniumbones commented 7 years ago

@KrishnaKulkarni if you have something online by 6 maybe some volunteers at civic tech can start implementing site-specific filters for say 2-3 of the most urgent-feeling domains (@tringberg @ambergman if you could create/link to gists with the most common false positives in a couple of places, we might be able to assign them to someone. If not then we can also wait on this till next week -- I think w/ NYC coming up fast it will be hard to keep track of this after say Wed., unless @jpmckinney and @geppy feel like they have time (yes??).

BTW i see there's a free tier at versionista, so maybe devs who want to hack on this can sign up for a free account, add a few relevant pages (real positives, false positives) and start hacking? Or, I guess recording only starts once you sign up... Maybe we should make a free account 3 that we can give out creds for. In any case I'm reluctant to hand the account creds out will-nilly. Last week I temporarily changed the password, but that screwed the search team over until I changed them back.

edgi-govdata-archiving / version-tracking-ui

Add script to move Versionista data into a more accessible csv #10