edgi-govdata-archiving / version-tracking-ui

ARCHIVED--Bookmarklet to modify UI for Versionista website monitoring
MIT License

Add script to move Versionista data into a more accessible csv #10

Open KrishnaKulkarni opened 7 years ago

KrishnaKulkarni commented 7 years ago

After some exploration today, I think I should be able to put together a Ruby script that will scrape our Versionista data and output a CSV like the one that has previously been put together manually.

The beginnings of that script will look like this:

# Dependencies:
# phantomjs - `brew install phantomjs` (this assumes you have Homebrew)
# gems and Ruby 2.2.3 from Gemfile - `bundle install` (assuming you have the `bundler` gem)

# Login and visit the relevant pages
require 'capybara/poltergeist'

class Browser
  def self.new_session
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, js_errors: false)
    end

    # Configure Capybara to use Poltergeist as the driver
    Capybara.default_driver = :poltergeist

    Capybara.current_session
  end
end

class VersionistaBrowser
  attr_reader :session

  def initialize
    @session = Browser.new_session
  end

  def log_in(email:, password:)
    session.visit(log_in_url)
    session.fill_in("E-mail", with: email)
    session.fill_in("Password", with: password)
    session.click_button("Log in")
  end

  def scrape_each_page_version
    # Implementation TBD -- see the hypothetical sketch after this script.
    # Output: a hash whose keys are page URLs and
    # whose values are hashes of scraped data.
  end

  private

  def log_in_url
    "https://versionista.com/login"
  end
end

browser = VersionistaBrowser.new

browser.log_in(email: ENV.fetch("EMAIL"), password: ENV.fetch("PASSWORD"))
data = browser.scrape_each_page_version

# Write the CSV
require_relative 'csv_writer'

headers = [
  'Page Name',
  'URL',
  'Page View URL',
  'Comparison URL',
]
csv_writer = CSVWriter.new(filename_title: "versionista_data", headers: headers)

data.each do |url, scraped_data_hash|
  # Example of one url / scraped_data_hash pair:
  # show_page_url = "https://versionista.com/74487/6243163/"
  # scraped_data_hash = {
  #   'Page Name' => "Developers - Data.gov",
  #   'URL' => "https://versionista.com/goto?https://www.data.gov/developers/",
  #   'Page View URL' => show_page_url,
  #   'Comparison URL' => "https://versionista.com/74487/6243163/9633207:0/",
  # }
  csv_writer.add_rows(url: url, rows: [scraped_data_hash])
end

csv_writer.write!
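
For reference, here is one hypothetical shape the TBD scrape_each_page_version could take. The landing-page URL and the a.page-link selector below are placeholders, not taken from Versionista's actual markup:

class VersionistaBrowser
  def scrape_each_page_version
    # Placeholder URL: wherever Versionista lists the monitored pages.
    session.visit("https://versionista.com/home")

    # "a.page-link" is an assumed selector, not Versionista's real one.
    session.all("a.page-link").each_with_object({}) do |link, data|
      page_url = link[:href]
      data[page_url] = {
        'Page Name' => link.text,
        'URL' => page_url,
        # 'Page View URL' and 'Comparison URL' would be scraped from
        # each page's detail view in the same way.
      }
    end
  end
end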

I expect to have an MVP of this script after work tomorrow (Jan 30); however, I do not yet know how performant it will be. Fingers crossed it can run in a reasonable amount of time.
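
Likewise, the csv_writer file required in the script isn't shown. A minimal sketch of what it might look like, assuming only the interface the script uses (filename_title: and headers: on initialize, plus add_rows and write!) and Ruby's standard csv library:

require 'csv'

# Hypothetical csv_writer.rb; nothing below is taken from the actual file.
class CSVWriter
  def initialize(filename_title:, headers:)
    @filename = "#{filename_title}.csv"
    @headers = headers
    @rows = []
  end

  # Buffer one CSV row per scraped-data hash. The url: keyword mirrors
  # the call site above; each row hash already carries its own URLs.
  def add_rows(url:, rows:)
    rows.each do |row_hash|
      @rows << @headers.map { |header| row_hash[header] }
    end
  end

  def write!
    CSV.open(@filename, "w") do |csv|
      csv << @headers
      @rows.each { |row| csv << row }
    end
  end
end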

KrishnaKulkarni commented 7 years ago

Update: I worked out a functioning Ruby script that successfully scraped 5 archived pages from each of 5 websites that Versionista had crawled and spit out the relevant data into a CSV. I'm having @ambergman review it to determine how the output should be altered, but I hope to submit a pull request with the first version of my cleaned-up script tomorrow.

trinberg commented 7 years ago

Clarifications for the output column headers:

Here's the output/example spreadsheet again: https://docs.google.com/spreadsheets/d/1V4TAEjvcjiiTvVHlqkiCEBLS49qqDBe7iWkrKqvQBm4/edit#gid=1326237029

ambergman commented 7 years ago

Thanks so much @KrishnaKulkarni, this is really great. For the output format, see @trinberg's comment above and the shared spreadsheet.

In addition, here's what we discussed (with some minor additions):

jpmckinney commented 7 years ago

@KrishnaKulkarni In terms of getting the Comparison URL for Latest to Base - Side by Side, the logic for that in JavaScript is here: https://github.com/edgi-govdata-archiving/version-tracking-ui/blob/gh-pages/browser-tool.js#L34-L59. The dropdowns aren't actually rendered in the HTML, so the code pulls the necessary data from the JavaScript.

Note that, in my version, it's performing a "latest version to last version reviewed", instead of a "latest version to first version saved" (latest to base). I don't know if that functionality is still desired - it seems to be a good thing to have, since it allows reviewers to start where they left off instead of starting fresh every time they review.

Note that the :0 in the current Comparison URL is just a shortcut for using the actual next-most-recent ID (i.e. the URLs would redirect to the same page).
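
For the Ruby script, the same trick could go through Poltergeist's JavaScript bridge (Capybara's evaluate_script). A hypothetical sketch -- the window.versions variable name and the URL shape are assumptions based on this comment, not verified against Versionista's actual page source:

# Hypothetical helper; site_id and page_id would come from the page list.
def latest_to_base_comparison_url(session, site_id:, page_id:)
  # Assumed variable name; the real one would come from reading the page
  # source the way browser-tool.js does.
  version_ids = session.evaluate_script("window.versions")
  latest = version_ids.first
  # ":0" is the shortcut for the next-most-recent version ID, as noted above.
  "https://versionista.com/#{site_id}/#{page_id}/#{latest}:0/"
end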

titaniumbones commented 7 years ago

@KrishnaKulkarni if you have something online by 6, maybe some volunteers at civic tech can start implementing site-specific filters for, say, 2-3 of the most urgent-feeling domains. @trinberg @ambergman, if you could create/link to gists with the most common false positives in a couple of places, we might be able to assign them to someone. If not, we can also wait on this till next week -- I think with NYC coming up fast it will be hard to keep track of this after, say, Wednesday, unless @jpmckinney and @geppy feel like they have time (yes??).

BTW I see there's a free tier at Versionista, so maybe devs who want to hack on this can sign up for a free account, add a few relevant pages (real positives, false positives), and start hacking? Or, I guess, recording only starts once you sign up... Maybe we should make a third free account that we can give out creds for. In any case, I'm reluctant to hand the account creds out willy-nilly. Last week I temporarily changed the password, but that screwed the search team over until I changed it back.