arderyp / scotuswebcites

United States Supreme Court web citation discovery, presentation, and validation
GNU General Public License v3.0

new url extensions, new rules, and unit tests #40

Open arderyp opened 8 years ago

arderyp commented 8 years ago

New extensions to add to the URL glue logic:

  1. shtml

Add new tests to capture examples. To find more cases, download data from prod and check for records with "bad_scrape".

NEW RULES:

  1. If the current element ends in ";", don't glue the next element? See http://www.supremecourt.gov/opinions/11pdf/10-945.pdf
  2. If the next element starts with '/', keep gluing. See the second example, with bad spacing, here: http://www.supremecourt.gov/opinions/12pdf/11-465_g314.pdf
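
A rough sketch of how the two proposed rules might be expressed, to make them easier to discuss and test. This is not the project's existing glue code: the function name `proposed_rule_decision`, its signature, and the example strings are all hypothetical, and checking the ";" rule before the "/" rule is an assumption, since the ordering is not settled above. It returns `None` when neither rule applies, so the existing logic (including any extension checks such as shtml) would still decide.

```python
from typing import Optional


def proposed_rule_decision(current: str, next_element: str) -> Optional[bool]:
    """Hypothetical helper, not part of the existing codebase.

    Returns False to stop gluing, True to keep gluing, or None when
    neither proposed rule applies and the existing logic should decide.
    """
    # Proposed rule 1: a trailing ';' suggests the URL has ended.
    # Assumption: this rule takes precedence over rule 2.
    if current.endswith(";"):
        return False

    # Proposed rule 2: a leading '/' suggests the URL path continues
    # after a stray space in the scraped text.
    if next_element.startswith("/"):
        return True

    # Neither rule has an opinion; defer to the existing glue logic.
    return None


# Made-up inputs, not taken from the opinions linked above.
examples = [
    ("http://example.com/page;", "and"),
    ("http://example.com/docs", "/report.shtml"),
    ("http://example.com/docs", "see"),
]
for current, next_element in examples:
    print(current, next_element, proposed_rule_decision(current, next_element))
```

Returning a three-way result keeps the new rules separate from the current behavior, which should make it simpler to write one unit test per rule using the real "bad_scrape" examples.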