freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Bunch of DeprecationWarnings in Python3 due to invalid escape sequences #179

Open voutilad opened 7 years ago

voutilad commented 7 years ago

Some scrapers still write regex patterns as ordinary string literals containing escapes such as \d, which raise DeprecationWarnings on Python 3 and could become a real problem in Py3.7+. I guess I didn't catch these before.

The simple fix is to turn these string literals into raw string literals.
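
As a rough illustration of that fix (not the project's actual patch), the sketch below takes the date pattern reported for ca8.py in the output further down and writes it as a raw string. The sample input line is made up for the example.

```python
import re

# Pattern copied from the ca8.py warning, with an r prefix added.
# In a raw string the backslash is kept literally, so \d reaches the
# regex engine unchanged and Python has no escape sequence to flag.
case_date_regex = re.compile(r'(\d{2}/\d{2}/\d{4})(.*)')

# Hypothetical input line, for illustration only.
match = case_date_regex.search('01/15/2019 Smith v. Jones')
date_text, rest = match.group(1), match.group(2)

# The raw string holds the same characters as the fully escaped
# ordinary string, so the regex behaviour does not change.
same_pattern = r'(\d{2})' == '(\\d{2})'
```

Because the two spellings produce identical strings, adding the prefix should be a behaviour-preserving change for patterns like these.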

Finds all the $module_example* files and tests them with the sample ...
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_appellate/ca8.py:23: DeprecationWarning: invalid escape sequence \d
  case_name_regex = re.compile('(\d{2}/\d{2}/\d{4})(.*)')
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_appellate/ca8.py:33: DeprecationWarning: invalid escape sequence \d
  case_date_regex = re.compile('(\d{2}/\d{2}/\d{4})(.*)')
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_appellate/ca8.py:41: DeprecationWarning: invalid escape sequence \d
  docket_number_regex = re.compile('(\d{2})(\d{4})(u|p)', re.IGNORECASE)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_district/dcd.py:81: DeprecationWarning: invalid escape sequence \?
  regex = re.compile('(\?)(\d+)([a-z]+)(\d+)(-)(.*)')
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_district/dcd.py:101: DeprecationWarning: invalid escape sequence \s
  judge = re.search('(by\s)(.*)', judge_string, re.MULTILINE).group(2)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_district/dcd.py:113: DeprecationWarning: invalid escape sequence \?
  regex = '(\?)(\d+)([a-z]+)(\d+)(\-)(.*)'
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/federal_special/acca_p.py:21: DeprecationWarning: invalid escape sequence \d
  self.docket_case_name_splitter = re.compile('(.*[\dX]{5,8})(.*)')
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/fla.py:22: DeprecationWarning: invalid escape sequence \d
  self.regex = re.compile("(S?C\d+-\d+)(.*)")
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/fladistctapp_3.py:72: DeprecationWarning: invalid escape sequence \d
  text = re.search('(\d{2}-\d{2}-\d{4})', text).group(1)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/fladistctapp_5.py:31: DeprecationWarning: invalid escape sequence \d
  self.case_regex = '(5D.*-.*\d{1,3})([- ]+[A-Za-z].*)'
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/miss.py:38: DeprecationWarning: invalid escape sequence \d
  date_re = re.compile('(\d{2}-\d{2}-\d{4})')
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nc.py:37: DeprecationWarning: invalid escape sequence \d
  date_cleaner = "\d+ \w+ [12][90]\d\d"
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nc.py:105: DeprecationWarning: invalid escape sequence \(
  download_url = re.search('viewopinion\("(.*)"',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nc.py:71: DeprecationWarning: invalid escape sequence \(
  'viewopinion\("(.*)"',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nc.py:130: DeprecationWarning: invalid escape sequence \d
  docket_number = re.search('(.*\d).*?',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nc.py:135: DeprecationWarning: invalid escape sequence \d
  if not re.search('^\d\d.*\d\d$', neutral_cite):
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/nd.py:47: DeprecationWarning: invalid escape sequence \d
  citation_pattern = '^.{0,5}(\d{4} ND (?:App )?\d{1,4})'
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/or.py:29: DeprecationWarning: invalid escape sequence \d
  docket_numbers.append(' & '.join(re.findall('S\d+', s)))
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/pacommwct.py:24: DeprecationWarning: invalid escape sequence \s
  self.set_regex("(.*)(?:- |et al.\s+)(\d+.*\d{4})")
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/ri_p.py:82: DeprecationWarning: invalid escape sequence \(
  regex = '(.*?)(\((\w+\s+\d+\,\s+\d+)\))(.*?)'
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/ri_p.py:101: DeprecationWarning: invalid escape sequence \s
  '(.*?)(,?\sNos?\.)(.*?)',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/ri_p.py:103: DeprecationWarning: invalid escape sequence \s
  '(.*?)(,?\s\d+-\d+(,|\s))(.*?)',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/ri_p.py:106: DeprecationWarning: invalid escape sequence \s
  '(.*?)(,?\s(?:\w+-)?\d+-\d+(,|\s))(.*?)',
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/sd.py:46: DeprecationWarning: invalid escape sequence \d
  case_name = re.search('(.*)(\d{4} S\.?D\.? \d{1,4})', s, re.MULTILINE).group(1)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states/state/sd.py:62: DeprecationWarning: invalid escape sequence \d
  neutral_cite = re.search('(.*)(\d{4} S\.?D\.? \d{1,4})', s, re.MULTILINE).group(2)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states_backscrapers/federal_district/dcd_2013.py:101: DeprecationWarning: invalid escape sequence \s
  judge = re.search('(by\s)(.*)', judge_string, re.MULTILINE).group(2)
/Users/dave/src/freelawproject/juriscraper/juriscraper/opinions/united_states_backscrapers/federal_district/dcd_2013.py:113: DeprecationWarning: invalid escape sequence \?
  regex = '(\?)(\d+)([a-z]+)(\d+)(\-)(.*)'
/Users/dave/src/freelawproject/juriscraper/juriscraper/oral_args/united_states/federal_appellate/ca3.py:20: DeprecationWarning: invalid escape sequence \d
  self.regex = '(\d{2}-\d{3,4})?(.+)\.(:?(wma)|(mp3))'
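
One possible way to list offenders like the ones above without running the whole test suite is to compile each source file with warnings recorded. This is a minimal sketch, not part of Juriscraper: `find_invalid_escapes` is a hypothetical helper, and it assumes Python 3.6 or later, where the compiler reports invalid escapes as a DeprecationWarning (a SyntaxWarning from 3.12 on).

```python
import warnings

def find_invalid_escapes(source, filename="<string>"):
    """Return (line number, message) for each escape-sequence warning.

    Hypothetical helper: compiles the source text without executing it
    and collects the warnings the compiler emits along the way.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, filename, "exec")
    return [
        (w.lineno, str(w.message))
        for w in caught
        if issubclass(w.category, (DeprecationWarning, SyntaxWarning))
        and "escape sequence" in str(w.message)
    ]

# Source text modelled on the miss.py line in the output above; the outer
# raw strings keep the backslashes intact in the snippets being checked.
bad_source = r"date_re = '(\d{2}-\d{2}-\d{4})'"
good_source = r"date_re = r'(\d{2}-\d{2}-\d{4})'"

bad_hits = find_invalid_escapes(bad_source)
good_hits = find_invalid_escapes(good_source)
```

To check a real file, its text and path could be passed in place of the sample strings; the helper only compiles the code and never runs it.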

janderse commented 6 years ago

There are a lot more problems than just this when trying to run Juriscraper in Python 3. I tried running:

python3 setup.py test

Is compatibility with both Python 2 and Python 3 desired?

mlissner commented 6 years ago

Py3 is desired in the broad sense, but "nobody" is asking for it yet. Until CourtListener itself is Py3 ready, porting Juriscraper is good, but not a huge priority. IIRC, we turned off Travis testing for Py3 a while back, with a lot of sadness.

All of that said, I'm totally in favor of and enthusiastic about Py3 compatibility, especially if, like here, it sounds fairly easy.

jcrben commented 6 years ago

In https://github.com/freelawproject/juriscraper/commit/c7b6fef5b9e7177481542be7651ac35aa5571aa3, @mlissner cited requests-mock (which now seems to support Python 3 over at https://github.com/jamielennox/requests-mock) and jsondate, which doesn't seem likely to get updated on its own (https://github.com/rconradharris/jsondate/issues/7).

mlissner commented 6 years ago

If jsondate is the only issue, I wonder if the easiest path here is either:

I think either should be fairly simple. I don't think we do a ton with jsondate.

Thanks for doing the digging on this, @jcrben.