UChicago-Coase-Sandor / pacer_lib

http://pacer-lib.readthedocs.org/

Occasional Erroring when pulling documents #16

Closed synsypa closed 9 years ago

synsypa commented 9 years ago

The following error occurs on some document requests. I'm not familiar enough with the machinery of this section to fix it:

    Traceback (most recent call last):
      File "document_pull_012815.py", line 14, in <module>
        s.download_document(case_num, doc_num, doc_link)
      File "/cygdrive/u/My Documents/Projects/Hubbard/Twiqbal/scrape_012815/scraper.py", line 537, in download_document
        output = self.request_document(case_filename, document_link)
      File "/cygdrive/u/My Documents/Projects/Hubbard/Twiqbal/scrape_012815/scraper.py", line 384, in request_document
        document = self.br.post(document_url, data=payload)
      File "/usr/lib/python2.7/site-packages/requests-2.5.1-py2.7.egg/requests/sessions.py", line 504, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "/usr/lib/python2.7/site-packages/requests-2.5.1-py2.7.egg/requests/sessions.py", line 447, in request
        prep = self.prepare_request(req)
      File "/usr/lib/python2.7/site-packages/requests-2.5.1-py2.7.egg/requests/sessions.py", line 378, in prepare_request
        hooks=merge_hooks(request.hooks, self.hooks),
      File "/usr/lib/python2.7/site-packages/requests-2.5.1-py2.7.egg/requests/models.py", line 303, in prepare
        self.prepare_url(url, params)
      File "/usr/lib/python2.7/site-packages/requests-2.5.1-py2.7.egg/requests/models.py", line 360, in prepare_url
        "Perhaps you meant http://{0}?".format(url))
    requests.exceptions.MissingSchema: Invalid URL u'/cgi-bin/show_multidocs.pl?caseid=102578&arr_de_seq_nums=8&magic_num=&pdf_header=&hdr=&pdf_toggle_possible=&caseid=102578&zipit=&magic_num=&arr_de_seq_nums=8&got_warning=&create_roa=&create_appendix=&bates_format=&dkt=': No schema supplied. Perhaps you meant http:///cgi-bin/show_multidocs.pl?caseid=102578&arr_de_seq_nums=8&magic_num=&pdf_header=&hdr=&pdf_toggle_possible=&caseid=102578&zipit=&magic_num=&arr_de_seq_nums=8&got_warning=&create_roa=&create_appendix=&bates_format=&dkt=?

zhangchuck commented 9 years ago

It looks like the URL for the documents is occasionally a relative link rather than an absolute link?

e.g., usually it's http://yahoo.com/query/ but sometimes it's just "/query/"
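
If that is the cause, one way to handle it is to resolve the link against the court's base URL before posting. The following is only a rough sketch, not the library's actual code: the helper name `resolve_document_link` and the court id `"ilnd"` are made up for illustration, and the host pattern `https://ecf.<court>.uscourts.gov` is assumed from how these links look elsewhere in this thread. It is written for Python 3 (`urllib.parse`); on the Python 2.7 setup in the traceback, the same functions live in the `urlparse` module.

```python
from urllib.parse import urljoin, urlsplit


def resolve_document_link(doc_link, court_short_id):
    """Return an absolute URL for doc_link (hypothetical helper).

    Links that already carry a scheme (http/https) are returned unchanged;
    relative links such as "/cgi-bin/..." are joined onto the court's host.
    """
    if urlsplit(doc_link).scheme:
        # Already absolute, nothing to do.
        return doc_link
    # Assumed host pattern for a court's ECF site.
    base_url = "https://ecf.{0}.uscourts.gov".format(court_short_id)
    return urljoin(base_url, doc_link)


# "ilnd" is only an example court id.
print(resolve_document_link("/cgi-bin/show_multidocs.pl?caseid=102578", "ilnd"))
```

Checking for a parsed scheme rather than for the substring "http" avoids misclassifying a relative link that happens to contain "http" somewhere in its query string.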

synsypa commented 9 years ago

Added handling for relative links in download_document:

    if "http" in doc_link:
        document_link = doc_link
    else:
        document_link = ("https://ecf." + court_short_id + ".uscourts.gov"