I verified http://www.sos.co.us/CRR from 'Direct Marketing Assn. v. Brohl'. When scraped, the citation was a non-404 (status 'a'). The url appears to be no longer be available. When I validated the url, the date in the Verified column was black instead of blue, which is correct--however, the status did not change from 'a' to 'u'/'r'
The page sends a Page Not Found error, but no 404, and no url redirect. Not sure how to detect this with requests library:
>>> r = requests.get('http://www.sos.co.us/CRR')
>>> r.status_code
200
>>> r.url
u'http://www.sos.co.us/CRR'
Curl will catch to URL redirect that Verizon ISP executes (same as in my local browser):
curl -I http://www.sos.co.us/CRR
HTTP/1.1 200 OK
Server: nginx/1.0.15
Date: Sun, 21 Jun 2015 23:17:10 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-control: no-cache, no-store
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
curl http://www.sos.co.us/CRR
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh"
content="0;url=http://searchassist.verizon.com/main?
InterceptSource=0&ClientLocation=us&ParticipantID=*********************&FailureMode=1&
SearchQuery=&FailedURI=http%3A%2F%2Fwww.sos.co.us%2FCRR&AddInType=4&Version=2.1.8-
1.90base&Referer=&Implementation=0&method=GET"/><script
type="text/javascript">url="http://searchassist.verizon.com/main?
InterceptSource=0&ClientLocation=us&ParticipantID=*********************&FailureMode=1&
SearchQuery=&FailedURI=http%3A%2F%2Fwww.sos.co.us%2FCRR&AddInType=4&Version=2.1.8-
1.90base&Referer=&Implementation=0&method=GET";if(top.location!=location){var
w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,
y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url);
</script></head><body></body></html>
I verified http://www.sos.co.us/CRR from 'Direct Marketing Assn. v. Brohl'. When scraped, the citation was a non-404 (status 'a'). The url appears to be no longer be available. When I validated the url, the date in the Verified column was black instead of blue, which is correct--however, the status did not change from 'a' to 'u'/'r'
The page sends a Page Not Found error, but no 404, and no url redirect. Not sure how to detect this with requests library:
Curl will catch to URL redirect that Verizon ISP executes (same as in my local browser):