arderyp / scotuswebcites

United States Supreme Court web citation discovery, presentation, and validation
GNU General Public License v3.0

Status not updated when verifying? #11

Closed arderyp closed 8 years ago

arderyp commented 8 years ago

I verified http://www.sos.co.us/CRR from 'Direct Marketing Assn. v. Brohl'. When scraped, the citation was a non-404 (status 'a'). The URL appears to no longer be available. When I validated the URL, the date in the Verified column was black instead of blue, which is correct; however, the status did not change from 'a' to 'u'/'r'.

The page sends a Page Not Found error, but no 404 and no URL redirect. Not sure how to detect this with the requests library:

>>> r = requests.get('http://www.sos.co.us/CRR')
>>> r.status_code
200
>>> r.url
u'http://www.sos.co.us/CRR'
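
One possible heuristic for this case, as a rough sketch only: treat a 200 response whose body reads like an error page as a "soft 404". The helper name `looks_like_soft_404` and the phrase list are hypothetical and not part of scotuswebcites; the phrases would need tuning against real pages.

```python
# Hypothetical helper; the phrase list is an assumption and is not exhaustive.
SOFT_404_PHRASES = ("page not found", "404 not found", "no longer available")


def looks_like_soft_404(status_code, body):
    """Return True if a 200 response body reads like an error page."""
    if status_code != 200:
        # Real error codes are already handled by the normal status check.
        return False
    text = body.lower()
    return any(phrase in text for phrase in SOFT_404_PHRASES)
```

With requests this could be called as `looks_like_soft_404(r.status_code, r.text)`. A phrase match is only a hint, since a legitimate page can mention "page not found" too, so it probably should flag the citation for manual review rather than change the status outright.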

Curl will catch the URL redirect that the Verizon ISP executes (same as in my local browser):

curl -I http://www.sos.co.us/CRR
HTTP/1.1 200 OK
Server: nginx/1.0.15
Date: Sun, 21 Jun 2015 23:17:10 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-control: no-cache, no-store
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache

curl http://www.sos.co.us/CRR
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"           
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh"        
content="0;url=http://searchassist.verizon.com/main?   
InterceptSource=0&ClientLocation=us&ParticipantID=*********************&FailureMode=1&
SearchQuery=&FailedURI=http%3A%2F%2Fwww.sos.co.us%2FCRR&AddInType=4&Version=2.1.8-
1.90base&Referer=&Implementation=0&method=GET"/><script 
type="text/javascript">url="http://searchassist.verizon.com/main?
InterceptSource=0&ClientLocation=us&ParticipantID=*********************&FailureMode=1&
SearchQuery=&FailedURI=http%3A%2F%2Fwww.sos.co.us%2FCRR&AddInType=4&Version=2.1.8- 
1.90base&Referer=&Implementation=0&method=GET";if(top.location!=location){var      
w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,
y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url); 
</script></head><body></body></html>
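
The curl output above shows why `r.url` does not change: the redirect lives in a `<meta http-equiv="refresh">` tag in the body, and requests only follows HTTP 3xx redirects. A minimal sketch of pulling the refresh target out of the body with the standard library follows; `MetaRefreshParser` and `find_meta_refresh` are hypothetical names, and the sketch assumes the common `0;url=...` form of the `content` attribute.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Record the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # Assumed form of the content attribute: "0;url=http://example.com/"
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return the meta-refresh target URL found in html, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target
```

Calling `find_meta_refresh(r.text)` on the response would return the searchassist.verizon.com URL here. Comparing that target's host with the host of the cited URL (for example via `urllib.parse.urlsplit`) is one way to flag an off-site redirect, though it would not catch the JavaScript `window.location.replace` fallback on its own.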

arderyp commented 8 years ago

Since this site now returns a 404, there is no way to test this. Closing.