arderyp / scotuswebcites

United States Supreme Court web citation discovery, presentation, and validation
GNU General Public License v3.0

Fixed url does not change url status #41

Closed arderyp closed 8 years ago

arderyp commented 8 years ago

Before thorough testing, make sure that the problem is not HTTPS, and me failing to handle that in validation...

{{FAIL}} OPINION: http://www.supremecourt.gov/opinions/12pdf/12-207_d18e.pdf CITATION: http://www.denverda.org/DNA_Documents/Denver%27s%20Preventable%20Crimes%20Study.pdf

This is improperly scraped to "http://www.denverda.org/DNA_Documents/". When you fix it, it should go from a black "unavailable" link to a green "available" link, and be archived. This isn't happening.

{{FAIL}} Same with "http://www.denverda.org/DNA_Documents/MarylandDNAarresteestudy.pdf" citation, which is inaccurately scraped, but fixed url is available.

{{FAIL}} OPINION: http://www.supremecourt.gov/opinions/12pdf/12-71_7l48.pdf CITATION: (second, inaccurately scraped) http://avalon.law.yale.edu/18th_century/ratny.asp

{{FAIL}} EDIT: re above, the status probably does not affect this. Here is an example where I changed an unavailable bad scrape to an available good url, without changing the status from "good scrape", and the status remained unavailable: OPINION: http://www.supremecourt.gov/opinions/13pdf/12-682_8759.pdf CITATION: http://www.ro.umich.edu/report/12enrollmentsummary.pdf.16

{{FAIL}} adding valid ".aspx" extension to end did not change status to available or capture. OPINION: http://www.supremecourt.gov/opinions/14pdf/14-7955_aplc.pdf CITATION: http://www.law.umich.edu/special/exoneration/Pages/about.aspx

{{FAIL}} fixing string broken over multiple changes did not change to available and capture: OPINION: http://www.supremecourt.gov/opinions/14pdf/14-7955_aplc.pdf CITATION: http://www.deathpenaltyinfo.org/node/5741

{{FAIL}} fixed url didn't change status and archive: OPINION: http://www.supremecourt.gov/opinions/14pdf/14-7955_aplc.pdf CITATION: https://news.vice.com/article/un-vote-against-death-penalty-highlights-global-abolitionist-trend-and-leaves-the-us-stranded

{{FAIL}} OPINION: http://www.supremecourt.gov/opinions/15pdf/14-8913_5h25.pdf CITATION: http://www.ussc.gov/sites/default/files/pdf/research-and-publications/annual-reports-and-sourcebooks/2015/FigureG.pdf


{{SUCCESS}} EXAMPLE OF WORKING CHANGED URL (UNAVAILABLE->REDIRECT): This worked, and I mistakenly left it as "good scrape" instead of changing status to "bad scrape". I changed the status for the above examples. Could this be the issue? OPINION: http://www.supremecourt.gov/opinions/13pdf/12-1371_6b35.pdf CITATION: http://www.ovw.usdoj.gov/domviolence.htm.5

{{SUCCESS}} stripping bad string off the end changed status from unavailable to redirect, and captured resource via API. OPINION: http://www.supremecourt.gov/opinions/14pdf/14-144_758b.pdf CITATION: http://www.txdmv.gov/reports-and-data/doc_download/5050–specialty-plates-revenue-fy-1994-2014

{{SUCCESS}} EDIT: set this one from bad scrape ("https://") to valid url, and it changed from "unavailable" to "redirect" successfully, after changing status to "bad scrape" OPINION: https://plannedparenthoodvolunteer.hire.com/viewjob.html?optlinkview=view-28592&ERFormID=newjoblist&ERFormCode=any CITATION: https://plannedparenthoodvolunteer.hire.com/viewjob.html?optlinkview=view-28592&ERFormID=newjoblist&ERFormCode=any

arderyp commented 8 years ago

It looks like a simple mistake. The default value for the status field is "a", which is assumed but never set explicitly in code when a new citation is discovered and doesn't fall into the 4XX or 3XX checks. So any citation that is scraped and given a "u" status, and whose fixed url later validates with a 2XX response, will not have its status corrected, because there is no explicit code to set "a" for valid resources. It's a bad/lazy approach that should be fixed.
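A rough sketch of what the fix could look like, not the project's actual code: every response range sets the status explicitly, so a 2XX result overwrites a stale "u" instead of relying on the field default. Only "a" and "u" come from this issue; the "r" value for redirects, the `status_for_response` helper, and the `Citation` stand-in class are all hypothetical names used for illustration.

```python
# Status letters: "a" and "u" are from the issue; "r" is an assumed value.
AVAILABLE = "a"
UNAVAILABLE = "u"
REDIRECT = "r"


def status_for_response(http_code):
    """Map an HTTP response code to a citation status.

    Every branch returns a value, so nothing depends on the field default.
    """
    if 200 <= http_code < 300:
        return AVAILABLE
    if 300 <= http_code < 400:
        return REDIRECT
    # 4XX, 5XX, and anything unexpected are treated as unavailable here.
    return UNAVAILABLE


class Citation:
    """Hypothetical stand-in for the citation model, not the real class."""

    def __init__(self, url, status=AVAILABLE):
        self.url = url
        self.status = status

    def revalidate(self, new_url, http_code):
        """Store the corrected url and always reassign the status."""
        self.url = new_url
        self.status = status_for_response(http_code)


# A badly scraped citation that was marked unavailable...
citation = Citation("http://example.com/docs/", status=UNAVAILABLE)
# ...is fixed by hand, and the fixed url answers with a 200.
citation.revalidate("http://example.com/docs/report.pdf", 200)
print(citation.status)
```

With the status always reassigned, the FAIL cases above (unavailable scrape fixed to a url that returns 2XX) should flip to available, while the SUCCESS cases keep working since they already hit the redirect branch.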