freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Add Oregon Tax Court scraper `ortc` #1203

Closed: grossir closed this issue 1 month ago

grossir commented 1 month ago

It would use the same scraper as the recently merged `or` scraper. We have data for `ortc` up to December 14th, 2011, so we would ingest 1969 opinions from then to today, and we will get the current opinions on a regular basis.

Once the PR is merged, we need to tick the `has_opinion_scraper` flag: https://www.courtlistener.com/admin/search/court/ortc/change/

Command to backscrape:

manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ortc --backscrape-start 2011/12/14  --verbosity 3
grossir commented 1 month ago

The backscraper has been run; we now have 1533 opinions for the mentioned time period. Note that there are duplicates, for example:

INFO Duplicate found on date: 2024-09-27, with lookup value: 7352946c76a21b1ce4f0bd0d215c0da3d9424a34
INFO Duplicate found on date: 2024-10-03, with lookup value: d1676b0110227c73ee3f02f6b693adc01da30493
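For readers unfamiliar with this kind of log line: the lookup values above are 40-character hex strings, which is consistent with a SHA-1 digest of the downloaded file. The sketch below illustrates content-hash deduplication under that assumption. `sha1_hex` and `split_duplicates` are hypothetical names for illustration, not the actual CourtListener code.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """Return the 40-character hexadecimal SHA-1 digest of the raw bytes."""
    return hashlib.sha1(content).hexdigest()


def split_duplicates(documents, seen=None):
    """Partition (date, content) pairs into new and duplicate entries.

    `seen` is an optional set of digests already stored (assumption: the
    real system would look these up in its database instead).
    """
    seen = set() if seen is None else seen
    new, duplicates = [], []
    for date, content in documents:
        digest = sha1_hex(content)
        if digest in seen:
            # Same bytes were already ingested, possibly under another date.
            duplicates.append((date, digest))
        else:
            seen.add(digest)
            new.append((date, digest))
    return new, duplicates
```

With a check like this, a document republished on a later date with identical bytes is skipped rather than ingested twice, which would explain duplicate notices for recent dates.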

These 4 documents failed to download:

https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll6/id/3584/download
https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll6/id/7256/download
https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll6/id/7442/download
https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll6/id/4183/download

./manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ortc --backscrape-start 2012/03/25 --backscrape-end 2012/03/27  --verbosity 3

./manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ortc --backscrape-start 2012/12/30 --backscrape-end 2013/01/01  --verbosity 3

./manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ortc --backscrape-start 2016/09/12 --backscrape-end 2016/09/14  --verbosity 3

./manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ortc --backscrape-start 2018/02/01 --backscrape-end 2018/02/03  --verbosity 3
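Each of the four commands above covers a three-day window. As a rough sketch, commands of this shape could be generated from a middle date plus a one-day margin on each side. `backscrape_command` is a hypothetical helper, and the middle dates are inferred from the windows, not stated in this thread.

```python
from datetime import date, timedelta

COURT_MODULE = "juriscraper.opinions.united_states.state.ortc"


def backscrape_command(center: date, margin_days: int = 1) -> str:
    """Build a backscrape command covering center +/- margin_days."""
    start = center - timedelta(days=margin_days)
    end = center + timedelta(days=margin_days)
    return (
        "./manage.py cl_back_scrape_opinions "
        f"--courts {COURT_MODULE} "
        f"--backscrape-start {start:%Y/%m/%d} "
        f"--backscrape-end {end:%Y/%m/%d} "
        "--verbosity 3"
    )


# Middle dates inferred from the four windows above (assumption).
for center in (date(2012, 3, 26), date(2012, 12, 31),
               date(2016, 9, 13), date(2018, 2, 2)):
    print(backscrape_command(center))
```

Using `timedelta` keeps the windows correct across month and year boundaries, as in the 2012/12/30 to 2013/01/01 case.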

Sentry Issue: COURTLISTENER-8BP

'https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll6/id/3584/download' 'text/html' not in ['application/pdf']
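The Sentry message above suggests the download returned an HTML page where a PDF was expected. A minimal sketch of that kind of guard follows. `check_content_type` is a hypothetical function written for illustration and is not the actual juriscraper or CourtListener check.

```python
def check_content_type(url, received, expected=("application/pdf",)):
    """Raise ValueError if the response's MIME type is not an expected one.

    `received` is the raw Content-Type header value; parameters such as
    "; charset=utf-8" are dropped before comparing.
    """
    mime = received.split(";", 1)[0].strip().lower()
    if mime not in expected:
        # Message shaped like the Sentry report quoted above.
        raise ValueError(f"{url!r} {mime!r} not in {list(expected)!r}")
    return mime
```

Failing loudly here, rather than saving the HTML body as if it were an opinion, leaves a gap that can be filled later by re-running the backscraper over the affected dates.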
grossir commented 1 month ago

Just ran the commands from the previous comment; no gaps left!