freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Fill `nc` and `ncctapp` gaps #964

Open grossir opened 3 months ago

grossir commented 3 months ago

Part of #929

On giving this a second look, I notice that we are missing records from the end of the year because of how the scraper builds the URL: `self.url = "http://appellate.nccourts.org/opinions/?c=sc&year=%s" % date.today().year`. This misses opinions near the change of year, which seem to be posted in early January but are listed under the previous year's link.

For example, for 2021 we are missing all `nc` records under the "Mandate: 6 January 2022" section.


This gap could be filled by simply running the current backscraper, which will try to download everything again.
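To illustrate the year-boundary problem, here is a minimal sketch of one possible way to avoid it going forward: also check the previous year's listing during early January. This is not the scraper's actual code. The helper name `year_urls` and the `grace_days` window are hypothetical, and only the URL pattern comes from the scraper.

```python
from datetime import date
from typing import List

# URL pattern taken from the scraper; everything else here is illustrative.
BASE_URL = "http://appellate.nccourts.org/opinions/?c=sc&year=%s"


def year_urls(today: date, grace_days: int = 31) -> List[str]:
    """Return the yearly listing URLs worth checking on `today`.

    Hypothetical helper: `grace_days` is an assumed window after January 1
    during which opinions may still appear under the previous year's page.
    """
    years = [today.year]
    days_into_year = (today - date(today.year, 1, 1)).days
    if days_into_year < grace_days:
        # Early in the year: the previous year's page may still get new entries.
        years.insert(0, today.year - 1)
    return [BASE_URL % year for year in years]


# On 6 January 2022, both the 2021 and 2022 listings would be checked.
urls = year_urls(date(2022, 1, 6))
```

With only `date.today().year`, a run on 6 January 2022 would request the 2022 page and never see opinions filed under the 2021 link.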

nc

Between September 25, 2020 and February 05, 2021 we have 0 documents. We are missing around 50 published opinions from late 2020.

ncctapp

From November 4, 2020 to January 1, 2022 we have 0 documents. There is data in the source for this time period.
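Gaps like the two above could be spotted with a check along these lines. This is only a sketch: it assumes the filing dates we hold for a court are available as a plain list, and `find_gaps` and its `min_gap_days` threshold are hypothetical names, not part of juriscraper or CourtListener.

```python
from datetime import date
from typing import Iterable, List, Tuple


def find_gaps(
    dates: Iterable[date], min_gap_days: int = 60
) -> List[Tuple[date, date]]:
    """Return (last_seen, next_seen) pairs separated by at least `min_gap_days`."""
    ordered = sorted(set(dates))
    gaps = []
    # Compare each date with its successor in chronological order.
    for earlier, later in zip(ordered, ordered[1:]):
        if (later - earlier).days >= min_gap_days:
            gaps.append((earlier, later))
    return gaps


# Illustrative dates only, loosely modeled on the `nc` gap described above.
sample = [date(2020, 9, 20), date(2020, 9, 25), date(2021, 2, 5), date(2021, 2, 12)]
gaps = find_gaps(sample)
```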

grossir commented 1 month ago
docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.nc --backscrape-start=2020 --backscrape-end=2021

docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.ncctapp --backscrape-start=2020 --backscrape-end=2022