freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Fill `tax` gaps #970

Closed grossir closed 13 hours ago

grossir commented 3 months ago

Part of #929

To help solve this, a dynamic backscraper will be implemented.

Regarding the gap: we have 0 documents between November 20th, 2020 and January 26th, 2022. Filtering by those dates on the source returns more than 200 documents (I tried splitting the range in half; each half still has more than 100).
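The range-splitting idea described above can be sketched as a recursive halving of the date range until each chunk is under the source's result cap. This is only an illustration: `split_range`, `count_fn`, and the cap of 100 are hypothetical names, not Juriscraper's actual backscrape API.

```python
from datetime import date, timedelta

def split_range(start: date, end: date, count_fn, cap: int = 100):
    """Recursively halve [start, end] until the source reports fewer
    than `cap` results for each chunk, then yield the chunks in order."""
    if count_fn(start, end) < cap:
        yield (start, end)
        return
    mid = start + (end - start) / 2  # date + timedelta keeps whole days
    yield from split_range(start, mid, count_fn, cap)
    yield from split_range(mid + timedelta(days=1), end, count_fn, cap)

# Stand-in for querying the source's result count (assume ~1 doc per day)
def fake_count(start: date, end: date) -> int:
    return (end - start).days

chunks = list(split_range(date(2020, 11, 19), date(2022, 1, 25), fake_count))
```

Each yielded chunk can then be passed to the backscraper as a `--backscrape-start`/`--backscrape-end` pair.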

grossir commented 1 month ago
```shell
docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions \
  --courts juriscraper.opinions.united_states.federal_special.tax \
  --backscrape-start=11/19/2020 \
  --backscrape-end=01/25/2022
```
grossir commented 1 month ago

There are 279 opinions in total in the source for that time period. After Ramiro ran the command, we have scraped 275.

I haven't found an error on Sentry, which leads me to think they were skipped for some reason in cl_scrape_opinions. However, logger.debug calls won't show on the server (only info and above), so I can't really tell what happened. It also seems that logger.info calls from the scraper file won't show either.
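For reference, the suppression described above is standard Python logging behavior; this is only a sketch of a server-style setup, not CourtListener's actual configuration. A logger set to INFO drops DEBUG records before they ever reach a handler:

```python
import logging

logger = logging.getLogger("cl_scrape_opinions")  # hypothetical logger name
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)  # server-style level: INFO and above only

logger.debug("skipped duplicate opinion")   # suppressed: below INFO
logger.info("scraped 275 of 279 opinions")  # emitted
```

Separate scraper-module loggers can also be silenced if they never propagate to a configured handler, which would explain the missing logger.info output as well.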

Detail of the count on the source:

- 11/19/2020 - 04/01/2021: 92
- 04/01/2021 - 08/01/2021: 98
- 08/01/2021 - 01/25/2022: 89

mlissner commented 1 month ago

Thanks for staying on this. We'll get 'em all!

grossir commented 1 month ago

By manually checking the range 11/19/2020 - 04/01/2021, I found one document the backscraper did not collect: the Memorandum Opinion for the case "Kumar Rajagopalan & Susamma Kumar", dated 11/19/20 (tax links are not permanent). I downloaded it and computed its hash, which does already exist on CourtListener. The same thing was happening for fla in #960, so this may be the blanket reason why we don't get exact counts. I will check a couple more for tax.
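The duplicate check implied above can be sketched as follows. This is an assumption-laden illustration: SHA-1 and the in-memory set of known hashes are stand-ins for whatever hash algorithm and database lookup CourtListener actually uses.

```python
import hashlib

def content_hash(document_bytes: bytes) -> str:
    """Hash the downloaded binary (sha1 assumed for illustration).
    If this hash is already stored, the document is treated as a
    duplicate and skipped, even if its URL changed."""
    return hashlib.sha1(document_bytes).hexdigest()

# Stand-in for the hashes already stored in the database
known_hashes = {content_hash(b"opinion body scraped earlier")}

# The same opinion re-downloaded under a new, non-permanent tax court URL
redownloaded = b"opinion body scraped earlier"
is_duplicate = content_hash(redownloaded) in known_hashes
```

Since the source rotates its links, the same opinion can reappear at a new URL; a hash match like this would silently skip it, which would account for counts that are a few short without any Sentry error.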