ryanelittle opened 3 years ago
The error persists even when supplying single dates.
@ryanelittle Can you share the code or CLI command that is triggering the error?
I am using Site.search_by_date in a custom class. This is my function:
```python
def get_case_numbers(self, county, start_date, end_date):
    self.county = county
    self.start_date = start_date
    self.end_date = end_date
    self.site = Site(self.county)
    self.results = self.site.search_by_date(
        start_date=self.start_date,
        end_date=self.end_date,
    )
    self.case_numbers = []
    for result in self.results:
        self.case_numbers.append(result.number)
    return self.case_numbers
```
@ryanelittle Great. Can you also provide the date ranges you're using? Sounds like it may generally be broken, but I wouldn't mind trying to test with the exact parameters you've tried so far.
@ryanelittle oh, also if you could supply the value stored in self.county, that'll let me replicate your test.
I tried a few. None of them worked. Just tried 'ok_atoka', '2020-03-01', '2020-03-01', did not work.
@ryanelittle The bug appears to be due to the OSCN site now rejecting web requests with the default Python User-Agent supplied by the requests library. This must be new(ish) behavior, since the code was working a few months back when we created it. Anyhow, the site now treats such requests as unauthorized and returns a 403 error page, which does not contain the expected elements and therefore triggers the error we're seeing at the BeautifulSoup layer.
Providing a realistic User-Agent header appears to fix the problem. Updating the code in search.py to send a User-Agent that mimics a real browser should resolve the issue.
In the short term, if you need to press forward on your project, I would just fork and hard-code a User-Agent.
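For anyone who needs the workaround right now, here's a minimal sketch of what hard-coding a User-Agent looks like with requests. The header string and the `fetch` helper are illustrative, not the library's actual code:

```python
import requests

# Any realistic browser User-Agent works; this Chrome string is just an example.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def fetch(url, params=None):
    """GET a page with a browser-like User-Agent so OSCN doesn't return a 403."""
    response = requests.get(url, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text
```

In a fork, the same `headers=` argument would just be added to the existing `requests.get` calls in search.py.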
Thank you for the fix @zstumgoren.
@ryanelittle Sure thing. We'll try to ship a proper release to PyPI containing the bug fix in the near future. We'll leave this ticket open until then. Meantime, thanks for bringing it to our attention!
@zstumgoren I've used fake-useragent (https://pypi.org/project/fake-useragent/) to randomize my User-Agent strings in the past. It might be a good solution so court-scraper doesn't send the same header for everyone who uses it.
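If pulling in fake-useragent is undesirable, the same idea can be sketched with the stdlib by rotating a small hand-maintained pool. The strings below are illustrative placeholders, not values the library ships with:

```python
import random

# A small pool of plausible browser User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Pick a User-Agent at random so repeated requests don't share one header."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

The trade-off versus fake-useragent is that the pool has to be refreshed by hand, but there's no runtime dependency or network fetch of User-Agent data.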
I have been using Court Scraper to scrape OSCN. Counties that do not use DailyFilings will not return a list of case numbers when searching for all case numbers in a given year (start_date = 20TK-1-1, end_date = 20TK-12-31).
Looking in the code, I found this note: "Always limit query to a single filing date, to minimize chances of truncate results." I did not expect this behavior based on the documentation. Could the code be changed to behave the same way as DailyFilings, i.e., when provided a date range, search each date individually and combine the results across the full range?
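For reference, the per-day iteration being requested can be sketched like this; the `daily_dates` helper is hypothetical, not part of court-scraper:

```python
from datetime import date, timedelta

def daily_dates(start_date, end_date):
    """Yield each date from start_date through end_date inclusive,
    so a range query can run as one single-date search per day."""
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)
```

A range search could then loop over `daily_dates(start, end)`, call the existing single-date search for each day, and concatenate the results, which would keep the "single filing date" safeguard while still supporting large ranges.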