freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Opinion document content has not been extracted #3811

Open grossir opened 9 months ago

grossir commented 9 months ago

Using the dev DB, I identified 10197 opinions that have a download_url but no extracted content. These opinions had a valid download URL, but none of the following content fields populated: plain_text, html, html_lawbox, html_columbia, html_anon_2020.

At the end of the issue I have copied the query that picks up the list of ids for correcting this issue. A custom script should be written to reprocess our backup files.
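One possible shape for that reprocessing script, sketched with assumptions: doctor running locally at 127.0.0.1:8000, rows shaped like the output of the query at the end of this issue, and a caller that supplies the backup bytes. None of these function names come from CL's actual codebase.

```python
# Sketch of a reprocessing script for the affected opinions. All names
# except the query's field names are invented for illustration.

CONTENT_FIELDS = (
    "plain_text", "html", "html_lawbox", "html_columbia", "html_anon_2020",
)
DOCTOR = "http://127.0.0.1:8000"  # assumption: local doctor instance

def needs_reprocessing(row: dict) -> bool:
    """Mirror the SQL filter at the end of the issue: a download_url
    exists but every content field is empty."""
    return bool(row.get("download_url")) and all(
        not row.get(field) for field in CONTENT_FIELDS
    )

def reextract(local_path: str, backup_bytes: bytes) -> dict:
    """Send the backed-up bytes back through doctor's extraction view."""
    import requests  # imported here so the pure helper above has no deps

    r = requests.post(
        f"{DOCTOR}/extract/doc/text/",
        files={"file": (local_path, backup_bytes)},
    )
    r.raise_for_status()
    return r.json()  # inspect the response for the extracted text
```

The real script would write the extracted text back to the matching content field on each opinion; that part depends on CL internals and is omitted here.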

Causes

As for the cause of the failed extraction, it depends on the source:

Bulk insertion scripts?

99% (3028) of the bia errors happened on a single day, 2022-01-13, which makes me think it was an ingestion script error. The documents did have extractable text inside.

Extensions we don't extract

doctor extracts text from the document on the extract/doc/text view. It accepts the following extensions only:

The problem is that sometimes the utils/file/extension/ view assigns the wrong extension to a given binary content, causing the text not to be extracted downstream.

Assigned extensions:

| prefix | n |
| --- | ---: |
| pdf/ | 5043 |
| html/ | 1957 |
| txt/ | 1694 |
| bin/ | 1172 |
| wpd/ | 139 |
| p/ | 112 |
| mp3/ | 33 |
| doc/ | 17 |
| docx/ | 12 |
| wsdl/ | 11 |
| mp4/ | 4 |
| js/ | 1 |
| obj/ | 1 |
| NA | 1 |
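Since the utils/file/extension/ view can identify the bytes correctly even when the stored path says .bin (see the manual test in the next section), one mitigation would be to re-detect the extension and rename the upload before extraction. A sketch; `fixed_filename` and the handling of the endpoint's response body are assumptions, not CL code:

```python
import pathlib

BASE = "http://127.0.0.1:8000"  # assumption: local doctor instance

def fixed_filename(local_path: str, detected_ext: str) -> str:
    """Swap a stored path's extension (e.g. .bin) for the detected one."""
    return str(pathlib.PurePosixPath(local_path).with_suffix(detected_ext))

def extract_with_detected_extension(local_path: str, content: bytes) -> dict:
    import requests  # imported here so fixed_filename stays dependency-free

    files = {"file": (local_path, content)}
    # Ask doctor what the bytes really are before trusting the stored path;
    # the response body is assumed to be the bare extension, e.g. ".pdf".
    detected = requests.post(f"{BASE}/utils/file/extension/", files=files).text
    fixed = {"file": (fixed_filename(local_path, detected), content)}
    return requests.post(f"{BASE}/extract/doc/text/", files=fixed).json()
```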

bin

For ncctapp, moctapp, nc, and mo, all the empty documents have local paths starting with bin/. Other sources with bin errors, though not at 100%, are nyappdiv and ca7. Even when a local inspection shows they are PDFs, doctor is not prepared to extract .bin files and will return the error 'Unable to extract content due to unknown extension'. Example:

Curiously, when testing that example manually, doctor returns a .pdf extension:

```python
import requests

base = "http://127.0.0.1:8000"
url = f"{base}/extract/doc/text/"
filename = "bin/2021/07/06/state_v._medlin.bin"
file_content = requests.get(
    "https://www.justice.gov/sites/default/files/eoir/legacy/2012/08/27/1472.pdf"
).content
files = {"file": (filename, file_content)}
r = requests.post(url, files=files)
r.json()
# returns 'Unable to extract content due to unknown extension'

url_extension = f"{base}/utils/file/extension/"
r = requests.post(url_extension, files=files)
# returns ".pdf"
```

p

The next biggest unrecognized extension, with 112 documents: 69 in texapp, 27 in bva, 15 in illappct, 1 in cadc.

Extensions we do extract, but fail

These are the majority of the opinions without text, mainly html and pdf. They fail silently thanks to this line, which logs a warning instead of an error.
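A minimal illustration of that failure mode (function and logger names are invented, not CL's actual code): because the handler logs at warning level and returns empty content instead of raising, the pipeline saves the opinion with no text and nothing ever surfaces as an error.

```python
import logging

logger = logging.getLogger("extraction")

def extract_or_empty(content: bytes) -> str:
    """Stand-in for the extraction call described above: on failure it
    logs a warning and returns empty text, so the caller happily saves
    an opinion with no content."""
    try:
        return content.decode("utf-8")  # placeholder for the real doctor call
    except UnicodeDecodeError:
        logger.warning("Extraction failed; returning empty content")
        return ""
```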

A quick test against doctor gives me a 500 error for one of the HTMLs. This (or a similar one) is a current issue in doctor's Sentry. I would expect a higher volume of errors, though.

```python
import requests

base = "http://127.0.0.1:8000"
url = f"{base}/extract/doc/text/"

file_content = requests.get(
    "https://storage.courtlistener.com/html/2022/04/19/tay_v._green_2.html"
).content
filename = "html/2022/04/19/tay_v._green_2.html"
files = {"file": (filename, file_content)}

r = requests.post(url, files=files)
# raises in doctor:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 8232: invalid start byte
```
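The failing byte 0xb6 is the pilcrow sign (¶) in Latin-1/CP1252, which suggests the file simply isn't UTF-8. A hedged sketch of a fallback decode (whether doctor should guess encodings this way is a separate design question):

```python
def decode_html(raw: bytes) -> str:
    """Try strict UTF-8 first, then fall back to CP1252, where 0xb6 is '¶'."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A real fix might use a charset detector such as charset-normalizer.
        return raw.decode("cp1252", errors="replace")
```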

txt

1693 of the 1720 bva errors are txt files

Other causes

Some of the other top sources may have an explanation. The courts with a lower volume of errors may be due to a one-time failure of the doctor service. However, there seems to be a distinct increase of errors starting in late 2018 (see the attached chart of error counts over time).

Counts by court

Most contentless records are clustered in a few sources.

| court_id | count | first_date_created | last_date_created |
| --- | ---: | --- | --- |
| bia | 3039 | 2021-12-31 | 2022-01-27 |
| bva | 1720 | 2015-08-21 | 2021-11-03 |
| okla | 1046 | 2022-04-19 | 2023-10-24 |
| texapp | 506 | 2014-09-03 | 2023-10-14 |
| oklacivapp | 374 | 2022-04-14 | 2023-10-19 |
| ncctapp | 373 | 2021-04-21 | 2022-04-19 |
| moctapp | 371 | 2021-04-21 | 2022-03-29 |
| arizctapp | 332 | 2014-08-01 | 2022-06-02 |
| nyappdiv | 312 | 2018-12-26 | 2023-09-20 |
| oklacrimapp | 228 | 2022-04-14 | 2023-09-21 |
| nc | 216 | 2021-04-21 | 2022-03-18 |
| del | 162 | 2018-12-27 | 2020-07-27 |
| ca7 | 157 | 2018-12-26 | 2023-09-07 |
| pasuperct | 138 | 2018-03-15 | 2022-08-09 |
| nyappterm | 111 | 2018-12-27 | 2020-06-29 |
| ohioctapp | 74 | 2018-12-26 | 2023-06-13 |
| michctapp | 73 | 2019-05-08 | 2022-05-13 |
| ca6 | 65 | 2018-12-26 | 2021-07-26 |
| ca4 | 63 | 2018-12-26 | 2019-06-20 |
| texcrimapp | 50 | 2016-09-29 | 2022-05-30 |
| illappct | 48 | 2015-10-22 | 2022-07-28 |
| neb | 39 | 2019-05-03 | 2022-04-15 |
| ca9 | 38 | 2013-06-17 | 2023-08-24 |
| mont | 38 | 2021-01-08 | 2023-08-01 |
| mo | 36 | 2021-04-21 | 2022-03-15 |
| ca5 | 35 | 2013-12-11 | 2020-01-31 |
| uscfc | 32 | 2018-12-26 | 2022-10-07 |
| fla | 29 | 2019-02-11 | 2019-02-11 |
| nev | 28 | 2018-12-27 | 2022-06-02 |
| pa | 25 | 2017-04-04 | 2019-11-26 |
| ca10 | 24 | 2013-03-01 | 2022-07-29 |
| calctapp | 24 | 2018-12-26 | 2023-09-19 |
| dcd | 23 | 2018-06-12 | 2023-08-04 |
| ca11 | 20 | 2018-12-26 | 2023-08-29 |
| ca8 | 20 | 2018-12-26 | 2019-06-20 |
| mich | 20 | 2019-06-06 | 2019-06-20 |
| ohio | 17 | 2018-12-26 | 2022-05-25 |
| sd | 16 | 2018-12-27 | 2023-09-14 |
| washctapp | 16 | 2018-12-27 | 2020-06-01 |
| ca3 | 14 | 2018-12-26 | 2019-06-20 |
| delsuperct | 14 | 2018-12-27 | 2023-08-18 |
| tenncrimapp | 14 | 2018-12-26 | 2022-09-21 |
| gactapp | 13 | 2018-12-27 | 2022-06-02 |
| nysupct | 13 | 2018-12-27 | 2023-07-13 |
| tennctapp | 12 | 2018-12-26 | 2022-09-06 |
| ca2 | 10 | 2018-12-26 | 2019-06-20 |

There are more courts with fewer than 10 errors each.

The query used

```sql
SELECT
    id AS opinion_id,
    cluster_id,
    docket_id,
    court_id,
    date_filed,
    date_created,
    docket_number,
    case_name,
    download_url,
    sha1,
    local_path,
    extracted_by_ocr
FROM
    search_opinion
INNER JOIN
    (
        SELECT id AS cluster_id, docket_id, date_filed
        FROM search_opinioncluster
    ) cluster
    USING (cluster_id)
INNER JOIN
    (
        SELECT id AS docket_id, court_id, source, docket_number, case_name
        FROM search_docket
    ) docket
    USING (docket_id)
WHERE
    download_url <> ''
    AND plain_text = '' AND html = '' AND html_lawbox = ''
    AND html_columbia = '' AND html_anon_2020 = '';
```
mlissner commented 9 months ago

Hm, FWIW, it looks like Doctor was created in Sept. 2020, so that doesn't explain the uptick in errors around 2018. I'm trying to remember if we had major architectural changes around 2018, but I'm forgetting the history. Before doctor, I think we just had the document extraction code in CL itself, as a monolith.

Hm, not sure what triggered this, but I guess it doesn't matter. It'll be good to get this cleaned up and to prevent it from happening in the future.

sentry-io[bot] commented 9 months ago

Sentry Issue: COURTLISTENER-6S9

grossir commented 5 months ago

Ran across specific mo opinions without text. Some examples: 1, 2, 3. The original and backup files do have text