freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Opinion document content has not been extracted #3811

Open grossir opened 9 months ago

grossir commented 9 months ago

Using the dev DB, I identified 10197 opinions that have a download_url but no extracted content. These opinions had a valid download URL, but none of the following content fields populated: plain_text, html, html_lawbox, html_columbia, html_anon_2020.

At the end of the issue I have copied the query that picks up the list of ids for correcting this issue. A custom script should be written to reprocess our backup files.
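One possible shape for that reprocessing script, sketched with assumptions: doctor running locally at 127.0.0.1:8000, rows shaped like the output of the query at the end of this issue, and a caller that supplies the backup bytes. None of these function names come from CL's actual codebase.

```python
# Sketch of a reprocessing script for the affected opinions. All names
# except the query's field names are invented for illustration.

CONTENT_FIELDS = (
    "plain_text", "html", "html_lawbox", "html_columbia", "html_anon_2020",
)
DOCTOR = "http://127.0.0.1:8000"  # assumption: local doctor instance

def needs_reprocessing(row: dict) -> bool:
    """Mirror the SQL filter at the end of the issue: a download_url
    exists but every content field is empty."""
    return bool(row.get("download_url")) and all(
        not row.get(field) for field in CONTENT_FIELDS
    )

def reextract(local_path: str, backup_bytes: bytes) -> dict:
    """Send the backed-up bytes back through doctor's extraction view."""
    import requests  # imported here so the pure helper above has no deps

    r = requests.post(
        f"{DOCTOR}/extract/doc/text/",
        files={"file": (local_path, backup_bytes)},
    )
    r.raise_for_status()
    return r.json()  # inspect the response for the extracted text
```

The real script would write the extracted text back to the matching content field on each opinion; that part depends on CL internals and is omitted here.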

Causes

As for the cause of the failed extraction, it depends on the source:

Bulk insertion scripts?

99% (3028) of the bia errors happened on a single day, 2022-01-13, which makes me think it was an ingestion script error. The documents did have extractable text inside.

Extensions we don't extract

doctor extracts text from the document on the extract/doc/text view. It accepts the following extensions only:

The problem is that sometimes the utils/file/extension/ view assigns the wrong extension to a given binary content, causing the text not to be extracted downstream.

Assigned extensions:

| prefix | n |
| --- | ---: |
| pdf/ | 5043 |
| html/ | 1957 |
| txt/ | 1694 |
| bin/ | 1172 |
| wpd/ | 139 |
| p/ | 112 |
| mp3/ | 33 |
| doc/ | 17 |
| docx/ | 12 |
| wsdl/ | 11 |
| mp4/ | 4 |
| js/ | 1 |
| obj/ | 1 |
| NA | 1 |
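Since the utils/file/extension/ view can identify the bytes correctly even when the stored path says .bin (see the manual test in the next section), one mitigation would be to re-detect the extension and rename the upload before extraction. A sketch; `fixed_filename` and the handling of the endpoint's response body are assumptions, not CL code:

```python
import pathlib

BASE = "http://127.0.0.1:8000"  # assumption: local doctor instance

def fixed_filename(local_path: str, detected_ext: str) -> str:
    """Swap a stored path's extension (e.g. .bin) for the detected one."""
    return str(pathlib.PurePosixPath(local_path).with_suffix(detected_ext))

def extract_with_detected_extension(local_path: str, content: bytes) -> dict:
    import requests  # imported here so fixed_filename stays dependency-free

    files = {"file": (local_path, content)}
    # Ask doctor what the bytes really are before trusting the stored path;
    # the response body is assumed to be the bare extension, e.g. ".pdf".
    detected = requests.post(f"{BASE}/utils/file/extension/", files=files).text
    fixed = {"file": (fixed_filename(local_path, detected), content)}
    return requests.post(f"{BASE}/extract/doc/text/", files=fixed).json()
```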

bin

For ncctapp, moctapp, nc, and mo, all the empty documents have local paths starting with bin/. Other sources with bin errors, though not at 100%, are nyappdiv and ca7. Even when a local inspection shows they are PDFs, doctor is not prepared to extract .bin files and will return the error 'Unable to extract content due to unknown extension'. Example:

Curiously, when testing that example manually, doctor returns a .pdf extension:

```python
import requests

base = "http://127.0.0.1:8000"
url = f"{base}/extract/doc/text/"
filename = "bin/2021/07/06/state_v._medlin.bin"
file_content = requests.get(
    "https://www.justice.gov/sites/default/files/eoir/legacy/2012/08/27/1472.pdf"
).content
files = {"file": (filename, file_content)}
r = requests.post(url, files=files)
r.json()
# returns 'Unable to extract content due to unknown extension'

url_extension = f"{base}/utils/file/extension/"
r = requests.post(url_extension, files=files)
# returns ".pdf"
```

p

The next biggest unrecognized extension, with 112 documents: 69 in texapp, 27 in bva, 15 in illappct, 1 in cadc.

Extensions we do extract, but fail

These are the majority of the opinions without text, mainly html and pdf. They fail silently thanks to this line, which logs a warning instead of an error.
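A minimal illustration of that failure mode (function and logger names are invented, not CL's actual code): because the handler logs at warning level and returns empty content instead of raising, the pipeline saves the opinion with no text and nothing ever surfaces as an error.

```python
import logging

logger = logging.getLogger("extraction")

def extract_or_empty(content: bytes) -> str:
    """Stand-in for the extraction call described above: on failure it
    logs a warning and returns empty text, so the caller happily saves
    an opinion with no content."""
    try:
        return content.decode("utf-8")  # placeholder for the real doctor call
    except UnicodeDecodeError:
        logger.warning("Extraction failed; returning empty content")
        return ""
```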

A quick test against doctor gives me a 500 error for one of the HTMLs. This (or a similar one) is a current issue in doctor's Sentry. I would expect a higher volume of errors, though.

```python
import requests

base = "http://127.0.0.1:8000"
url = f"{base}/extract/doc/text/"

file_content = requests.get(
    "https://storage.courtlistener.com/html/2022/04/19/tay_v._green_2.html"
).content
filename = "html/2022/04/19/tay_v._green_2.html"
files = {"file": (filename, file_content)}

r = requests.post(url, files=files)
# raises in doctor:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 8232: invalid start byte
```
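The failing byte 0xb6 is the pilcrow sign (¶) in Latin-1/CP1252, which suggests the file simply isn't UTF-8. A hedged sketch of a fallback decode (whether doctor should guess encodings this way is a separate design question):

```python
def decode_html(raw: bytes) -> str:
    """Try strict UTF-8 first, then fall back to CP1252, where 0xb6 is '¶'."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A real fix might use a charset detector such as charset-normalizer.
        return raw.decode("cp1252", errors="replace")
```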

txt

1693 of the 1720 bva errors are txt files

Other causes

Some of the other top sources may have an explanation. The courts with a lower volume of errors may be due to a one-time failure of the doctor service. However, there seems to be a distinct increase of errors starting in late 2018 (see the attached chart of error counts over time).

Counts by court

Most contentless records are clustered in a few sources.

| court_id | count | first_date_created | last_date_created |
| --- | ---: | --- | --- |
| bia | 3039 | 2021-12-31 | 2022-01-27 |
| bva | 1720 | 2015-08-21 | 2021-11-03 |
| okla | 1046 | 2022-04-19 | 2023-10-24 |
| texapp | 506 | 2014-09-03 | 2023-10-14 |
| oklacivapp | 374 | 2022-04-14 | 2023-10-19 |
| ncctapp | 373 | 2021-04-21 | 2022-04-19 |
| moctapp | 371 | 2021-04-21 | 2022-03-29 |
| arizctapp | 332 | 2014-08-01 | 2022-06-02 |
| nyappdiv | 312 | 2018-12-26 | 2023-09-20 |
| oklacrimapp | 228 | 2022-04-14 | 2023-09-21 |
| nc | 216 | 2021-04-21 | 2022-03-18 |
| del | 162 | 2018-12-27 | 2020-07-27 |
| ca7 | 157 | 2018-12-26 | 2023-09-07 |
| pasuperct | 138 | 2018-03-15 | 2022-08-09 |
| nyappterm | 111 | 2018-12-27 | 2020-06-29 |
| ohioctapp | 74 | 2018-12-26 | 2023-06-13 |
| michctapp | 73 | 2019-05-08 | 2022-05-13 |
| ca6 | 65 | 2018-12-26 | 2021-07-26 |
| ca4 | 63 | 2018-12-26 | 2019-06-20 |
| texcrimapp | 50 | 2016-09-29 | 2022-05-30 |
| illappct | 48 | 2015-10-22 | 2022-07-28 |
| neb | 39 | 2019-05-03 | 2022-04-15 |
| ca9 | 38 | 2013-06-17 | 2023-08-24 |
| mont | 38 | 2021-01-08 | 2023-08-01 |
| mo | 36 | 2021-04-21 | 2022-03-15 |
| ca5 | 35 | 2013-12-11 | 2020-01-31 |
| uscfc | 32 | 2018-12-26 | 2022-10-07 |
| fla | 29 | 2019-02-11 | 2019-02-11 |
| nev | 28 | 2018-12-27 | 2022-06-02 |
| pa | 25 | 2017-04-04 | 2019-11-26 |
| ca10 | 24 | 2013-03-01 | 2022-07-29 |
| calctapp | 24 | 2018-12-26 | 2023-09-19 |
| dcd | 23 | 2018-06-12 | 2023-08-04 |
| ca11 | 20 | 2018-12-26 | 2023-08-29 |
| ca8 | 20 | 2018-12-26 | 2019-06-20 |
| mich | 20 | 2019-06-06 | 2019-06-20 |
| ohio | 17 | 2018-12-26 | 2022-05-25 |
| sd | 16 | 2018-12-27 | 2023-09-14 |
| washctapp | 16 | 2018-12-27 | 2020-06-01 |
| ca3 | 14 | 2018-12-26 | 2019-06-20 |
| delsuperct | 14 | 2018-12-27 | 2023-08-18 |
| tenncrimapp | 14 | 2018-12-26 | 2022-09-21 |
| gactapp | 13 | 2018-12-27 | 2022-06-02 |
| nysupct | 13 | 2018-12-27 | 2023-07-13 |
| tennctapp | 12 | 2018-12-26 | 2022-09-06 |
| ca2 | 10 | 2018-12-26 | 2019-06-20 |

There are more courts with fewer than 10 errors each.

The query used

```sql
SELECT
    id AS opinion_id,
    cluster_id,
    docket_id,
    court_id,
    date_filed,
    date_created,
    docket_number,
    case_name,
    download_url,
    sha1,
    local_path,
    extracted_by_ocr
FROM
    search_opinion
INNER JOIN
    (
        SELECT id AS cluster_id, docket_id, date_filed
        FROM search_opinioncluster
    ) cluster
    USING (cluster_id)
INNER JOIN
    (
        SELECT id AS docket_id, court_id, source, docket_number, case_name
        FROM search_docket
    ) docket
    USING (docket_id)
WHERE
    download_url <> ''
    AND plain_text = '' AND html = '' AND html_lawbox = ''
    AND html_columbia = '' AND html_anon_2020 = '';
```
mlissner commented 9 months ago

Hm, FWIW, it looks like Doctor was created in Sept. 2020, so that doesn't explain the uptick in errors around 2018. I'm trying to remember if we had major architectural changes around 2018, but I'm forgetting the history. Before doctor, I think we just had the document extraction code in CL itself, as a monolith.

Hm, not sure what triggered this, but I guess it doesn't matter. It'll be good to get this cleaned up and to prevent it from happening in the future.

sentry-io[bot] commented 9 months ago

Sentry Issue: COURTLISTENER-6S9

grossir commented 5 months ago

Ran across specific mo opinions without text. Some examples: 1, 2, 3. The original and backup files do have text