Open grossir opened 9 months ago
Hm, FWIW, it looks like Doctor was created in Sept. 2020, so that doesn't explain the uptick in errors around 2018. I'm trying to remember if we had major architectural changes around 2018, but I'm forgetting the history. Before doctor, I think we just had the document extraction code in CL itself, as a monolith.
Hm, not sure what triggered this, but I guess it doesn't matter. It'll be good to get this cleaned up and to prevent it from happening in the future.
Sentry Issue: COURTLISTENER-6S9
Using the dev DB, I identified 10197 opinions which have a download_url but no extracted content. Such opinions had a valid download URL, but none of the following content fields:
At the end of the issue I have copied the query to pick up the list of ids for correcting this issue. A custom script should be written to reprocess our backup files.
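As a rough illustration of the selection rule above (not the actual query, which is copied at the end of the issue), here is a plain-Python sketch. The names in CONTENT_FIELDS are placeholders, since the real list of content fields is not reproduced here, and `rows` stands in for records fetched from the DB.

```python
from typing import Iterable, List, Mapping, Sequence

# Placeholder names (assumption): substitute the real content fields.
CONTENT_FIELDS: Sequence[str] = ("content_field_a", "content_field_b")


def is_contentless(
    row: Mapping[str, object],
    content_fields: Sequence[str] = CONTENT_FIELDS,
) -> bool:
    """True when the row has a download_url but every content field is empty."""
    has_url = bool(row.get("download_url"))
    has_content = any(str(row.get(f) or "").strip() for f in content_fields)
    return has_url and not has_content


def contentless_ids(
    rows: Iterable[Mapping[str, object]],
    content_fields: Sequence[str] = CONTENT_FIELDS,
) -> List[object]:
    """Collect the ids of rows that would need to be reprocessed."""
    return [row["id"] for row in rows if is_contentless(row, content_fields)]


# Made-up sample rows, only to exercise the rule.
rows = [
    {"id": 1, "download_url": "https://example.com/a.pdf", "content_field_a": "", "content_field_b": None},
    {"id": 2, "download_url": "https://example.com/b.pdf", "content_field_a": "some text"},
    {"id": 3, "download_url": "", "content_field_a": ""},
]
print(contentless_ids(rows))
```

Only the first sample row qualifies: the second has content, and the third has no download URL.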
Causes
As for the cause of the failed extraction, it depends on the source
Bulk insertion scripts?
99% (3028) of bia errors happen in a single day, 2022-01-13, which makes me think it was an ingestion script error. The documents did have extractable text inside.
Extensions we don't extract
doctor extracts text from the document on the extract/doc/text view. It accepts the following extensions only:
The problem is that it sometimes assigns the wrong extension to a given binary content on the utils/file/extension/ view, causing the text to not be extracted downstream.
Assigned extensions:
bin
For ncctapp and moctapp, nc, mo, all the empty documents have local paths starting with bin/. Other sources with bin errors, but not at 100%, are nyappdiv and ca7. Even when a local inspection shows they are pdfs, doctor is not prepared to extract .bin files and will return an error 'Unable to extract content due to unknown extension'. Example
Curiously, when testing that example manually, doctor returns a .pdf extension.
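One way a reprocessing script could double-check a bin assignment is to look at the leading bytes of the file. The sketch below is an assumption about how such a check might look, not how doctor decides extensions; `sniff_extension` is a hypothetical helper. It relies only on the fact that PDF files carry a `%PDF-` header near the start of the file.

```python
def sniff_extension(content: bytes, assigned: str) -> str:
    """Return 'pdf' when the bytes look like a PDF, else keep the assigned extension.

    The header is searched in the first 1024 bytes rather than at offset 0,
    because some producers prepend a few bytes before the header.
    """
    if b"%PDF-" in content[:1024]:
        return "pdf"
    return assigned


# Made-up byte strings standing in for downloaded files.
print(sniff_extension(b"%PDF-1.7\n%...", "bin"))
print(sniff_extension(b"\x00\x01 opaque data", "bin"))
```

A check like this would let the script send files stored under bin/ to the extraction view with a pdf extension when the content supports it, and leave the rest untouched for manual inspection.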
p
The next big unrecognized extension:
69 in texapp, 27 in bva, 15 in illappct, 1 in cadc
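Counts like the ones above can be tallied from the stored paths. This is a sketch under the assumption that the assigned extension shows up as the first segment of the local path, as it does for bin/; the sample paths and the `(court, local_path)` row shape are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def path_prefix(local_path: str) -> str:
    """First path segment, e.g. 'bin' for 'bin/2020/01/01/doc'."""
    return local_path.split("/", 1)[0]


def count_by_court_and_prefix(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count contentless records per (court, path prefix) pair."""
    return Counter((court, path_prefix(path)) for court, path in rows)


# Made-up rows: (court id, local path).
rows = [
    ("ncctapp", "bin/2020/01/01/a"),
    ("ncctapp", "bin/2020/01/02/b"),
    ("texapp", "p/2019/05/05/c"),
]
counts = count_by_court_and_prefix(rows)
for (court, prefix), n in counts.most_common():
    print(court, prefix, n)
```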
Extensions we do extract, but fail
They are the majority of the opinions without text, mainly html and pdf. These are failing silently thanks to this line, which logs a warning instead of an error.
A quick test against doctor is giving me a 500 error for one of the HTMLs. This (or a similar one) is a current issue in doctor's Sentry. I would expect a higher volume of errors, though.
txt
1693 of the 1720 bva errors are txt files.
Other causes
Some of the other top sources may have an explanation. The courts with a lower volume of errors may be due to a one-time failure of the doctor service. However, there seems to be a distinct increase in errors in late 2018.
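To separate one-off ingestion failures (such as the single-day bia spike) from gradual increases, it may help to compute, per court, how much of its error volume falls on its busiest day. The sketch below assumes rows of `(court, date)` for contentless opinions; the sample counts are illustrative, not the real figures.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def peak_day_share(
    rows: Iterable[Tuple[str, date]],
) -> Dict[str, Tuple[date, int, float]]:
    """For each court, return (peak day, errors on that day, share of court total)."""
    per_court: Dict[str, Counter] = defaultdict(Counter)
    for court, day in rows:
        per_court[court][day] += 1

    result = {}
    for court, days in per_court.items():
        peak_day, peak_count = days.most_common(1)[0]
        result[court] = (peak_day, peak_count, peak_count / sum(days.values()))
    return result


# Illustrative data: one court with a single-day spike, one spread out.
rows = (
    [("bia", date(2022, 1, 13))] * 99
    + [("bia", date(2021, 6, 1))]
    + [("ca7", date(2018, 11, d)) for d in range(1, 5)]
)
for court, (day, n, share) in peak_day_share(rows).items():
    print(court, day, n, f"{share:.0%}")
```

A share close to 100% points at a single bad run, while a low share spread over many days suggests a recurring extraction problem.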
Counts by court
Most contentless records are clustered in a few sources.
There are more courts with fewer than 10 errors.
The query used