MuckRock / documentcloud-frontend

DocumentCloud's front end source code - Please report bugs, issues and feature requests to info@documentcloud.org
https://www.documentcloud.org
GNU Affero General Public License v3.0

Document Cache affects OCR plaintext and sidebar generation #292

Closed duckduckgrayduck closed 1 year ago

duckduckgrayduck commented 1 year ago

You can test and replicate on this document: https://www.documentcloud.org/documents/23962135-p543161_muckrock_news_clear_data_foia_letterdoc which has the following OCR JSON results: https://s3.documentcloud.org/documents/23962135/p543161_muckrock_news_clear_data_foia_letterdoc.txt.json

If you view the document now, while it is public, it shows OCR: Textract and displays the plaintext from the old, cached OCR results.

Cache2

If you then change the document to private, the document updates and shows the correct OCR: Azure Document Intelligence, along with updated plaintext (notice that the word Mayor is now above the line Department of Police, City of Chicago).

Cache1

However, if I change the document back to public, the old cache persists and it switches back to OCR: Textract.
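For anyone trying to confirm that the stale text comes from a cached copy of the public OCR JSON, one rough check is to request the same URL with a throwaway query parameter and compare the two responses. The sketch below only builds such a URL. The helper name `with_cache_buster` and the parameter name `_cb` are made up for illustration, and it assumes the cache in front of the file treats a different query string as a different object, which may not be true here.

```python
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url: str, token: Optional[str] = None, param: str = "_cb") -> str:
    """Return `url` with a throwaway query parameter added.

    `param` is a hypothetical name; any key the server ignores would do.
    An existing value for `param` is replaced rather than duplicated.
    """
    parts = urlsplit(url)
    # Keep every existing query pair except a previous cache-buster.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    # Default to the current Unix time so each call yields a fresh URL.
    query.append((param, token if token is not None else str(int(time.time()))))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    ocr_json = (
        "https://s3.documentcloud.org/documents/23962135/"
        "p543161_muckrock_news_clear_data_foia_letterdoc.txt.json"
    )
    print(with_cache_buster(ocr_json))
```

If the response for the modified URL differs from the plain one, that would point to a cached copy being served for the public document.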

duckduckgrayduck commented 1 year ago

It is fixed with #298 :)