Open reefdog opened 9 years ago
To clarify, we don't actually have to regenerate any image or text assets from the original PDF; we just have to regenerate date entities from each doc's text file. The sort of thing we might want to tackle at the same time is https://github.com/documentcloud/documentcloud/pull/143#issuecomment-71495591
As a heads up: the date entity recognizers are regexp-based and written in Ruby. We should be able to reprocess all of the texts/snippets by running code on the workers against text from the database (i.e. S3 needn't be involved in reprocessing these).
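For anyone following along, here is a minimal sketch of what regexp-based date extraction with character offsets can look like. It is not the actual recognizer code: `DATE_PATTERN` and `extract_dates` are hypothetical names, and the pattern only covers two date formats for illustration.

```ruby
# Hypothetical pattern (an assumption, not the real recognizer): matches only
# ISO-style (2015-05-12) and US-style (5/12/2015) dates.
DATE_PATTERN = %r{\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b}

# Returns one hash per match, holding the matched text plus its character
# offset and length within the given string.
def extract_dates(text)
  results = []
  text.scan(DATE_PATTERN) do
    match = Regexp.last_match # MatchData for the current match
    results << { value: match[0], offset: match.begin(0), length: match[0].length }
  end
  results
end

# Stand-in for a page of text loaded from the database.
page_text = "Council met on 5/12/2015; minutes are due 2015-06-01."
extract_dates(page_text).each { |d| puts "#{d[:value]} @ #{d[:offset]}" }
```

Since this only needs a string in and offsets out, it can run against whatever text the workers read from the db.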
Additionally, based on the git commit log (and the commit you've already noted) you should be able to target the specific collection of documents that will need to be reprocessed.
Chagrined to admit I didn't know we were storing the full text in the db, but of course we are. Now I'm imagining searches without that. :grin:
To the latter point, I'd thought about that, but Nathan pointed out that many docs before that commit were screwy because of the lack of UTF support in the scanner, and as far as we know those were never re-processed. May as well do them all, especially if it's "only" going to be a db job and not a massive S3 crawl, right?
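As an illustration of how missing UTF handling can skew positions (an assumption about the general mechanism, not a confirmed diagnosis of our scanner), Ruby's `StringScanner` reports `pos` in bytes and `charpos` in characters, and the two diverge as soon as a multibyte character precedes the match. The sample text below is made up.

```ruby
require 'strscan'

# Made-up sample: the "é" takes two bytes in UTF-8 but is one character.
text = "caf\u00e9 2015-05-12"
scanner = StringScanner.new(text)
scanner.scan_until(/\d{4}-\d{2}-\d{2}/)

# pos and matched_size are measured in bytes.
byte_start = scanner.pos - scanner.matched_size
# charpos and String#length are measured in characters.
char_start = scanner.charpos - scanner.matched.length

puts "byte offset: #{byte_start}, character offset: #{char_start}"
# Slicing by character offset recovers the matched date.
puts text[char_start, scanner.matched.length]
```

A highlight placed with the byte offset would land one character late here, and the drift grows with every multibyte character earlier in the text.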
@reefdog I uploaded a document last night with dates, and the timeline still shows the highlight offset. Can you confirm the fix?
Can you point me to the doc?
findable in the workspace too: https://www.documentcloud.org/search/Document:%202085522-purcellville-town-council-agenda-may-12-2015
I've been only deploying via deploy:app rather than deploy:full, so the workers were out of date and didn't have the entity position fix. All is in sync, and I tested with this doc (reduced to a few pages for faster processing) and it worked. Tony, can you try again with the original doc? It's been removed.
Definitely good to go now!
Close it out!
Nooo this is the reprocessing task!
Oh sorry
I love this close fight we're having. It's like Christmas lights up there.
So, we've got this: https://github.com/documentcloud/documentcloud/blob/master/app/actions/reprocess_entities.rb
Once all the Benghazi furor has died down we can fire it up.
See #187. As of cdd01c790c194c2762ce6c90e62cafe56476b57f we should finally be positioning timeline entity occurrence highlighting properly, but we'll need to reprocess all docs to regenerate the correct positions.
Since this will be a huge operation that we don't want to repeat, we should consider any other bugs/issues that would require reprocessing, and fix those too.