documentcloud / documentcloud

The DocumentCloud platform
https://www.documentcloud.org
MIT License
424 stars, 162 forks

Reprocess docs to correct timeline positioning bug, etc. #209

Open reefdog opened 9 years ago

reefdog commented 9 years ago

See #187. As of cdd01c790c194c2762ce6c90e62cafe56476b57f we should finally be positioning timeline entity occurrence highlighting properly, but we'll need to reprocess all docs to regenerate the correct positions.

Since this will be a huge operation that we don't want to repeat, we should consider any other bugs/issues that would require reprocessing, and fix those too.

reefdog commented 9 years ago

To clarify, we don't actually have to regenerate any image or text assets from the original PDF; we just have to regenerate date entities from each doc's text file. The sorts of tasks we might want to tackle at the same time are things like https://github.com/documentcloud/documentcloud/pull/143#issuecomment-71495591

knowtheory commented 9 years ago

As a heads up... the date entity recognizers are regexp-based and written in Ruby. We should be able to reprocess all of the texts/snippets by running code on the workers against text from the database (i.e. S3 needn't be involved in reprocessing these).

Additionally, based on the git commit log (and the commit you've already noted) you should be able to target the specific collection of documents that will need to be reprocessed.
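For illustration, here is a minimal sketch of what a regexp-based date pass over stored text could look like, recording the character offset of each occurrence. Everything in it is an assumption made for the example: `DATE_PATTERN`, `extract_date_occurrences`, and the two date formats are hypothetical and are not taken from DocumentCloud's actual recognizers.

```ruby
# Hypothetical sketch: names and patterns are illustrative, not DocumentCloud's code.
require 'date'

# A deliberately small set of formats: "May 12, 2015" and "2015-05-12".
MONTHS = Date::MONTHNAMES.compact.join('|')
DATE_PATTERN = /\b(?:(?:#{MONTHS}) \d{1,2}, \d{4}|\d{4}-\d{2}-\d{2})\b/

# Returns one hash per match: the parsed date plus its character offset
# and length within the text, which is what highlighting would need.
def extract_date_occurrences(text)
  occurrences = []
  text.scan(DATE_PATTERN) do
    match = Regexp.last_match
    begin
      date = Date.parse(match[0])
    rescue ArgumentError
      next # skip matches that Date.parse rejects
    end
    occurrences << { date: date, offset: match.begin(0), length: match[0].length }
  end
  occurrences
end

text = "Agenda for May 12, 2015. Next: 2015-06-09."
extract_date_occurrences(text).each do |occ|
  puts "#{occ[:date]} at offset #{occ[:offset]} (length #{occ[:length]})"
end
```

Because the function only needs a string, the input could come from wherever the text is stored. In this sketch nothing touches the database or S3.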

reefdog commented 9 years ago

Chagrined to admit I didn't know we were storing the full text in the db, but of course we are. Now I'm imagining searches without that. :grin:

To the latter point, I'd thought about that, but Nathan pointed out that many docs before that commit were screwy because of the lack of UTF support in the scanner, and as far as we know those were never reprocessed. May as well do them all, especially if it's "only" going to be a db job and not a massive S3 crawl, right?
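As a generic illustration of how non-ASCII text can throw off stored positions (not a diagnosis of the actual scanner bug), the sketch below compares the character offset and the byte offset of the same date in a UTF-8 string. The sample text and variable names are made up for the example.

```ruby
# Generic illustration only; not taken from the DocumentCloud scanner.
text   = "Café menu, May 12, 2015"
needle = "May 12, 2015"

char_offset = text.index(needle)   # position counted in characters
byte_offset = text.b.index(needle) # position counted in bytes of the UTF-8 encoding

puts "characters: #{char_offset}, bytes: #{byte_offset}"
```

The "é" takes two bytes in UTF-8, so the two offsets differ by one here. A highlighter that mixes the two ways of counting would drift further with every multibyte character that precedes the match.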

anthonydb commented 9 years ago

@reefdog I uploaded a document last night with dates, and the timeline still shows the highlight offset. Can you confirm the fix?

reefdog commented 9 years ago

Can you point me to the doc?

anthonydb commented 9 years ago

https://www.documentcloud.org/documents/2085522-purcellville-town-council-agenda-may-12-2015.html

knowtheory commented 9 years ago

findable in the workspace too: https://www.documentcloud.org/search/Document:%202085522-purcellville-town-council-agenda-may-12-2015

reefdog commented 9 years ago

I've only been deploying via deploy:app rather than deploy:full, so the workers were out of date and didn't have the entity position fix. All is in sync now, and I tested with this doc (reduced to a few pages for faster processing) and it worked. Tony, can you try again with the original doc? It's been removed.

anthonydb commented 9 years ago

Definitely good to go now!

knowtheory commented 9 years ago

Close it out!

reefdog commented 9 years ago

Nooo this is the reprocessing task!

knowtheory commented 9 years ago

Oh sorry

reefdog commented 9 years ago

I love this close fight we're having. It's like Christmas lights up there.

knowtheory commented 9 years ago

So, we've got this: https://github.com/documentcloud/documentcloud/blob/master/app/actions/reprocess_entities.rb

Once all the Benghazi furor has died down we can fire it up.