MuckRock / muckrock

MuckRock's source code - Please report bugs, issues and feature requests to info@muckrock.com
https://www.muckrock.com
GNU Affero General Public License v3.0

Allow searching through DocumentCloud documents #5

Open morisy opened 9 years ago

morisy commented 9 years ago

We should offer a way for users to search through our documents on DocumentCloud. They offer a really nice API of their own, but I think it'd probably be better for us to pull the raw text data into our backend and allow searching through it on our main search interface.
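As a rough illustration of the "pull the raw text into our backend" option, here is a minimal sketch of a keyword index over document text. It assumes the text has already been fetched and is held in a plain dict; the names `tokenize`, `build_index`, and `search` are hypothetical, and nothing here reflects the DocumentCloud API or MuckRock's actual search backend.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs whose text contains it.

    `documents` is a hypothetical {doc_id: raw_text} dict standing in for
    text that has already been pulled into the backend.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

A real deployment would more likely hand this job to a full-text search engine; the sketch only shows the shape of the data that "searching the raw text" implies.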

mitchelljkotler commented 8 years ago

Concern: if we are moving raw emails to files due to size issues in the DB, do we want to have all the OCR data in the database?
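One possible middle ground, sketched below under assumptions, is to keep the OCR text in files and store only a path in the database, the same way the raw emails would be handled. The table name `document_text`, the helper names, and the use of SQLite are all hypothetical and chosen only to keep the example self-contained.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_db():
    """Create an in-memory database with a hypothetical document_text table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document_text ("
        "doc_id TEXT PRIMARY KEY, text_path TEXT NOT NULL)"
    )
    return conn


def store_ocr_text(conn, text_dir, doc_id, text):
    """Write the OCR text to a file and record only its path in the DB."""
    path = Path(text_dir) / f"{doc_id}.txt"
    path.write_text(text, encoding="utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO document_text (doc_id, text_path) VALUES (?, ?)",
        (doc_id, str(path)),
    )
    conn.commit()
    return path


def load_ocr_text(conn, doc_id):
    """Look up the stored path and read the text back; None if unknown."""
    row = conn.execute(
        "SELECT text_path FROM document_text WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return None
    return Path(row[0]).read_text(encoding="utf-8")


def demo():
    """Round-trip one document through a temporary directory."""
    conn = make_db()
    with tempfile.TemporaryDirectory() as text_dir:
        store_ocr_text(conn, text_dir, "doc-1", "scanned page text")
        return load_ocr_text(conn, "doc-1")
```

This keeps the database small, but the text is then not searchable from SQL, so a separate search index would still be needed.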

morisy commented 8 years ago

I think so? Does it make sense to split it off into a separate database? Not storing or using data just for the sake of having less data seems like a bad general policy if having it would be useful, but I get the concern about giant DBs.

mitchelljkotler commented 4 years ago

Now that we have merged with DocumentCloud, we should find a way to directly search our DocumentCloud documents from the main MuckRock search.

An anonymous user directly requested the ability to search the OCRed text of the documents. He noticed that the direct PDF links are disallowed in robots.txt, so Google does not OCR and index them. We block them because we want people to visit the site for context instead of going to the PDFs directly. I'm not sure whether there is a better way to integrate this with Google, so that it indexes the OCRed text and searches for those terms lead to the request pages.
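To make the robots.txt behavior concrete, here is a small sketch using Python's standard `urllib.robotparser`. The rules, paths, and URLs in the usage example are made up for illustration and are not MuckRock's actual robots.txt; `is_crawlable` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: block direct file downloads, allow everything else.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /files/
"""
```

With these example rules, `is_crawlable(EXAMPLE_ROBOTS, "Googlebot", "https://www.example.com/files/report.pdf")` is false, while a request page such as `https://www.example.com/requests/some-request/` stays crawlable. That is why OCRed text would need to appear on the request page itself for a crawler to index it.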