datatogether / webapp

Web application to allow users to add content metadata about crawled resources
https://archivers.co/
GNU Affero General Public License v3.0

Automate pulling metadata from PDF fields #33

Open dcwalk opened 7 years ago

dcwalk commented 7 years ago

From the April 18 work sesh: "The 'What Climate Change Means to [STATE]' series is 50 simple PDF fact sheets with pretty good metadata already present. I'm manually using pdfinfo, which can't be right."
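One way to stop running pdfinfo by hand is to loop over the files and parse its `Key: value` output. The following is only a minimal sketch, assuming `pdfinfo` (from poppler-utils) is on the PATH; the `fact_sheets` directory and the function names are hypothetical and not part of this repo.

```python
import subprocess
from pathlib import Path


def parse_pdfinfo(text: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pdf_metadata(path: Path) -> dict:
    """Run pdfinfo on one file and return its parsed fields."""
    result = subprocess.run(
        ["pdfinfo", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pdfinfo(result.stdout)


def main(directory: str = "fact_sheets") -> None:
    # "fact_sheets" is a placeholder for wherever the 50 PDFs live.
    for pdf in sorted(Path(directory).glob("*.pdf")):
        meta = pdf_metadata(pdf)
        print(pdf.name, meta.get("Title"), meta.get("Author"), sep="\t")


if __name__ == "__main__":
    main()
```

Which fields come back depends on what each PDF actually carries, so `meta.get(...)` is used rather than assuming `Title` or `Author` is always present.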

dcwalk commented 7 years ago

Tools for metadata extraction:

- command-line: pdfinfo
- docs: the doc author within Word, but also another CLI tool
- this resource: http://www.forensicswiki.org/wiki/Document_Metadata_Extraction
- exiftool (another command-line thing) claims to be able to read metadata from Word DOCX files: http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/OOXML.html
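If exiftool turns out to be the better fit, its `-json` option prints one JSON object per file, which is easier to consume than text output. This is a rough sketch, assuming `exiftool` is installed and on the PATH. The helper names and the default tag names in `wanted` are illustrative guesses, not verified against any of the crawled files.

```python
import json
import subprocess


def select_fields(record: dict, wanted=("Title", "Creator", "CreateDate")) -> dict:
    """Keep only the tags of interest; the default tag names are assumptions."""
    return {name: record[name] for name in wanted if name in record}


def exiftool_metadata(path: str) -> dict:
    """Run `exiftool -json` on one file and return its tag dictionary."""
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    records = json.loads(result.stdout)  # a JSON array, one object per file
    return records[0] if records else {}
```

The same call works for PDF and DOCX inputs, so `select_fields(exiftool_metadata(path))` could feed one code path for both formats, assuming the tags of interest exist in each.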