Automate pulling metadata from PDF fields

datatogether / webapp

Web application to allow users to add content metadata about crawled resources

https://archivers.co/

GNU Affero General Public License v3.0


Automate pulling metadata from PDF fields #33

Open dcwalk opened 7 years ago

dcwalk commented 7 years ago

From the April 18 work sesh: "The 'What Climate Change Means to [STATE]' series are 50 simple PDF fact sheets with pretty good metadata already present. I'm manually using pdfinfo, which can't be right."

Additional gist: https://gist.github.com/scruss/d07545f8d71ed7ef9ca064d8e3075626
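One way the manual pdfinfo step might be automated is sketched below. It assumes the Poppler `pdfinfo` binary is on the PATH and that its output is the usual one `Key: value` pair per line. The function names (`parse_pdfinfo`, `pdf_metadata`, `write_csv`) are made up for illustration and are not part of the webapp or the linked gist.

```python
import csv
import subprocess
from pathlib import Path


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_metadata(path):
    """Run pdfinfo on one file (assumes pdfinfo is installed)."""
    result = subprocess.run(
        ["pdfinfo", str(path)], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


def write_csv(folder, out):
    """Write one CSV row per *.pdf in folder, one column per field seen."""
    rows = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        row = {"File": pdf.name}
        row.update(pdf_metadata(pdf))
        rows.append(row)
    # Collect column names in first-seen order; files may have different fields.
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Calling `write_csv("factsheets", sys.stdout)` (with `import sys`, and `factsheets` standing in for wherever the 50 PDFs live) would print a table that could be pasted into the metadata form or loaded by a later import step.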

dcwalk commented 7 years ago

Tools for metadata extraction:

- command-line: pdfinfo
- docs: the doc author within Word, but also another CLI tool
- this resource: http://www.forensicswiki.org/wiki/Document_Metadata_Extraction
- exiftool (another command-line thing) claims to be able to read metadata from Word DOCX files: http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/OOXML.html
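If exiftool turns out to handle both the PDFs and the DOCX files, a single wrapper might cover both. The sketch below assumes `exiftool` is installed and uses its `-json` option, which prints a JSON array with one object per file, each carrying a `SourceFile` key. The helper names are hypothetical, and which tags actually come back depends on the file.

```python
import json
import subprocess


def parse_exiftool_json(text):
    """Map each SourceFile to a dict of its remaining tags."""
    records = json.loads(text)
    return {
        rec["SourceFile"]: {k: v for k, v in rec.items() if k != "SourceFile"}
        for rec in records
    }


def exiftool_metadata(paths):
    """Run exiftool once over all paths (assumes exiftool is installed)."""
    result = subprocess.run(
        ["exiftool", "-json", *map(str, paths)],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if exiftool reports a failure
    )
    return parse_exiftool_json(result.stdout)
```

Passing all files in one call avoids starting exiftool once per document, which matters little for 50 fact sheets but adds up on larger crawls.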