deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Document titles #530

Open kolaente opened 2 months ago

kolaente commented 2 months ago

Some document types, such as PDF and the various office formats, store a title in their metadata. This title is separate from the filename.

It would be awesome if textract could also extract that title. python-pptx, for example, already provides a way to retrieve the subject.
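
To make the request concrete, here is a rough sketch of how a title could be read from the office formats using only the Python standard library. It is not part of textract: the function name `extract_ooxml_title` is hypothetical, and it assumes the file is an OOXML package (.docx, .pptx, .xlsx) whose core properties live at the conventional path `docProps/core.xml`. PDF and legacy formats would need a different approach.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for <dc:title> in OOXML core properties.
DC_NS = "http://purl.org/dc/elements/1.1/"
# Conventional location of the core properties part. Assumed here rather than
# resolved through the package relationships.
CORE_PROPS_PATH = "docProps/core.xml"


def extract_ooxml_title(path_or_file):
    """Return the metadata title of an OOXML file, or None if it has none.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        try:
            core_xml = archive.read(CORE_PROPS_PATH)
        except KeyError:
            # The package has no core properties part at the assumed path.
            return None

    root = ET.fromstring(core_xml)
    title = root.find(f"{{{DC_NS}}}title")
    if title is None or not (title.text or "").strip():
        return None
    return title.text.strip()
```

Something like `extract_ooxml_title("slides.pptx")` would then return the title string, or `None` when no title is set, so a caller could fall back to the filename.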