chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika-Python does not parse the metadata from PDF #350

Closed Apurv3377 closed 1 year ago

Apurv3377 commented 3 years ago

Sorry for such a general issue, but I have been trying hard to extract metadata (author, title, abstract) from a PDF using the Tika-Python client. Unfortunately, it does not extract any of this data under the metadata tag. Is there anything I am missing?

Input PDF link

Here is my code

import tika
from tika import parser
from dicttoxml import dicttoxml
from xml.dom.minidom import parseString

tika.initVM()
# Parse the PDF; the result is a dict that includes 'metadata' and 'content'.
parsed = parser.from_file('247.tar_1710.11035.gz_MTforGSW_black.pdf')
# Serialize only the metadata dict to XML and pretty-print it.
xml = dicttoxml(parsed['metadata'], custom_root='PDF', attr_type=False)
dom = parseString(xml)
print(dom.toprettyxml())

Metadata Output

<?xml version="1.0" ?>
<PDF>
    <Author/>
    <Content-Type>application/pdf</Content-Type>
    <Creation-Date>2020-05-30T02:21:14Z</Creation-Date>
    <Keywords/>
    <Last-Modified>2020-05-30T02:21:14Z</Last-Modified>
    <Last-Save-Date>2020-05-30T02:21:14Z</Last-Save-Date>
    <PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</PTEX.Fullbanner>
    <X-Parsed-By>
        <item>org.apache.tika.parser.DefaultParser</item>
        <item>org.apache.tika.parser.pdf.PDFParser</item>
    </X-Parsed-By>
    <key name="X-TIKA:content_handler">ToTextContentHandler</key>
    <key name="X-TIKA:embedded_depth">0</key>
    <key name="X-TIKA:parse_time_millis">53</key>
    <key name="access_permission:assemble_document">true</key>
    <key name="access_permission:can_modify">true</key>
    <key name="access_permission:can_print">true</key>
    <key name="access_permission:can_print_degraded">true</key>
    <key name="access_permission:extract_content">true</key>
    <key name="access_permission:extract_for_accessibility">true</key>
    <key name="access_permission:fill_in_form">true</key>
    <key name="access_permission:modify_annotations">true</key>
    <key name="cp:subject"/>
    <created>2020-05-30T02:21:14Z</created>
    <creator/>
    <date>2020-05-30T02:21:14Z</date>
    <key name="dc:creator"/>
    <key name="dc:format">application/pdf; version=1.5</key>
    <key name="dc:subject"/>
    <key name="dc:title"/>
    <key name="dcterms:created">2020-05-30T02:21:14Z</key>
    <key name="dcterms:modified">2020-05-30T02:21:14Z</key>
    <key name="meta:author"/>
    <key name="meta:creation-date">2020-05-30T02:21:14Z</key>
    <key name="meta:keyword"/>
    <key name="meta:save-date">2020-05-30T02:21:14Z</key>
    <modified>2020-05-30T02:21:14Z</modified>
    <key name="pdf:PDFVersion">1.5</key>
    <key name="pdf:charsPerPage">
        <item>4556</item>
        <item>4652</item>
        <item>4515</item>
        <item>5149</item>
        <item>4856</item>
        <item>4552</item>
        <item>4191</item>
        <item>3190</item>
    </key>
    <key name="pdf:docinfo:created">2020-05-30T02:21:14Z</key>
    <key name="pdf:docinfo:creator"/>
    <key name="pdf:docinfo:creator_tool">LaTeX with hyperref</key>
    <key name="pdf:docinfo:custom:PTEX.Fullbanner">This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</key>
    <key name="pdf:docinfo:keywords"/>
    <key name="pdf:docinfo:modified">2020-05-30T02:21:14Z</key>
    <key name="pdf:docinfo:producer">pdfTeX-1.40.19</key>
    <key name="pdf:docinfo:subject"/>
    <key name="pdf:docinfo:title"/>
    <key name="pdf:docinfo:trapped">False</key>
    <key name="pdf:encrypted">false</key>
    <key name="pdf:hasMarkedContent">false</key>
    <key name="pdf:hasXFA">false</key>
    <key name="pdf:hasXMP">false</key>
    <key name="pdf:unmappedUnicodeCharsPerPage">
        <item>0</item>
        <item>0</item>
        <item>0</item>
        <item>6</item>
        <item>0</item>
        <item>0</item>
        <item>0</item>
        <item>0</item>
    </key>
    <producer>pdfTeX-1.40.19</producer>
    <resourceName>b'247.tar_1710.11035.gz_MTforGSW_black.pdf'</resourceName>
    <subject/>
    <title/>
    <trapped>False</trapped>
    <key name="xmp:CreatorTool">LaTeX with hyperref</key>
    <key name="xmpTPg:NPages">8</key>
</PDF>

A-acuto commented 2 years ago

I just started using Tika and I've stumbled across the same issue. Have you found a way to solve this? Thanks.

chrismattmann commented 1 year ago

Are you sure that the PDF actually has the author attribute set? It's possible that the tool that created the PDF didn't set it, or that it was missing from the environment variables and didn't get passed through.
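One way to check this is to look at which descriptive fields in the parsed metadata are actually populated. The following is only a minimal sketch: the helper `split_metadata` and the `DESCRIPTIVE_KEYS` list are hypothetical names introduced for illustration, the key names are taken from the output posted above, and the sample dict assumes that the empty XML elements in that output correspond to empty strings in `parsed['metadata']`.

```python
# Keys that carry author/title information in the metadata dump posted above.
# This list is an illustrative assumption, not an exhaustive Tika schema.
DESCRIPTIVE_KEYS = [
    "Author", "dc:creator", "meta:author", "pdf:docinfo:creator",
    "title", "dc:title", "pdf:docinfo:title",
]


def split_metadata(metadata, keys=DESCRIPTIVE_KEYS):
    """Return (populated, empty_or_missing) for the requested keys."""
    populated = {}
    empty_or_missing = []
    for key in keys:
        value = metadata.get(key)
        is_blank_string = isinstance(value, str) and not value.strip()
        if value is None or is_blank_string or value == []:
            empty_or_missing.append(key)
        else:
            populated[key] = value
    return populated, empty_or_missing


# Sample values copied from the output above (assumed to be empty strings
# where the XML shows empty elements). In real use, pass parsed['metadata'].
sample = {
    "Author": "",
    "dc:title": "",
    "pdf:docinfo:title": "",
    "pdf:docinfo:creator": "",
    "pdf:docinfo:creator_tool": "LaTeX with hyperref",
    "pdf:docinfo:producer": "pdfTeX-1.40.19",
    "pdf:hasXMP": "false",
}

populated, empty_or_missing = split_metadata(sample)
print("populated:", populated)
print("empty or missing:", empty_or_missing)
print("has XMP packet:", sample.get("pdf:hasXMP"))
```

If every descriptive key comes back empty while fields such as `pdf:docinfo:producer` are filled in, as in the output above, the file itself most likely carries no author or title in its document info, and `pdf:hasXMP` being `false` suggests there is no XMP packet to fall back on either.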

chrismattmann commented 1 year ago

Not enough detail to act on this. Please comment again if you have more detail. Thanks for raising this.