digitalsleuth / peepdf-3

A Python 3 upgrade to Peepdf
GNU General Public License v3.0

Problem with json export #20

Open lesouriciergris opened 6 months ago

lesouriciergris commented 6 months ago

Hi,

While analysing a set of more than 5,000 PDFs, we identified another problem with some of them (sample attached). The sample is a Microsoft Excel file that was modified and exported to PDF.

essai.pdf

With peepdf -j essai.pdf:

[!] Error: Exception while generating the JSON report

Traceback (most recent call last):
  File "/home/crc/.local/lib/python3.10/site-packages/peepdf/peepdf.py", line 332, in main
    jsonReport = getPeepJSON(statsDict, VERSION)
  File "/home/crc/.local/lib/python3.10/site-packages/peepdf/PDFUtils.py", line 668, in getPeepJSON
    ids[idx] = ids[idx].split(f"Version {idx}: ")[1]
IndexError: list index out of range

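The error seems to come from the ID entry: the report contains <id0>Version 1:, whereas the JSON export expects <id0>Version 0:, so the split on "Version 0: " finds no match and index [1] does not exist.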
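A minimal sketch of why the split fails, assuming the JSON exporter receives the same ID string that appears in the XML output below. The helper `strip_version_prefix` is hypothetical and is not part of peepdf; it only illustrates one way to drop the label without assuming the version number matches the list index. It is not the fix that was shipped.

```python
import re

# Hypothetical helper, not part of peepdf: removes a leading "Version N: "
# label for any N, instead of assuming N equals the position in the list.
_VERSION_PREFIX = re.compile(r"^Version \d+: ")


def strip_version_prefix(entry: str) -> str:
    """Return the entry without its leading 'Version N: ' label, if present."""
    return _VERSION_PREFIX.sub("", entry, count=1)


# The ID string as reported for essai.pdf: index 0 in the list, labelled "Version 1".
ids = [
    "Version 1: [ <54B84DA444D598449A57423BE8579728> "
    "<54B84DA444D598449A57423BE8579728> ]"
]

# Reproduce the failure from the traceback: splitting on "Version 0: " finds
# no separator, so the result has a single element and [1] is out of range.
failure = None
try:
    ids[0].split("Version 0: ")[1]
except IndexError as exc:
    failure = exc

# Tolerant alternative: strip whatever version label is present.
cleaned = [strip_version_prefix(entry) for entry in ids]
print(cleaned[0])
```

With this approach an entry that carries no label at all is returned unchanged rather than raising.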

With peepdf -x essai.pdf everything works:

<peepdf_analysis version="3.0.3" url="https://github.com/digitalsleuth/peepdf-3" author="Jose Miguel Esparza and Corey Forman">
  <date>2024-03-25 15:14:52</date>
  <basic>
    <filename>essai.pdf</filename>
    <md5>0e65ea47e9b51b80b22f0dbb2b8b8856</md5>
    <sha1>7c20c50251e7a2789fe7d4dd4607d8f27e0f400a</sha1>
    <sha256>2f3d70ae88d35a17a8d654de8a257c687abce28b6aa376f9c0b96fc913908458</sha256>
    <size>33234</size>
    <id0>Version 1: [ &lt;54B84DA444D598449A57423BE8579728&gt; &lt;54B84DA444D598449A57423BE8579728&gt; ]</id0>
    <detection/>
    <pdf_version>1.7</pdf_version>
    <binary status="true"/>
    <linearized status="false"/>
    <encrypted status="false"/>
    <updates>1</updates>
    <num_objects>30</num_objects>
    <num_streams>5</num_streams>
    <comments>0</comments>
    <errors num="2">
      <error_message>EOL not found</error_message>
      <error_message>No indirect objects found in the body</error_message>
    </errors>
  </basic>
  <advanced>
    <version num="0" type="original">
      <catalog object_id="1"/>
      <info object_id="3"/>
      <objects num="30">
        <object id="1" compressed="false"/>
        <object id="2" compressed="false"/>
        <object id="3" compressed="false"/>
        <object id="4" compressed="false"/>
        <object id="5" compressed="false"/>
        <object id="6" compressed="false"/>
        <object id="7" compressed="false"/>
        <object id="8" compressed="false"/>
        <object id="9" compressed="false"/>
        <object id="10" compressed="true"/>
        <object id="11" compressed="true"/>
        <object id="12" compressed="true"/>
        <object id="13" compressed="true"/>
        <object id="14" compressed="true"/>
        <object id="15" compressed="true"/>
        <object id="16" compressed="true"/>
        <object id="17" compressed="true"/>
        <object id="18" compressed="true"/>
        <object id="19" compressed="true"/>
        <object id="20" compressed="true"/>
        <object id="21" compressed="false"/>
        <object id="22" compressed="true"/>
        <object id="23" compressed="true"/>
        <object id="24" compressed="true"/>
        <object id="25" compressed="true"/>
        <object id="26" compressed="false"/>
        <object id="27" compressed="false"/>
        <object id="28" compressed="false"/>
        <object id="29" compressed="false"/>
        <object id="30" compressed="false"/>
      </objects>
      <streams num="5">
        <stream id="5" xref_stream="false" object_stream="false" encoded="true"/>
        <stream id="21" xref_stream="false" object_stream="true" encoded="true"/>
        <stream id="27" xref_stream="false" object_stream="false" encoded="true"/>
        <stream id="28" xref_stream="false" object_stream="false" encoded="false"/>
        <stream id="30" xref_stream="true" object_stream="false" encoded="true"/>
      </streams>
      <js_objects/>
      <suspicious_elements>
        <triggers>
          <trigger name="/Names">
            <container_object id="13"/>
          </trigger>
        </triggers>
      </suspicious_elements>
      <suspicious_urls/>
    </version>
    <version num="1" type="update">
      <catalog object_id="1"/>
      <info object_id="3"/>
      <objects num="0"/>
      <streams num="0"/>
      <js_objects/>
      <suspicious_elements/>
      <suspicious_urls/>
    </version>
  </advanced>
</peepdf_analysis>
digitalsleuth commented 6 months ago

Hi @lesouriciergris, thanks for identifying this. I'll look into this one as well. Cheers!

digitalsleuth commented 6 months ago

@lesouriciergris, I've identified and resolved the issue, but the fix will only be released once the other issue you raised is resolved.

lesouriciergris commented 6 months ago

Great, thanks a lot.

Good luck for the other issue.

kandji-alex commented 5 months ago

I believe I've run into the same issue, thanks for fixing!

digitalsleuth commented 2 months ago

Hi @kandji-alex, this issue has been identified and is resolved in the upcoming release. I'm currently doing some linting and will release it within the next 24 hours.

Cheers!

digitalsleuth commented 2 months ago

Hi @kandji-alex and @lesouriciergris, these issues are now fixed in the latest release, v4.0.0. Sorry for the delay!

Cheers!