Cisco-Talos / clamav

ClamAV - Documentation is here: https://docs.clamav.net
https://www.clamav.net/
GNU General Public License v2.0

pdf decoding issue (mailto) #1117

Open Sanesecurity opened 7 months ago

Sanesecurity commented 7 months ago

Describe the bug

This may be connected to bug 1109 (which decodes urls fine now)

I've attached a pdf, which is a phishing attempt and has a couple of mailto links in there for gmail accounts

I can't seem to find the decoded mailto links when using --leave-temps

Using the latest ClamAV 1.3.0 rc

mailto_test.pdf

micahsnyder commented 7 months ago

I started looking into this on Friday. I also couldn't find the mailto links outside of peeking at it with a PDF reader. I also tried with pdfalyzer (https://github.com/michelcrypt4d4mus/pdfalyzer) using

rule FindMailto
{
    strings:
        $string1 = "mailto"
        $string2 = "yahoo"

    condition:
        ($string1 or $string2)
}

and running:

pdfalyze ~/Downloads/mailto_test.pdf --yara-file ~/mailto.yar

but it didn't find anything, either.

Do you have any ideas where it may be / how to find it?

Sanesecurity commented 7 months ago

I couldn't find it either. If you click the email address in Adobe Reader, it opens a mail client, so I'm wondering if there is no mailto at all and it relies on Adobe working out that it's an email address and adding the mailto itself. I don't think I could decode the email addresses though.

I've had a few other PDFs in which, for some reason, I can't get a URL to decode properly... attached samples, all go to the same URL but have different hashes... really odd.

Url is h t t ps :/ /respostrong dot com/ONEView

20231218174774.pdf 20231218305187.pdf 20231218362968.pdf 20231218637686.pdf 20231218914907.pdf

micahsnyder commented 6 months ago

I spent a bit more time on this yesterday (mostly on the mailto sample) and still came up empty-handed. I don't know where the email address is stored in the file.

Sanesecurity commented 6 months ago

Thanks for the update. I did some more digging just now and, well, success-ish.

Using a tool called mupdf (available in various flavours at mupdf dot com)...

Interesting news... running the command...

mutool.exe trace mailto_test.pdf

outputs....

    <fill_text colorspace="DeviceRGB" color=".2 .2 .2" ri="1" bp="1" op="0" opm="0" transform="1 0 0 -1 0 841.69">
        <span font="ERYSXX+OpenSans-Bold" wmode="0" bidi="0" trm="9 0 0 9">
            <g unicode="s" glyph="86" x="46.35" y="607.09" adv=".4970703"/>
            <g unicode="i" glyph="76" x="50.822999" y="607.09" adv=".30517579"/>
            <g unicode="h" glyph="75" x="53.567998" y="607.09" adv=".65722659"/>
            <g unicode="a" glyph="68" x="59.480997" y="607.09" adv=".6040039"/>
            <g unicode="m" glyph="80" x="64.91699" y="607.09" adv=".9819336"/>
            <g unicode="z" glyph="93" x="73.745998" y="607.09" adv=".48779298"/>
            <g unicode="a" glyph="68" x="78.129" y="607.09" adv=".6040039"/>
            <g unicode="i" glyph="76" x="83.564998" y="607.09" adv=".30517579"/>
            <g unicode="d" glyph="71" x="86.31" y="607.09" adv=".6328125"/>
            <g unicode="i" glyph="76" x="91.998" y="607.09" adv=".30517579"/>
            <g unicode="3" glyph="22" x="94.743007" y="607.09" adv=".5708008"/>
            <g unicode="3" glyph="22" x="99.873" y="607.09" adv=".5708008"/>
            <g unicode="8" glyph="27" x="105.003" y="607.09" adv=".5708008"/>
            <g unicode="@" glyph="35" x="110.132999" y="607.09" adv=".89697268"/>
            <g unicode="y" glyph="92" x="118.197" y="607.09" adv=".56884768"/>
            <g unicode="a" glyph="68" x="123.309" y="607.09" adv=".6040039"/>
            <g unicode="h" glyph="75" x="128.745" y="607.09" adv=".65722659"/>
            <g unicode="o" glyph="82" x="134.65799" y="607.09" adv=".6191406"/>
            <g unicode="o" glyph="82" x="140.22899" y="607.09" adv=".6191406"/>
            <g unicode="." glyph="17" x="145.79999" y="607.09" adv=".28515626"/>
            <g unicode="c" glyph="70" x="148.36499" y="607.09" adv=".51416018"/>
            <g unicode="o" glyph="82" x="152.991" y="607.09" adv=".6191406"/>
            <g unicode="m" glyph="80" x="158.562" y="607.09" adv=".9819336"/>
        </span>
    </fill_text>

The email address is split into single chars and, well, I don't understand the rest... but at least the email address is in there :)

grep "@" output -B 10 -A 10 to search the output quickly...
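As a rough sketch of how those single-character entries could be stitched back into strings: the Python below assumes the trace text looks exactly like the excerpt above (one `<g unicode="..."/>` element per character, grouped inside `<span>` blocks). The helper name `spans_to_text` and the `SAMPLE` input are made up for illustration and are not part of mupdf or ClamAV.

```python
import html
import re
from typing import List

# Matches one glyph element in the form shown in the trace excerpt.
GLYPH_RE = re.compile(r'<g\s+unicode="([^"]*)"')
# Matches one whole <span ...> ... </span> block (non-greedy).
SPAN_RE = re.compile(r"<span\b.*?</span>", re.DOTALL)


def spans_to_text(trace: str) -> List[str]:
    """Return one reassembled string per <span> block in the trace text."""
    texts = []
    for span in SPAN_RE.findall(trace):
        # Assumption: special characters are written as XML entities.
        chars = [html.unescape(c) for c in GLYPH_RE.findall(span)]
        texts.append("".join(chars))
    return texts


# A few lines copied from the trace output above.
SAMPLE = """
<span font="ERYSXX+OpenSans-Bold" wmode="0" bidi="0" trm="9 0 0 9">
    <g unicode="@" glyph="35" x="110.132999" y="607.09" adv=".89697268"/>
    <g unicode="y" glyph="92" x="118.197" y="607.09" adv=".56884768"/>
    <g unicode="a" glyph="68" x="123.309" y="607.09" adv=".6040039"/>
    <g unicode="h" glyph="75" x="128.745" y="607.09" adv=".65722659"/>
    <g unicode="o" glyph="82" x="134.65799" y="607.09" adv=".6191406"/>
    <g unicode="o" glyph="82" x="140.22899" y="607.09" adv=".6191406"/>
</span>
"""

if __name__ == "__main__":
    for text in spans_to_text(SAMPLE):
        print(text)
```

Filtering the returned strings for "@" would then give the same result as the grep, but with the characters already joined up.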

No idea why ClamAV can't "see" the address

Sanesecurity commented 6 months ago

Forgot to add github code... https://github.com/ArtifexSoftware/mupdf

maybe searching their trace code will help

micahsnyder commented 6 months ago

@Sanesecurity I've been playing with PDF extraction a bit in this commit to try to understand what's missing: https://github.com/Cisco-Talos/clamav/pull/1141/commits/f9644b8b5e9e33c3ce64c91f15fd3477bd7e645e

I also have a local change to force clam to dump all extracted objects and not just specific types. It's letting me observe a little more of the PDF contents as ClamAV sees it. It seems to me that the detail that mupdf is providing is transcoded from PDF into HTML.

For example, if I look for the "841.69" value in clamav's temp files, I see pdf obj 1 0 contains:

<</Contents 8 0 R/Type/Page/Resources<</Font<</F1 2 0 R/F2 5 0 R>>/XObject<</img1 4 0 R/img0 3 0 R>>>>/Annots[6 0 R 7 0 R]/Parent 9 0 R/MediaBox[0 0 595.42 841.69]>>

And if I search for "unicode", I find:

<</Subtype/Type0/Type/Font/BaseFont/FRNXLW+OpenSans/Encoding/Identity-H/DescendantFonts[12 0 R]/ToUnicode 13 0 R>>

That "ToUnicode 13 0" bit tells me they're hiding unicode stuff in obj 13 0. So I went looking there and I see:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TTX+0)
/Ordering (T42UV)
/Supplement 0
>> def
/CMapName /TTX+0 def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
51 beginbfrange
<0003><0003><0020>
<0006><0006><0023>
<0009><0009><0026>
<000a><000a><0027>
<0011><0011><002e>
<0013><0013><0030>
<0014><0014><0031>
<0015><0015><0032>
<0016><0016><0033>
<001d><001d><003a>
<0023><0023><0040>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003c><003c><0059>
<0044><0044><0061>
<0045><0045><0062>
<0046><0046><0063>
<0047><0047><0064>
<0048><0048><0065>
<0049><0049><0066>
<004a><004a><0067>
<004b><004b><0068>
<004c><004c><0069>
<004e><004e><006b>
<004f><004f><006c>
<0050><0050><006d>
<0051><0051><006e>
<0052><0052><006f>
<0053><0053><0070>
<0055><0055><0072>
<0056><0056><0073>
<0057><0057><0074>
<0058><0058><0075>
<0059><0059><0076>
<005c><005c><0079>
<005f><005f><007c>
<0396><0396><0049>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

I don't think this data looks like those unicode glyphs and offsets, but I don't know what it is.
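One possible reading, offered as a sketch and not a confirmed diagnosis: in a ToUnicode CMap, each `beginbfrange` line of the form `<lo><hi><dst>` maps the character codes `lo` through `hi` to Unicode code points starting at `dst`. The glyph numbers in the mupdf trace appear to line up with these codes: glyph 86 is 0x56, and the table above maps `<0056>` to `<0073>`, which is "s". The Python below assumes that correspondence holds, and that every destination is a single 16-bit code point. Note also that the CMap shown is for FRNXLW+OpenSans, while the traced span uses ERYSXX+OpenSans-Bold, so only codes present in both fonts can be checked this way. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable

# One "<lo><hi><dst>" entry inside a beginbfrange ... endbfrange section.
ENTRY_RE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)
SECTION_RE = re.compile(r"beginbfrange(.*?)endbfrange", re.DOTALL)


def parse_bfrange(cmap_text: str) -> Dict[int, str]:
    """Build a code -> character table from the bfrange sections.

    Simplification: array-style destinations ("[<a> <b>]") and
    multi-unit UTF-16 destinations are not handled.
    """
    mapping: Dict[int, str] = {}
    for section in SECTION_RE.findall(cmap_text):
        for lo, hi, dst in ENTRY_RE.findall(section):
            lo_i, hi_i, dst_i = int(lo, 16), int(hi, 16), int(dst, 16)
            for offset, code in enumerate(range(lo_i, hi_i + 1)):
                mapping[code] = chr(dst_i + offset)
    return mapping


def decode(codes: Iterable[int], mapping: Dict[int, str]) -> str:
    """Translate codes to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# A subset of the entries from obj 13 0 above.
CMAP_EXCERPT = """
1 begincodespacerange
<0000><FFFF>
endcodespacerange
8 beginbfrange
<0011><0011><002e>
<0023><0023><0040>
<0044><0044><0061>
<0046><0046><0063>
<004b><004b><0068>
<0050><0050><006d>
<0052><0052><006f>
<005c><005c><0079>
endbfrange
"""

if __name__ == "__main__":
    table = parse_bfrange(CMAP_EXCERPT)
    # Glyph numbers taken from the tail of the mupdf trace.
    print(decode([35, 92, 68, 75, 82, 82, 17, 70, 82, 80], table))
```

If this reading is right, the page content stream holds only these 2-byte codes, which would explain why a plain string search for "mailto" or "yahoo" finds nothing.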

Here are my temp files, if you want to take a peek. I encrypted the zip with password "clamav".

20240115_182231-mailto_test.pdf.1abd937acc.zip

micahsnyder commented 6 months ago

The email address is split into single chars and, well, I don't understand the rest... but at least the email address is in there :)

This is so gross. I hate PDFs.

Sanesecurity commented 6 months ago

Had a look at your clamav extracts in the zip file, drawing a blank :(

Just to add to pdf woes, I can't find the following google link in the attached pdf:

https://www dot google dot com/url?qsa=D&sntz=1&usg=AOvVaw3mfhRcSt97wxts8C4_DPfX

So, looks like some new pdf tools are being used to hide links...

test2.pdf

Sanesecurity commented 6 months ago

Another pdf example, url inside pdf is:

h t t p s ://www.google.com/url?q=https%3A%2F%2Feasyjump-grey . fun%2FB8sy1hCK%23nR70eDNaiDdeYtiRi9r&sa=D&sntz=1&usg=AOvVaw15r19jUlnlIP6zNQynkdza

The latest ClamAV rc2 doesn't seem to decode the url...
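For reference, once a wrapped link like this has been extracted, getting at the real destination is just a matter of reading the percent-encoded `q` parameter. A minimal Python sketch, using a harmless example.com URL in place of the sample and a made-up helper name (`unwrap_google_redirect`); this is not how ClamAV handles it:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(url: str) -> Optional[str]:
    """Return the decoded target of a google.com/url?q=... link, if any."""
    parts = urlsplit(url)
    if parts.hostname not in ("google.com", "www.google.com"):
        return None
    if parts.path != "/url":
        return None
    # parse_qs percent-decodes the values for us.
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else None


if __name__ == "__main__":
    wrapped = (
        "https://www.google.com/url"
        "?q=https%3A%2F%2Fexample.com%2Fpath%23frag&sa=D&sntz=1"
    )
    print(unwrap_google_redirect(wrapped))
```

The hard part in these samples is still finding the wrapped URL inside the PDF in the first place.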

The attached PDF was generated using the reportlab tools, which have a nice bit of code, commented with notes about encryption, that may or may not help:

https://github.com/eduardocereto/reportlab/blob/master/src/reportlab/lib/pdfencrypt.py

urlnotfound.pdf