DFIRFRANKY opened 1 year ago
Duplicate of #801
Adding this link to Didier Stevens' pdf-parser.py: https://blog.didierstevens.com/programs/pdf-tools/ https://didierstevens.com/files/software/pdf-parser_V0_7_8.zip
Actually, I looked into it. While extracting compressed streams from a PDF is pretty easy, actually extracting the text from the PDF is quite complex. The tool Matt linked to above just extracts the streams.
PDFs generally don't contain text directly; instead they consist of drawing commands that place the letters at positions on the page. These commands are encoded in many ways, some obvious and some not. In fact, the letters themselves are sometimes encoded in terms of the font used, so we can sometimes easily read the text and sometimes not.
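To illustrate the distinction, here is a minimal, self-contained sketch. The byte string is a hand-made stand-in for a fragment of a PDF file (not a valid document); the operators inside it are real PDF content-stream syntax. Pulling the FlateDecode stream out and inflating it takes a couple of lines, but what comes out is drawing operators, not readable prose.

```python
import re
import zlib

# Content-stream operators: select a font (Tf), move the text cursor (Td),
# and "show" strings (Tj) at coordinates on the page.
operators = b"BT /F1 12 Tf 72 700 Td (Hel) Tj 20 0 Td (lo) Tj ET"

# Wrap them the way a PDF would: a stream object compressed with Flate.
body = zlib.compress(operators)
pdf_fragment = (
    b"<< /Length " + str(len(body)).encode() + b" /Filter /FlateDecode >>\n"
    b"stream\n" + body + b"\nendstream"
)

# "Extracting the stream" is just find-and-inflate...
m = re.search(rb"stream\r?\n(.*?)\r?\nendstream", pdf_fragment, re.DOTALL)
content = zlib.decompress(m.group(1))
print(content.decode("latin-1"))
# ...but the result is a list of drawing commands, not the page's text.
```

Recovering the actual text from those `Tj` operands is where the real complexity starts.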
There are some commercial solutions to extract text from PDFs, but the open source solutions are quite simplistic and fail frequently (they also seem to be CPU intensive in my testing).
The best open source solutions seem to be in Python at the moment. There are some Go libraries, all based on old code by Russ Cox (for example https://pkg.go.dev/github.com/dslipak/pdf), but in my testing these are really slow and not suitable for use on large numbers of documents.
It may be possible to write something that works some of the time but not all the time.
Hey, this is great work Mike; much appreciated. I have had good experiences with PDFParser; it is also part of the "SANS toolset" and in my opinion the best option available. It might leave some gaps, especially when the text is an image or some obfuscation is applied, but if we are just searching a non-malicious document for a string in the text like "hello my name is John", that might be pretty reliable? I would be happy to do some testing with PDFs and PDFParser (and measure the success rate) if that could help you weigh up whether it is worth implementing.
Can you please link to PDFParser? Is this a different tool? I have not heard of it.
apologies, it is pdf-parser...
Was there any progress on this issue? Scanning within PDFs would be a very useful feature.
Ah, I completely forgot about this. I started writing an artifact that can do a yara scan on PDF files, but I didn't have time to back-test it against a set of maldocs.
https://gist.github.com/scudette/40f49fb64383eed489667ca9fade93f4
I also started writing a blog post about it but I have not gotten around to finishing it. Thanks for reminding me, I will get to it soon :-). Until then, feel free to play with the artifact and comment!
https://github.com/scudette/velociraptor-docs/commit/423facb7dd9af7958cf9d7ac396f91ae3da95385
This is awesome progress, thank you!
Feedback: I tested your query above and it appears to work well! It found the text in my test PDF document, but one caveat was that I had to add "ascii" to the Yara rule, otherwise it returned no results with only "wide".
```
rule X {
  strings:
    $a = "Secret" ascii wide nocase
  condition:
    any of them
}
```
@scudette I did some more testing with your custom artefact and found that I could only get search hits on text in PDF files that I had created myself (i.e. opened Word, added some text, exported to PDF, then searched for it with VR using the Generic.Search.PDF artefact).
I tested some PDFs from other sources (e.g. downloaded books) and I also OCR-scanned a page of text, but unfortunately got no hits for text that was demonstrably there (i.e. confirmed with the Find function in Adobe Reader, and by extracting the text with the Python module PyPDF2).
Are you able to share some of those pdfs?
Sure thing, I’ll dig them out tomorrow.
So I originally downloaded the complete works of Shakespeare in PDF to test it (from https://www.booksfree.org/the-complete-works-of-william-shakespeare-pdf-free-download/).
I found a suitable word, "bluntness", that appears only once, and searched for it using VR (Generic.Search.PDF) but got no hits. I then confirmed that I could find the word in Adobe Reader, and I also ran a search using the PyPDF2 module in Python, which found a hit.
```python
# pip install PyPDF2
from PyPDF2 import PdfReader
import re

# reader = PdfReader("The-Complete-Works-of-William-Shakespeare-booksfree.org_.pdf")
# reader = PdfReader("Lorem ipsum - Scanned OCR.pdf")
reader = PdfReader("Lorem ipsum 13k.pdf")

total_pages = len(reader.pages)
hits = 0
search_phrase = ".*bluntness.*"

for page in range(total_pages):
    page_text = reader.pages[page].extract_text()
    search_match = re.search(search_phrase, page_text, re.IGNORECASE)
    if search_match:
        hits += 1
        print(f"Hit for {search_phrase} on page {page + 1}")
        # print(page_text)

print(f"Summary: {hits} hits for search phrase: {search_phrase}")
print("Finished.")
```
I created a couple of test PDFs, using some generated Lorem Ipsum and the word "bluntness" at the end of the document (in case it was a data size thing).
VR was able to get a hit on the documents created in Word and exported to PDF fine. (The Lorem ipsum 13k document is PDF version 1.7). Lorem ipsum 13k.pdf
I then printed a page of the lorem ipsum text with the word "bluntness" in the middle of the page, and scanned it using NAPS2 with its OCR function. Again, the word can be found using Adobe Reader and the Python PyPDF2 module, but unfortunately not VR.
(Lorem ipsum - printed, then scanned and OCR'd - PDF version 1.4) Lorem ipsum - Scanned OCR.pdf
And for your reference, my Yara rule in VR looks like this:
```
rule X {
  strings:
    $a = "bluntness" ascii wide nocase
  condition:
    any of them
}
```
This is what I was referring to above when I mentioned that the text is not simple to extract from the PDF. If you look at the Velociraptor output for this file, you can see the characters are encoded like this: `<0057> Tj 5 0 Td <004C>`. That `<0057>` is actually a reference into the font dict, so it needs to be decoded before an approximation of the text can be extracted.
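As a toy illustration of what that decoding involves: the hex operands are glyph/CID references that must be mapped through the font's /ToUnicode CMap before any readable text comes out. The mapping below is invented for the example; in a real PDF it is read from the /ToUnicode stream in the font dictionary.

```python
import re

# Hypothetical CID -> character map; a real one comes from the font's
# embedded /ToUnicode CMap, which varies per font and per document.
to_unicode = {0x0057: "t", 0x004C: "i"}

# A show-text fragment like the one in the Velociraptor output above.
content_stream = "<0057> Tj 5 0 Td <004C> Tj"

# Pull out the hex operands and translate each through the map.
cids = [int(h, 16) for h in re.findall(r"<([0-9A-Fa-f]+)>", content_stream)]
text = "".join(to_unicode.get(cid, "\ufffd") for cid in cids)
print(text)  # -> "ti"
```

Without the CMap, the hex values are meaningless, which is why a stream extractor alone cannot recover the page text.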
I guess ultimately we need to ask what we want to get out of parsing PDFs. Do we want to be able to detect things like embedded JS? Embedded URLs (a lot of malware delivers phishing links in PDFs)? Or do we want to be able to extract text?
Each of these features has different use cases and can be difficult to handle properly in all cases. pdf-parser.py is only able to extract and decode streams, which is enough for finding JS or extracting URLs, but not enough to decode text. PyPDF2 is much more fully featured, so we will probably need to add native Go support (and we would have to write it ourselves, since there does not seem to be a Go library out there that is as good).
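For the JS/URL use case, the matching step after stream decoding is simple pattern matching. A rough sketch, with a hand-made byte string standing in for decoded PDF objects (the `/JavaScript`, `/JS`, and `/URI` keys are real PDF dictionary names; the object contents are invented for the example):

```python
import re

# Stand-in for PDF objects after stream extraction and decoding
# (the step a tool like pdf-parser.py performs).
decoded_objects = (
    b"<< /Type /Action /S /JavaScript /JS (app.alert('hi')) >>\n"
    b"<< /Type /Annot /A << /S /URI /URI (http://example.com/phish) >> >>"
)

# Indicator checks: embedded JavaScript and clickable link targets.
has_js = b"/JavaScript" in decoded_objects or b"/JS" in decoded_objects
urls = re.findall(rb"/URI\s*\((.*?)\)", decoded_objects)

print(has_js, [u.decode() for u in urls])
```

This is why stream extraction alone is "enough" for the malware-indicator use cases but not for full text search.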
Love the work on this! Sorry, I have been on the sidelines, testing the new artefact on my 1TB of test files, and had some server hardware issues with that.
In my experience, the artefact picks up some documents, which is great, and the extracted text in the logs is great as well. It does seem to miss a few documents, but based on my 1TB of data it is very hard to put my finger on which files and why; it takes days to run.
I confirm an export from Word to PDF gets picked up.
As for the use cases, I see two:
1. An eDiscovery / DLP focus. This would be the most important use case. For instance, from a government perspective: are any "secret" documents located on a "lower classification" system? Another example is where a system is compromised, either by an external attacker or an insider: what data is/was sitting on it? The ability to scan office files for certain strings should include PDFs, to provide some assurance around that. Of course this can all be done manually, or via other tools, but having it built into Velociraptor would be amazing.
2. An incident response focus. Probably a harder one to define with an example. As you mentioned, it could be enough to know whether there is malware in a PDF or not; a (mass) string search would not be required for that. I can imagine, though, that it could be used from a Threat Intelligence perspective: a certain actor is trying to get in, or has gotten in, and some previous attack methods have been shared; let's see if the same PDF with that same URL inside it was sent to your organisation. This is a weaker example than the one above, so I'd say the main reason to have the functionality would be eDiscovery / DLP.
I had a quick test of existing Go modules and they have the same issue: as you say @scudette, they work for basic PDF files but fall down with more complex ones.
Other solutions are commercial products, e.g. UniDoc
As @DFIRFRANKY says, I think those would be the two main use cases.
I have no doubt that you could work it out eventually, but at what cost, and whether it's worth it for the project.
For a particular use case that Velociraptor provides, yara scanning files for strings remotely without the need to download the file from the target, it would be great to see the addition of a PDF parser (such as "pdfparser").
Currently, it is easy to scan doc, docx, zip, txt, xls, and xlsx files for a string (for instance "SECRET"). This means that if a large number of endpoints needs to be scanned for a leaked file or a suspicious document, it is a matter of starting a new hunt with a few clicks. This does leave a large gap, however: PDF files. These need to be parsed before they can be scanned with yara. Having a pdfparser-type option for yara scanning in Velociraptor would greatly enhance the functionality of the suite.