HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Create a Reader for .doc, .docx, and .pdf files #46

Open GStefanowich opened 1 year ago

GStefanowich commented 1 year ago

There are currently two unimplemented template files: one for reading .doc and .docx files, and one for reading .pdf files.

Reading Word documents and PDF files is a bit less straightforward than reading plaintext files. A library that is compatible with this project's license may be advisable.

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/PdfReader.cs#L2-L11

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/DocumentReader.cs#L2-L11

jaimevisser commented 1 year ago

Docx is XML, so e-mail addresses should be readable using a plaintext search.

GStefanowich commented 1 year ago

@jfbourke Do you want to split your OpenDocumentTextReader.cs to support .docx files?

The only difference I can see at first glance is .odt reads from content.xml and .docx reads from word/document.xml.

If you don't want to, I can, but I want everyone to have a chance to contribute and your implementation works well :smiley:
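
For illustration, the content.xml versus word/document.xml difference mentioned above can be sketched as follows. Both .odt and .docx files are ZIP archives, so the body XML has to be read out of the archive before searching it. This is a rough Python sketch, not the project's C# code: the `BODY_PART` table, the `extract_emails` function, and the simplified e-mail pattern are all assumptions made for the example.

```python
import io
import re
import zipfile

# Hypothetical lookup table: which archive member holds the main body text.
BODY_PART = {
    ".odt": "content.xml",
    ".docx": "word/document.xml",
}

# Deliberately simple pattern for illustration; not the project's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TAG_RE = re.compile(r"<[^>]+>")


def extract_emails(data: bytes, extension: str) -> list[str]:
    """Return addresses found in the body part of an .odt or .docx archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read(BODY_PART[extension]).decode("utf-8")
    # Replace tags with spaces so markup is never matched as part of an address.
    # Limitation: an address split across several text runs would be missed.
    text = TAG_RE.sub(" ", xml)
    return EMAIL_RE.findall(text)
```

With this shape, supporting another ZIP-based format would mostly mean adding an entry to the lookup table, which is presumably why one reader can be split to cover both formats.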

jfbourke commented 1 year ago

@GStefanowich I've split the implementation to support the two file formats, and added .pptx as well.
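
One way .pptx differs from the other two formats is that its text is spread over one XML part per slide (ppt/slides/slide1.xml, ppt/slides/slide2.xml, and so on) rather than a single body file. The Python sketch below only shows how those parts might be enumerated. It is not taken from the merged implementation, and the `slide_parts` helper is a hypothetical name.

```python
import io
import re
import zipfile

# Slide parts live under ppt/slides/; relationship files under _rels/ are skipped.
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_parts(data: bytes) -> list[str]:
    """Return the slide XML member names of a .pptx archive in numeric order."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        matches = [SLIDE_RE.fullmatch(name) for name in archive.namelist()]
    # Sort by slide number so that slide10 comes after slide2.
    found = sorted((int(m.group(1)), m.group(0)) for m in matches if m)
    return [name for _, name in found]
```

Each returned member could then be read and searched the same way as a single body file.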

GStefanowich commented 1 year ago

@jfbourke Looks great!