Create a Reader for .doc, .docx, and .pdf files

GStefanowich commented 1 year ago

There are currently two unimplemented template files: one for reading .doc and .docx files, and one for reading .pdf files.

[x] .odt files
[ ] .doc files
[x] .docx files
[ ] .pdf files
[ ] .ppt files
[x] .pptx files

Reading Word documents and PDF files is a bit less straightforward than reading plaintext files. A library that is compatible with this project's license may be advisable.

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/PdfReader.cs#L2-L11

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/DocumentReader.cs#L2-L11

jaimevisser commented 1 year ago

Docx is XML (packaged in a ZIP container), so e-mail addresses should be readable with a plaintext search once the XML is extracted.

GStefanowich commented 1 year ago

@jfbourke Do you want to split your OpenDocumentTextReader.cs to support .docx files?

The only difference I can see at first glance is that .odt reads from content.xml, while .docx reads from word/document.xml.

If you don't want to, I can, but I want everyone to have a chance to contribute, and your implementation works well :smiley:
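For anyone following along, here is a rough sketch of the idea discussed above: both formats are ZIP containers, and the only thing that changes is which XML part holds the main text. This is written in Python purely for illustration and is not the project's C# implementation; the names `MAIN_PART`, `EMAIL_RE`, and `extract_emails` are hypothetical, and the regular expression is a deliberately simple stand-in rather than the matcher this project uses.

```python
import io
import re
import zipfile

# Main text part inside each container, as described in this thread.
MAIN_PART = {
    ".odt": "content.xml",
    ".docx": "word/document.xml",
}

# Simplified pattern for illustration only (an assumption, not the project's matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(source, extension):
    """Return e-mail addresses found in the main XML part of an .odt/.docx file.

    `source` may be a file path or a binary file-like object.
    """
    part = MAIN_PART[extension.lower()]
    with zipfile.ZipFile(source) as archive:
        # The part is decompressed here; after that it is plain XML text.
        xml_text = archive.read(part).decode("utf-8")
    return EMAIL_RE.findall(xml_text)


if __name__ == "__main__":
    # Build a tiny in-memory archive that mimics the layout of a .docx file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", "<doc><t>Mail alice@example.com</t></doc>")
    buffer.seek(0)
    print(extract_emails(buffer, ".docx"))
```

One caveat with searching the raw XML this way: word processors can split a single run of text across several XML elements, so an address may not always appear contiguously in the markup. A real reader would probably want to concatenate the text nodes before matching.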

jfbourke commented 1 year ago

@GStefanowich I've split the implementation to support the two file formats, and added .pptx as well.

GStefanowich commented 1 year ago

@jfbourke Looks great!

HaveIBeenPwned / EmailAddressExtractor

Create a Reader for .doc, .docx, and .pdf files #46