Open GStefanowich opened 1 year ago
A .docx file is a ZIP archive of XML parts, so e-mail addresses should be readable with a plaintext search once the archive is unpacked.
@jfbourke Do you want to split your `OpenDocumentTextReader.cs` to support `.docx` files?
The only difference I can see at first glance is that `.odt` reads from `content.xml` and `.docx` reads from `word/document.xml`.
If you don't want to, I can, but I want everyone to have a chance to contribute and your implementation works well :smiley:
@GStefanowich I've split the implementation to support the two file formats, and added `.pptx` as well.
@jfbourke Looks great!
There are currently two unimplemented template files (for reading `.doc` and `.docx`, and `.pdf` files, respectively).

- `.odt` files
- `.doc` files
- `.docx` files
- `.pdf` files
- `.ppt` files
- `.pptx` files

Reading Word documents and PDF files is a bit less straightforward than reading plaintext files. A library that is compatible with the license for this project may be advisable.
https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/PdfReader.cs#L2-L11
https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/9d9efa8cd4d74600db1aa3d63006bad760be81f0/src/Objects/Readers/DocumentReader.cs#L2-L11