ArchitecturalKnowledgeAnalysis / EmailDatasetBrowser

Application for interacting with datasets produced by the EmailIndexer.
MIT License

Added general solution for HTML detection #2

Closed: wmeijer221 closed this 2 years ago

wmeijer221 commented 2 years ago

While browsing the engine, I noticed that a number of emails containing HTML went unrecognized (e.g. because they started with <!DOCTYPE html> instead of <html>), which made them effectively unreadable. I took this blog's solution (link) and added it to the tool. The build still works and, as far as I'm aware, all HTML is now loaded properly.
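
To illustrate the difference, here is a minimal Python sketch (not the project's code, and not the blog's actual solution, which isn't reproduced here). It contrasts a prefix check that only accepts bodies starting with `<html>` against a more general check that looks for any tag-like markup. The function names and the regular expression are assumptions made up for this example.

```python
import re

# Hypothetical general-purpose pattern (an assumption, not the blog's regex):
# matches a doctype declaration, an HTML comment opener, or any
# opening/closing/self-closing tag such as <p>, </div> or <br/>.
_MARKUP_PATTERN = re.compile(
    r"<!DOCTYPE\s+html|<!--|</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>",
    re.IGNORECASE,
)


def starts_with_html_tag(body: str) -> bool:
    """Prefix check: only recognizes bodies that begin with <html."""
    return body.lstrip().lower().startswith("<html")


def contains_html_markup(body: str) -> bool:
    """General check: recognizes any body containing tag-like markup."""
    return _MARKUP_PATTERN.search(body) is not None


doctype_mail = "<!DOCTYPE html>\n<html><body><p>Hi</p></body></html>"
print(starts_with_html_tag(doctype_mail))   # the prefix check misses this mail
print(contains_html_markup(doctype_mail))   # the general check catches it
```

The general check no longer depends on how the body happens to start, which is the property this change is after.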

wmeijer221 commented 2 years ago

I just realized that this solution might be overly aggressive in what it deems HTML (most likely, most plain text will be considered HTML as well). I don't know the project well enough to tell whether this will introduce bugs. I've been using my version for a while now and haven't experienced any issues, so it seems to be okay(?).
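
As a rough illustration of that concern, the Python sketch below shows how a permissive "any tag" pattern can flag plain text that merely contains angle brackets, such as a mail discussing `List<String>`. It also shows one possible way to tighten the check by only counting known HTML tag names. Everything here (the patterns, the `KNOWN_TAGS` set, the function names) is hypothetical and not taken from the project or the blog.

```python
import re

# Permissive pattern (assumed for illustration): anything shaped like a tag.
ANY_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")

# Captures the name of each tag-like token so it can be checked against a list.
TAG_NAME = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[\s/>]")

# Small, deliberately incomplete set of tag names used only for this sketch.
KNOWN_TAGS = {"html", "head", "body", "div", "p", "br", "table", "span", "a"}


def permissive_is_html(body: str) -> bool:
    """Treats any tag-shaped token as evidence of HTML."""
    return ANY_TAG.search(body) is not None


def stricter_is_html(body: str) -> bool:
    """Only treats tokens whose name is a known HTML tag as evidence of HTML."""
    return any(name.lower() in KNOWN_TAGS for name in TAG_NAME.findall(body))


plain_mail = "The method returns List<String> instead of an array."
html_mail = "<!DOCTYPE html><html><body>Hi</body></html>"

print(permissive_is_html(plain_mail))  # flags the plain-text mail as HTML
print(stricter_is_html(plain_mail))    # does not flag it
print(stricter_is_html(html_mail))     # still recognizes the real HTML mail
```

Whether the extra strictness is worth it depends on how plain text looks when it is rendered as HTML in the viewer.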

andrewlalis commented 2 years ago

I'm all for improving the ability to detect HTML content. Plain text being rendered as HTML should not be an issue, since the view is configured to use a monospace font. In the future, I may consider adding a more general-purpose web rendering engine for a more complete solution, but as long as we can capture the majority of HTML emails like this, I'm happy to just use my browser to view the rest.