Better inline charset handling for get_html

LetThereBeDwight commented 4 days ago

Closes #171.

Changes proposed in this pull request

Aligning get_html matches with get_text to ensure that we match on the same types of content type specifications between the two, except for the obvious actual content type part differences.

andrewtimberlake commented 4 days ago

This change shouldn’t be needed, see my comment on #171

LetThereBeDwight commented 4 days ago

I've commented as well - the issue isn't with building it this way, the issue is in parsing it when it comes in this way via raw spec.

andrewtimberlake commented 4 days ago

Please give us an example in raw spec that causes this problem. The parser should break it into the content type and params

LetThereBeDwight commented 4 days ago

Okay apologies I now see what was happening (kinda) - I'm actually on a different version (0.3.1) since that's the latest in hex.

When testing like this against 0.3.1 in test/mail/parsers/rfc_2822_test.exs (apologies for the long string, using the multi-line string resulted in different parser results with really bad headers - this is what I get directly when pulling the email from s3).

  test "content-type html with explicit charset" do
    message =
      parse_email("Return-Path: <from@from.com>\r\nDate: Tue, 09 Jul 2024 12:53:58 +0000 (UTC)\r\nFrom: \"from name\" <from@from.com>\r\nReply-to: \"reply to name\" <reply_to@prokeep.com>\r\nTo: to@to.com\r\nSubject: SR0012084728 - RFQ\r\nMIME-version: 1.0\r\nContent-type: text/html; charset=UTF-8\r\nContent-transfer-encoding: quoted-printable\r\nReturn-Path: from@from.com\r\n\r\n<html><head></head><body><p>Test</p></body></html>")

    IO.inspect message, label: "Message"
    IO.inspect message.headers["content-type"], label: "Content-Type"
  end

We get the output:

Message: %Mail.Message{
  headers: %{
    "content-transfer-encoding" => "quoted-printable",
    "content-type" => "text/html; charset=UTF-8",
    "date" => ~U[2024-07-09 12:53:58Z],
    "from" => {"from name", "from@from.com"},
    "mime-version" => "1.0",
    "reply-to" => "\"reply to name\" <reply_to@prokeep.com>",
    "return-path" => "from@from.com",
    "subject" => "SR0012084728 - RFQ",
    "to" => ["to@to.com"]
  },
  body: "<html><head></head><body><p>Test</p></body></html>",
  parts: [],
  multipart: false
}
Content-Type: "text/html; charset=UTF-8"

SO - apologies for mislabling the version, which made me then go and try to update and realize that 0.4.0 isn't actually on Hex - is there a timeline for that or have you stopped pushing to hex entirely?

Granted I wouldn't have noticed these differences between get_text and get_html otherwise BUT I still maintain that those two parsers should align in any case, and the initial version of using put_content_type is also valid and generates a valid Mail.Message and email.

Understood if you don't wanna take on this PR as a result as there are other facilities here that work around the issue, but I do pretty firmly believe those two matches should align.

TLDR:

1) My apologies on not recognizing the version I had in the system I was using that ran into this issue. 2) When will 0.4.0 be pushed to hex (if ever)? 3) Do you believe that it's valid for the functions get_html and get_text to be misaligned in the way they match content type specifications? If not, this PR is valid in it's current state and that usage of put_content_type shows the issue in a rather valid way to my eye. If so, then feel free to close.

andrewtimberlake commented 3 days ago

Thanks @LetThereBeDwight I have published 0.4.0 You are right, get_text was inconsistent and I’ve changed that in #174 along with a number of doctests to check for consistency in our examples

DockYard / elixir-mail

Better inline charset handling for get_html #172

Changes proposed in this pull request