Use BeautifulSoup instead of lxml for MJML/HTML parsing

caseyjhol commented 2 years ago

Fixes #6
Fixes #10
Fixes #14
Fixes #19

Some of the upstream alignment tests were failing. I passed the source MJML back into mjml-cli locally and rebuilt the HTML, and I then updated the "-expected.html" files (some style tags were missing). All tests then passed.

FelixSchwarz commented 2 years ago

That's quite an impressive pull request which could really fix a lot of issues, thank you so much. I didn't have time to really dig into this but I'll try to split my review into smaller pieces to understand what has changed and why. Switching to Beautiful Soup might be a good solution to resolving the issues around the pretty strict lxml parsing and these issues are actually the single biggest core issue of this project right now so highly valuable.

I was wondering about failing upstream tests. So you are saying that after your Beautiful Soup migration some upstream tests were failing which did work before? And these tests started to pass once you regenerated the ...-expected.html files?

Which version of mjml (upstream) did you use?

I would love to structure the changes in a way that switching the parser does not require any changes to the tests.

What is your confidence in the correctness of your changes? Did you compare the HTML output manually against upstream HTML? Do you use the new code already in production? Once I got a chance to read up on your code I can plug it into a customer's test suite to ensure the new code did not change in unexpected ways.

Last but not least: Thank you for tackling this high-value issue. Looking forward to merging your code.

caseyjhol commented 2 years ago

Looks like it was only the HTML for the "missing features" tests that needed to be regenerated (after I moved them alongside our other test data). I used MJML v4.12.0 to generate the new HTML. These tests were already failing previously, of course, so it's possible issues with the HTML flew under the radar.

Related, I wonder if there's some way we could call mjml-cli from our tests to generate the expected HTML on the fly instead. It would be nice too, because we could then ensure we were always matching the same version of MJML in all of our tests.

I manually compared some HTML, but I mostly depended on checking if the tests passed. The changes are actually fairly simple so I'm reasonably confident there shouldn't be any issues (that didn't already exist previously).

FelixSchwarz commented 2 years ago

These tests were already failing previously, of course, so it's possible issues with the HTML flew under the radar.

That makes sense.

Related, I wonder if there's some way we could call mjml-cli from our tests to generate the expected HTML on the fly instead. It would be nice too, because we could then ensure we were always matching the same version of MJML in all of our tests.

Generating it on the fly could bring quite a few issues and makes the test suite more brittle but I have a script to regenerate all HTML files. I could add this to the git repo.

FelixSchwarz commented 2 years ago

Ok, I went through your changes and I think they are fine. I noticed that the ...-expected.html in the missing_functionality were out of date so I updated them in a separate commit fb83576. Unfortunately github's web editor can not handle that.

Would you mind rebasing your commit on top of main + squashing everything into a single commit? Bonus points for explaining the motivation the rationale of that change (something about the tickets you mentioned in your initial comment) in the commit comment.

As a final validation I'd like to plug your changes into a customer's CI pipeline to see if anything breaks there. I'll do that on Monday. After that we're ready to merge :-)

caseyjhol commented 2 years ago

@FelixSchwarz Alright - should be all set!

FelixSchwarz commented 2 years ago

Everything looked fine in our CI pipeline so I think there should not be too many surprises :-)

One thing which might be affected is that BeautifulSoup is more leniant when it comes to XML syntax so maybe we should add some (optional) XSD validation to ensure the inputs not completely off-base.

With your changes now in main there are enough new features/bug fixes to warrant a release. I plan on doing that shortly once I completed a few tasks:

24
23
22

Thank you very much, the BeautifulSoup changes much less intrusive than I expected and really solves a lot of issues I had with lxml.

FelixSchwarz / mjml-python

Use BeautifulSoup instead of lxml for MJML/HTML parsing #21

24

23

22