Improve typing - Githubissues

xoudini commented 4 months ago

The goal of this branch is to add complete typing to the package, for two main reasons:

improve developer experience for consumers of this package
catch bugs (while working on this package) that a type checker would've caught

I've started to do some exploration within the code base, with the initial goal to pass mypy checks in non-strict mode, and the ultimate goal to satisfy mypy in strict mode. There's a GitHub Actions workflow for tracking progress in that regard.

Some downsides:

This restricts the minimum Python version to 3.8, although this shouldn't be a big issue, considering Python 3.7 is EOL as of mid-2023.
In order to support Python 3.8, the typing is restricted to the 3.8 style, which is a bit more limited than with newer versions, but that's not really a problem.
The dependency on dotmap had to be replaced with standard dicts, at least temporarily, in order to avoid a bunch of # type: ignore comments. It may be possible to re-introduce it (or another similar package) once the type check passes.

Let me know if you're interested in me pursuing this goal!

FelixSchwarz commented 4 months ago

Thank you very much for the PR. I think this is a good addition.

Personally I use static typing only very selectively so my personal preference would be to use a non-strict mode just so we can benefit from some type checks without "cluttering" the whole code base.

This restricts the minimum Python version to 3.8, although this shouldn't be a big issue, considering Python 3.7 is EOL as of mid-2023.

Personally I'm not paying too much attention to CPython's EOL dates. To me the RHEL life cycle is way more important. For me, Python 2.7 just went EOL as RHEL 7 support ended a week ago. Python 3.6 is still the default for RHEL 8 and is supported until 2029. The good news is that RHEL 9 ships CPython 3.9 and we do not have RHEL 8 in production so I think I'm fine with a Python 3.9 baseline.

Disclaimer: These are just my personal preferences. If there are other contributors willing to invest their own time, I'm open to drop older versions/adapt my personal preferences. @caseyjhol let me know if you like to keep support for Python 3.6-3.8.

The dependency on dotmap had to be replaced with standard dicts, at least temporarily, in order to avoid a bunch of # type: ignore comments. It may be possible to re-introduce it (or another similar package) once the type check passes.

Could you explain a bit more why we have to drop dotmap? That is something that will cause some issues for our integration with mjml so I would really like to keep it. I would really like to keep attribute access for returned instances. (Side note: To me this duck typing is a big strength of Python which I don't want to give up.)

Please note I also got push access for dotmap so we can improve the code there if necessary.

xoudini commented 4 months ago

Personally I use static typing only very selectively so my personal preference would be to use a non-strict mode just so we can benefit from some type checks without "cluttering" the whole code base.

I wouldn't say strict mode necessarily "clutters" the code base (based on my experience from previous projects I've added typing support to), but I'll for sure start with adhering to non-strict mode, and we can see where to go from there then.

Why I normally prefer strict mode is that every function must be typed, which itself adds a bit more confidence in the implementations of functions.

This restricts the minimum Python version to 3.8, although this shouldn't be a big issue, considering Python 3.7 is EOL as of mid-2023.

Personally I'm not paying too much attention to CPython's EOL dates. To me the RHEL life cycle is way more important. For me, Python 2.7 just went EOL as RHEL 7 support ended a week ago. Python 3.6 is still the default for RHEL 8 and is supported until 2029. The good news is that RHEL 9 ships CPython 3.9 and we do not have RHEL 8 in production so I think I'm fine with a Python 3.9 baseline.

Disclaimer: These are just my personal preferences. If there are other contributors willing to invest their own time, I'm open to drop older versions/adapt my personal preferences. @caseyjhol let me know if you like to keep support for Python 3.6-3.8.

Ah, gotcha. I guess we can try to keep supporting Python 3.6-3.7, and just not run mypy for these versions. I believe typing_extensions>=4.0.0 requires Python 3.8 though, so we'd have to conditionally install typing_extensions>=3.0.0,<4.0.0 for older runtimes.

I'm not super familiar with setuptools (I'm mostly using Poetry nowadays), so I might require some assistance with making sure the project files are configured correctly.

Could you explain a bit more why we have to drop dotmap? [...] I would really like to keep attribute access for returned instances. (Side note: To me this duck typing is a big strength of Python which I don't want to give up.)

I dropped it at least for the time being, in order to better grasp what is accessed from where (i.e. avoiding a bunch of Any attributes), but in the end I don't mind adding it back, if deemed necessary.

That is something that will cause some issues for our integration with mjml so I would really like to keep it.

Could you expand a bit on this? On the current version of the branch I replaced the return type of the main mjml_to_html to a custom namedtuple to preserve attribute access — you'll see that all tests are passing and no tests were modified thus far.

Please note I also got push access for dotmap so we can improve the code there if necessary.

Nice! The main issue here was that dotmap is entirely untyped. Initially I started investigating how much work it would be to add type hints to dotmap as well, but ended up foregoing that plan for the time being, since:

the current version still supports Python 2.7 (and many of the method calls are still based on 2.7 conventions)
I'm not sure typing would add significant value (although making the main DotMap class generic over TypedDict could potentially be a nice addition)

xoudini commented 4 months ago

As a side note, I've also experimented with rewriting the parser from scratch using only Python built-ins:

pyexpat instead of bs4 for MJML parsing
- this also significantly reduces the amount of code required for the parser
xml.etree.ElementTree for the json_to_xml function
- the ElementTree.tostring function has a built-in HTML mode

Right now both of these functions in this package mostly mirror the reference implementation, but could be greatly simplified if manual parsing and tree building were avoided (i.e. reducing custom string interpolation which may or may not be correct). If you don't mind, I'd replace the existing implementations for these functions as well.

FelixSchwarz commented 4 months ago

just a quick reply in between meetings:

Ah, gotcha. I guess we can try to keep supporting Python 3.6-3.7, and just not run mypy for these versions.

If Python 3.6 can be supported easily, it would be nice but don't spend much time on that. As I mentioned for my own use cases a baseline of 3.9 is good enough now that RHEL 7 is EOL.

the current version still supports Python 2.7 (and many of the method calls are still based on 2.7 conventions)

No, dotmap 1.3.25 started using f-strings so they dropped Python 2.7 compatibility. (That was before I had push access.) Don't worry too much about old versions here.

As a side note, I've also experimented with rewriting the parser from scratch using only Python built-ins:

Better/simpler parsing would be welcome for sure. There are some edge cases which bit me in the past before we used beautifulsoup but I'm absolutely open to simpler implementations.

xoudini commented 4 months ago

If Python 3.6 can be supported easily, it would be nice but don't spend much time on that. As I mentioned for my own use cases a baseline of 3.9 is good enough now that RHEL 7 is EOL.

Alright: I'll work on this as if 3.8 is the minimum version (it's no overhead over 3.9), and we can circle back to 3.6 later, if deemed necessary.

No, dotmap 1.3.25 started using f-strings so they dropped Python 2.7 compatibility. (That was before I had push access.) Don't worry too much about old versions here.

You're right, didn't notice that, and I simply looked at the 2.7-esque function names like iteritems and viewitems, which are simply items as of Python 3.0. I can take another look at adding types to dotmap once I'm a bit further along with this branch!

Better/simpler parsing would be welcome for sure. There are some edge cases which bit me in the past before we used beautifulsoup but I'm absolutely open to simpler implementations.

It'd be interesting to hear about some of these edge cases, if you happen to remember any? I think the potential vulnerabilities listed here don't really apply for valid MJML, and I had to do some escaping to pass the tests for the official parser anyways.

FelixSchwarz commented 4 months ago

It'd be interesting to hear about some of these edge cases, if you happen to remember any?

I think we had two issues with the initial parser:

How do you plan to proceed with this PR? Do you want to add more changes or are you waiting on more detailed feedback? Also we can consider splitting up the PR and merging the improvements incrementally.

xoudini commented 4 months ago

I think we had two issues with the initial parser:

HTML entities such us or © are not supported

This seems to be a non-issue with my implementation, or at least I added for the the following MJML, and it appears to parse correctly:

<mjml>
  <mj-body>
    <mj-section>
      <mj-button href="/">a&nbsp;b</mj-button>
      <mj-text> x&copy;y </mj-text>
    </mj-section>
  </mj-body>
</mjml>

Whitespace gets stripped between text & link inside

...and this is something I've been thinking about, but currently all whitespace is unaffected, so e.g.

<mj-text>
  this is a string
</mj-text>

...would be parsed to a node with the inner content:

"\n  this is a string\n"

This is something that the HTML rendering engine handles in its own way anyways, and I'm not entirely convinced that it would be worth it to implement an entire "whitespace minimiser" anyways (since even this doesn't apply to all elements, such as <pre>).

How do you plan to proceed with this PR? Do you want to add more changes or are you waiting on more detailed feedback?

I've been experimenting with a "from-scratch" minimal implementation in order to figure out the best approach to achieving the best type coverage, so it'll take at least to the end of this week for me to have some MVP of that which could eventually lead to a fully typed version of this package.

Also we can consider splitting up the PR and merging the improvements incrementally.

Definitely: I could open a smaller PR with stubs for the public interface within the next few days, at least to resolve #50, and after that gradually continue on this PR to improve the internal type coverage, if that sounds reasonable to you?

caseyjhol commented 4 months ago

@xoudini Nice work here! Thanks for the contribution.

I'm okay with dropping support of older versions of Python.

What are the advantages to moving away from BeautifulSoup? If there are performance benefits, I'd be curious to see if we could quantify them.

My vote would be to continue using the well-maintained BeautifulSoup as a dependency instead of adding another thing for us to have to maintain ourselves. If we're going to attempt to reinvent the wheel, I want to make sure there's enough advantage to doing so.

xoudini commented 4 months ago

I'm okay with dropping support of older versions of Python.

Great! Then I'll treat Python 3.8 as the minimum baseline for now.

What are the advantages to moving away from BeautifulSoup? If there are performance benefits, I'd be curious to see if we could quantify them.

Mainly just one less dependency to worry about, but certainly it may have its place. My (potentially naïve) view on the topic is that HTML (and even more so MJML, apart from whatever is inside mj-raw elements) should be a subset of XML, and thus the builtin XML parser(s) in Python should be able to handle it without significant issues.

Performance considerations haven't been on the top of my list thus far, but I could certainly do some benchmarking later.

My vote would be to continue using the well-maintained BeautifulSoup as a dependency instead of adding another thing for us to have to maintain ourselves. If we're going to attempt to reinvent the wheel, I want to make sure there's enough advantage to doing so.

Agreed. Based on previous discussion, I'll keep the parser as-is for now, and we can re-open that can of worms in a separate PR (exclusively opened for that purpose).

FelixSchwarz / mjml-python

Improve typing #57