Open mgorny opened 1 month ago
@mgorny Thanks for raising this.
The old html5lib module is deprecated and should not be used in new releases, hence the shift to html5lib-modern.
Indeed, it is unfortunate that pip will overwrite it when you install an older version of html5lib. What else in your project needs the old version of html5lib?
Edit: I just noticed it's pyspelling.
The good news is it shouldn't make any difference to the operation of rdflib. The main difference between html5lib and html5lib-modern is the removal of the six dependency. But pyspelling reintroduces that module to your dependency tree anyway. So html5lib v1.1 should still work without issue even in the RDFLib v7.1.0 release.
I was just giving an example. I'm packaging for Gentoo, so html5lib is required by cppman, sigil, xml2rfc, beautifulsoup4, bleach, mechanize, pandas, pyspelling, weasyprint, hydrus, gpodder, buku, soupsieve, sphinxcontrib-htmlhelp, sphinxygen, textX, qtwebengine.
Unlike plain pip, Gentoo's package manager does not allow for conflicting files, so it is entirely impossible to install both packages. If you install html5lib, rdflib will fail because of a missing dependency. If you install html5lib-modern, everything else will fail.
Are you planning to maintain html5lib-modern going forward? If so, please request the package name transfer on PyPI (they've recently started processing them) and update the fork's metadata to clearly indicate it is a fork and where it is located.
The reason for moving away from html5lib to html5lib-modern was actually for the benefit of distro packagers who were trying to package RDFLib. We had two different packagers reach out to ask us to remove the dependency on html5lib because of its subsequent dependency on six. The six library is deprecated and is no longer being included in a variety of distros. The main goal here is to get six out of the dependency tree of as many Python packages as possible.
Thus, html5lib-modern was designed to be a completely drop-in replacement for html5lib with no code changes needed in the user's Python code (that's why the module name aliases).
Are you planning to maintain html5lib-modern going forward?
I'm unsure of that at this stage. I created html5lib-modern specifically for the purpose of supporting rdflib, so packagers can still continue to ship rdflib without the dependency on deprecated html5lib and six. However, after I mentioned my fork over on the html5lib issue tracker, two distros (Nix, and one other) have already replaced html5lib with html5lib-modern in their package tree. It won't be seeing any enhancements or new features, but I will accept security patches if any are identified.
I'm afraid it can't be a drop-in replacement for as long as it uses a different package name in metadata. For distributions, this means that either we have to patch it to change the package name, thereby making it truly compatible, and start patching packages that specify html5lib-modern in install_requires, or go the other way around: leave the new name in, and patch all the old packages to expect html5lib-modern. Either way is suboptimal.
I believe the other distros have removed old html5lib entirely, and done something like Install-Provides: html5lib when packaging html5lib-modern.
Sure, distro package-level dependencies are not the problem. However, the Python package metadata is, and pip does not support any kind of name aliasing, so patching individual packages is the only possible solution.
@mgorny See below for my solution to resolve this. But first, I admit I'm confused about some aspects of the conversation above, and I need you to explain the exact issue more clearly for me.
At first I thought you were describing a pip installation problem (you demonstrated that the package is overwritten by the old html5lib when both are in the dependency tree). That is understandable and expected; the package name was designed to alias.
Then you said it's actually a distro packaging issue, because Gentoo cannot have two different packages that install the same Python module (that's fair enough).
When I explained that other distros have completely replaced html5lib with html5lib-modern to provide the same dependency, you replied that it is not actually a distro packaging problem, but a pip problem.
"so patching individual packages is the only possible solution."
^ That is the part I don't understand. ^
If html5lib-modern provides the html5lib module, and the other packages rely on any package that provides html5lib, and the other packages use import html5lib in their code, what parts of which packages need to be rewritten?
The solution:
I can easily update the html5lib-modern codebase to use a different module name. I can publish it as html5lib-modern with version v2.0 to indicate a breaking change. (This plan is actually already mentioned in the html5lib-modern README.rst file.)
I'll have to communicate that change to packagers who have already picked up html5lib-modern as a replacement for old html5lib, so they know how to deal with that.
Ah, I'm sorry for the confusion. The problem is roughly that there are two layers to this.
First, there's the Python packaging layer, represented by .dist-info files. If you install html5lib, you get html5lib*.dist-info. Then Python software can find out whether html5lib is installed by doing e.g.:
>>> import importlib.metadata
>>> importlib.metadata.version("html5lib")
'1.1'
When you install html5lib-modern, you get html5lib_modern*.dist-info instead:
>>> import importlib.metadata
>>> importlib.metadata.version("html5lib-modern")
'1.2'
>>> importlib.metadata.version("html5lib")
Traceback (most recent call last):
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 397, in from_name
    return next(cls.discover(name=name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 889, in version
    return distribution(distribution_name).version
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 862, in distribution
    return Distribution.from_name(distribution_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 399, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for html5lib
So note that programs written to expect html5lib metadata won't work here. This will also trigger pip check integrity test failures, and it may cause entry points made using pkg_resources not to work (they actually verify installed dependencies, and refuse to run when they aren't found installed via a metadata check).
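For illustration, a program that checks for a dependency by distribution name might look roughly like the sketch below. The helper name installed_version is hypothetical; only importlib.metadata.version and PackageNotFoundError come from the standard library. With only html5lib-modern installed, the lookup for html5lib would come back empty even though import html5lib still succeeds.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if no metadata is found.

    Hypothetical helper for illustration; it looks up *distribution* metadata
    (the .dist-info directory), not the importable module.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The result depends on which of the two distributions is installed.
    for name in ("html5lib", "html5lib-modern"):
        print(name, "->", installed_version(name))
```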
Now, plain pip installs aren't really affected at this point because pip doesn't currently check for file conflicts, i.e. it lets you install both packages simultaneously, with one overwriting the other while the metadata for both remains. But as you can imagine, that's ugly and not something you should rely on (it may be fixed in the future, and if not in pip, then at least in uv, which is fast replacing pip).
The second layer is distribution packages. The vast majority of distribution packaging solutions actually do check for file conflicts, and therefore don't permit installing the html5lib and html5lib-modern packages simultaneously. Naturally, this also implies that the Python-level metadata for only one of these packages is installed.
So if we package html5lib plain, there's no html5lib_modern*.dist-info and packages having it in install_requires have unsatisfied dependencies. If we package html5lib-modern, the opposite applies. The only way here for us is to actually hack the package around to install fake metadata for both packages, and hope it works out somehow. But that's ugly.
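The name mismatch can be sketched as follows. Distribution names are compared after normalization (PEP 503: runs of "-", "_" and "." collapse to a single "-", case-insensitively), so html5lib_modern and html5lib-modern are the same name, but neither is html5lib. The function names here are hypothetical, and real installers also evaluate version specifiers and environment markers, which this sketch ignores.

```python
import re


def normalize(name):
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def satisfied_by_name(requirement_name, installed_names):
    """Check whether any installed distribution has the required (normalized) name.

    Hypothetical, simplified check: versions and markers are not considered.
    """
    wanted = normalize(requirement_name)
    return any(normalize(installed) == wanted for installed in installed_names)


if __name__ == "__main__":
    installed = ["html5lib_modern"]  # only the fork's metadata is present
    print(satisfied_by_name("html5lib-modern", installed))
    print(satisfied_by_name("html5lib", installed))
```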
I can easily update html5lib-modern codebase to use a different module name. I can publish it as html5lib-modern with version v2.0 to indicate a breaking change. (This plan is actually already mentioned in the html5lib-modern README.rst file.)
Well, that's one option. However, the disadvantage is that users would then have to have both variants (i.e. two almost identical packages) installed for a long time (possibly forever, given that some of the packages needing html5lib are pretty much dead, and they're unlikely to switch). It would also mean we'd have to keep patching html5lib, which is also suboptimal.
It would be much better if you took over the original name and published this package as plain html5lib. Or alternatively, find someone who would do that and incorporate your changes. I've seen some past discussion around that on the html5lib issue tracker, so there are certainly other people wanting to revive html5lib.
@mgorny Thanks for the detailed explanation, that is clearer to me now. It didn't occur to me that programs would be checking the metadata of their installed dependencies like that, but I can see why it's done.
It would be much better if you took over the original name and published this package as plain html5lib.
I am absolutely not going to do that. html5lib is used by millions of users across many thousands of projects; it has a long history and is one of the most popular Python libraries. I already maintain 5 different Python modules that take up too much of my time, and I'm not going to take on that heritage and that level of burden.
html5lib-modern was created as a drop-in replacement for html5lib, for RDFLib to use so we could get the v7.1.0 release out the door without six in our dependency tree, as requested by distro packagers.
You're right that users over on the html5lib issue tracker have been talking about reviving the project and releasing a new version for years, but there has been no movement on that yet.
Personally I think a major dependent library (e.g. beautifulsoup4 or pandas) should take it over; it can't be good for them to have an unmaintained and deprecated library in their dependency tree.
The html5lib-modern fork installs a html5lib package with metadata named html5lib-modern. As a result, this package is not recognized by pip as satisfying a html5lib dependency. If one installs rdflib and then another package requiring html5lib, pip will overwrite the html5lib Python package with html5lib == 1.1, but at the same time preserve the metadata claiming that html5lib-modern == 1.2 is installed.
To reproduce, you can create a fresh venv and try e.g.: